Some questions for the Data Science interview #01
Hey guys, this post might be helpful if you are preparing for interviews in the data science domain.
What do you think the distribution of time spent per day on Facebook looks like? What metric would you use to describe the distribution?
In terms of the distribution of time spent per day on Facebook (FB), one can imagine there may be two groups of people on Facebook:
1. People who scroll quickly through their feed and don’t spend too much time on FB.
2. People who spend a large amount of their social media time on FB.
Based on this, we can make a claim about the distribution of time spent on FB: with two such groups it is plausibly bimodal, or at least strongly right-skewed. The metrics to describe our distribution can be:
1) Centre (mean, median, mode)
2) Spread (standard deviation, interquartile range)
3) Shape (skewness, kurtosis, unimodal or bimodal)
4) Outliers (Do they exist?)
Here is a sample answer for your interview:
If we assume that a person is visiting the Facebook page, there is a probability p that she will leave the page after each unit of time t has passed.
With probability p, her visit lasts 1 unit of time. With probability (1−p)p, her visit lasts 2 units of time. With probability (1−p)²p, her visit lasts 3 units of time, and so on. The probability mass function of this distribution is therefore P(T = t) = (1−p)^(t−1)·p for t = 1, 2, 3, ..., and hence we can say this is a geometric distribution.
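To make the sample answer concrete, here is a minimal Python sketch of that geometric model. The value p = 0.2 is just an assumed number for illustration, not something measured on Facebook.

```python
def geometric_pmf(t: int, p: float) -> float:
    """P(visit lasts exactly t time units) if the user leaves with probability p per unit."""
    if t < 1:
        return 0.0
    return (1 - p) ** (t - 1) * p


p = 0.2  # assumed per-unit probability of leaving (illustrative only)
ts = range(1, 200)  # truncate the infinite support; the remaining tail is negligible
pmf = [geometric_pmf(t, p) for t in ts]

total = sum(pmf)                                   # should be very close to 1
mean_visit = sum(t * q for t, q in zip(ts, pmf))   # should be very close to 1 / p

print(round(total, 6), round(mean_visit, 4))
```

The mean of this distribution is 1/p, so a small p means long visits and a long right tail, which fits the "heavy user" group described above.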
What is the difference between Skewness and kurtosis?
Skewness is the characteristic of a frequency distribution that describes its asymmetry about the mean. Kurtosis, on the other hand, describes the relative pointedness of the distribution compared with the standard bell curve.
Skewness arises when deviations from the mean are greater on one side than the other, i.e. the distribution has one tail heavier than the other. It is used to indicate the shape of the distribution of data. Conversely, kurtosis indicates the flatness or peakedness of the frequency distribution curve and reflects how heavy the tails are, i.e. how prone the distribution is to outliers.
Skewness is an indicator of lack of symmetry, i.e. the left and right sides of the curve are unequal with respect to the central point. As against this, kurtosis indicates whether the data are peaked or flat relative to a normal distribution.
Skewness shows how much, and in which direction, the values deviate from symmetry around the mean. In contrast, kurtosis is traditionally explained as how tall and sharp the central peak is.
In a skewed distribution, the curve is stretched towards either the left or the right side. When the plot is stretched more towards the right, it denotes positive skewness, where typically mode < median < mean. When the plot is stretched more towards the left, it denotes negative skewness, where typically mean < median < mode. Positive excess kurtosis means the distribution is more peaked, with heavier tails, than the normal distribution, whereas negative excess kurtosis means it is flatter, with lighter tails.
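If you are asked to compute these two numbers by hand, a rough sketch with the moment-based (population) formulas could look like this. The two small datasets are made up for illustration.

```python
from statistics import mean


def skewness(xs):
    """Moment-based skewness: third central moment divided by variance ** 1.5."""
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m3 = sum((x - m) ** 3 for x in xs) / n
    return m3 / m2 ** 1.5


def excess_kurtosis(xs):
    """Fourth central moment divided by variance ** 2, minus 3 (the normal baseline)."""
    m = mean(xs)
    n = len(xs)
    m2 = sum((x - m) ** 2 for x in xs) / n
    m4 = sum((x - m) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0


symmetric = [1, 2, 3, 4, 5]                   # made-up, perfectly symmetric
right_skewed = [1, 1, 2, 2, 2, 3, 3, 4, 10]   # made-up, one large value on the right

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric sample has zero skewness and negative excess kurtosis (it is flatter than a normal curve), while the single large value in the second sample pushes the skewness positive.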
How would you build the algorithm for a type-ahead search for Netflix?
A recommendation algorithm for this sounds like a job for an RNN (recurrent neural network), which is not easy to set up at all. We can instead start with a much simpler approach, a prefix matching algorithm, and keep expanding it until we have something on par with an RNN.
We use a lookup in a database table, the prefix table. The table takes an input string, which is your prefix, and outputs the suggested strings or suffixes. Example: which completions should the prefix “hello” map to?
Scoping is very important here: decide whether to support fuzzy matching and context matching, for example when the user types in a different language. If you type “big”, that could output any number of suffixes, for example Big Shot, Big Sky or The Big Year.
In an existing search corpus of billions of searches, what proportion of the time do people who type “the big” actually click on Big Shot, and what proportion of the time do they click on Big Sky? You can simply keep a table of every search prefix that has ever been typed on Netflix and map each one to the titles most commonly clicked afterwards. Boom! That’s your prefix matching recommendation algorithm for type-ahead search.
Context matching is also important here: the input becomes a string plus a user profile with a number of features, and the output is still a string. You can reduce the user profile with k-means clustering into groups such as “John Stamos fan” and “not a John Stamos fan”. If you are a John Stamos fan and you type “the big”, Netflix recommends Big Shot every time, and otherwise something else. The user profile can also be reduced to a suitable dimensionality first.
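The plain prefix-table idea (without the user-profile part) can be sketched in a few lines of Python. The click log below is invented for illustration, and the function names are hypothetical. A real system would also need fuzzy matching, ranking and a proper store instead of an in-memory dict.

```python
from collections import Counter, defaultdict


def build_prefix_table(click_log):
    """Map every prefix of each typed query to a Counter of the titles clicked."""
    table = defaultdict(Counter)
    for typed, clicked in click_log:
        typed = typed.lower()
        for end in range(1, len(typed) + 1):
            table[typed[:end]][clicked] += 1
    return table


def suggest(table, prefix, k=3):
    """Return up to k titles most often clicked after typing this prefix."""
    counts = table.get(prefix.lower())
    if not counts:
        return []
    return [title for title, _ in counts.most_common(k)]


# Invented (typed query, clicked title) pairs standing in for the search corpus.
click_log = [
    ("the big", "Big Shot"),
    ("the big", "Big Shot"),
    ("the big", "Big Sky"),
    ("the big y", "The Big Year"),
]

table = build_prefix_table(click_log)
print(suggest(table, "the big"))
```

To add the context matching described above, one option would be to keep one such table per user cluster and look up the table for the cluster the current user belongs to.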
How to detect an anomaly in a distribution?
Anomaly detection is a technique used to identify unusual patterns that do not conform to expected behavior. It can be considered the thoughtful process of determining what is normal and what is not. Anomalies are also referred to as outliers, novelties, noise, exceptions, and deviations. Simply, anomaly detection is the task of defining a boundary around normal data points so that they can be distinguishable from outliers.
Anomalies can be broadly categorized as:
Point anomalies: A single instance of data is anomalous if it’s too far off from the rest. Business use case: Detecting credit card fraud based on “amount spent.”
Contextual anomalies: The abnormality is context-specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.
Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone is trying to copy data from a remote machine to a local host unexpectedly, an anomaly that would be flagged as a potential cyber-attack.
The different types of methods for anomaly detection are as follows:
Simple Statistical Methods
The simplest approach to identifying irregularities in data is to flag the data points that deviate from common statistical properties of the distribution, including the mean, median, mode, and quantiles. For example, a data point can be treated as anomalous when it deviates from the mean by more than a certain number of standard deviations. On time-series data, however, tracking the mean isn’t exactly trivial, as it’s not static. Thus, a rolling window is used to compute the average across the data points; it is intended to smooth short-term fluctuations and highlight long-term ones.
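A rough sketch of the rolling-window idea is shown below. The window size, the threshold of 3 standard deviations and the series itself are all assumptions chosen for illustration.

```python
from statistics import mean, pstdev


def rolling_anomalies(series, window=5, threshold=3.0):
    """Flag indices whose value is more than `threshold` standard deviations
    away from the mean of the `window` points that precede it."""
    flagged = []
    for i in range(window, len(series)):
        history = series[i - window:i]
        mu = mean(history)
        sigma = pstdev(history)
        if sigma == 0:
            # Flat history: treat any change at all as a deviation.
            if series[i] != mu:
                flagged.append(i)
            continue
        if abs(series[i] - mu) / sigma > threshold:
            flagged.append(i)
    return flagged


# Made-up series with a single spike at index 7.
series = [10, 11, 10, 12, 11, 10, 11, 50, 11, 10, 12, 11]
print(rolling_anomalies(series))  # [7]
```

Note that right after the spike the window itself contains the outlier, which inflates the rolling standard deviation for a while. In practice people often use the median and MAD instead, or exclude flagged points from the window.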
Machine Learning-Based Approaches for Anomaly Detection:
(a) Clustering-Based Anomaly Detection:
This approach relies on unsupervised learning and assumes that similar data points tend to belong to similar groups or clusters, as determined by their distance from local centroids.
The k-means algorithm can be used to partition the dataset into a given number of clusters. Any data points that lie far from every cluster centroid are considered anomalies.
(b) Density-based anomaly detection:
This approach is based on the k-nearest neighbors algorithm. It assumes that normal data points occur in dense neighborhoods while abnormal points lie far away from them. To measure how close the nearest neighbors of a data point are, you can use Euclidean distance or a similar measure, depending on the type of data you have.
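One simple way to turn this into a score is the mean distance to the k nearest neighbors: the larger it is, the more isolated the point. Below is a brute-force sketch with invented 2-D points and an assumed k = 2. It is fine for toy data but too slow for large datasets.

```python
import math


def knn_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores


# Invented data: four points close together and one far away.
points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (5.0, 5.0)]
scores = knn_scores(points)

# The index with the largest score is the most isolated point.
outlier = max(range(len(points)), key=lambda i: scores[i])
print(outlier, [round(s, 2) for s in scores])
```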
(c) Support Vector Machine-Based Anomaly Detection:
A support vector machine is another effective technique for detecting anomalies. One-Class SVMs have been devised for cases in which one class only is known, and the problem is to identify anything outside this class.
This is known as novelty detection, and it refers to the automatic identification of unforeseen or abnormal phenomena, i.e. outliers, embedded in a large amount of normal data.
Anomaly detection helps to monitor any data source, including user logs, devices, networks, and servers. This helps to rapidly identify zero-day attacks as well as unknown security threats.
What is cross-validation and why would you use it?
Cross-Validation is a statistical method of evaluating and comparing learning algorithms by dividing data into two segments: one used to learn or train a model and the other used to validate the model.
In typical cross-validation, the training and validation sets must cross over in successive rounds such that each data point has a chance of being validated.
The basic form of cross-validation is k-fold cross-validation. Other forms of cross-validation are special cases of k-fold cross-validation or involve repeated rounds of k-fold cross-validation.
In k-fold cross-validation, the data is first partitioned into k equally (or nearly equally) sized segments or folds. Subsequently, k iterations of training and validation are performed such that within each iteration a different fold of the data is held-out for validation while the remaining k − 1 folds are used for learning.
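The fold bookkeeping can be sketched in plain Python as follows. This is only the index splitting, with no model attached, and it does not shuffle the data, which you would normally do first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, validation_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size


folds = list(k_fold_indices(10, 3))
for train, validation in folds:
    print("train:", train, "validation:", validation)
```

Every index lands in exactly one validation fold, so each data point gets validated once and is used for training in the other k − 1 rounds.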
So, why do we use cross-validation?
It allows us to get more metrics and draw important conclusions about our algorithm and our data.
It helps to tune the hyperparameters of a given machine learning algorithm, to get good performance according to some suitable metric.
It mitigates overfitting while building a pipeline of models, because the second model's input will be real predictions on data that our first model has never seen before.
K-fold cross-validation also tends to reduce the bias of the performance estimate, as most of the data is used for fitting in each round, and to reduce its variance, as every data point is eventually used for validation.
What is multicollinearity and how to detect it in a dataset?
Let’s start with understanding correlation.
The correlation between two variables can be measured with a correlation coefficient that ranges from -1 to 1. If the value is 0, there is no linear relationship between the two variables (this alone does not guarantee that they are independent). If the value is close to either extreme, the two variables are highly correlated and have a strong linear relationship. This means a change in one variable is associated with a proportional change in the other.
Multicollinearity is a condition when there is a significant dependency or association between the independent variables or the predictor variables. A significant correlation between the independent variables is often the first evidence of the presence of multicollinearity.
How to test Multicollinearity?
Correlation matrix / Correlation plot: A correlation plot can be used to identify the correlation or bivariate relationship between two independent variables.
Variance Inflation Factor (VIF): VIF is used to identify the correlation of one independent variable with a group of other variables.
Consider that we have 9 independent variables (say V1 to V9). To calculate the VIF of variable V1, we isolate V1 and treat it as the target variable, while all the other variables (i.e. V2 to V9) are treated as the predictor variables.
We train a regression model on these predictor variables and find the corresponding R² value.
Using this R² value, we compute the VIF as VIF = 1 / (1 − R²).
It is always desirable to have a VIF value as small as possible. A threshold is also set (5 and 10 are common rules of thumb), and any independent variable whose VIF exceeds the threshold becomes a candidate for removal.
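Here is a small sketch of the VIF formula, VIF = 1 / (1 − R²), for the simplest possible case of only two predictors. With a single other predictor, the R² of the auxiliary regression equals the squared Pearson correlation, so no regression library is needed. The variables v1, v2 and v3 are made-up data. With more predictors you would fit a real multiple regression instead.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


def vif_two_predictors(target, other):
    """VIF of `target` when `other` is the only remaining predictor.
    With one regressor, R^2 of the auxiliary regression equals r^2."""
    r_squared = pearson_r(target, other) ** 2
    return 1.0 / (1.0 - r_squared)


v1 = [1, 2, 3, 4, 5, 6]
v2 = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]  # made-up, roughly 2 * v1
v3 = [3, 1, 4, 1, 5, 2]               # made-up, only weakly related to v1

print(round(vif_two_predictors(v1, v2), 1))  # very large: v1 and v2 are nearly collinear
print(round(vif_two_predictors(v1, v3), 3))  # close to 1: little shared information
```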
How is oversampling different from undersampling?
Oversampling and undersampling are two important techniques used in machine learning classification problems to reduce class imbalance and thereby improve the performance of the model, especially on the minority class.
Classification is nothing but predicting the category of a data point to which it may probably belong by learning about past characteristics of similar instances. When the segregation of classes is not approximately equal then it can be termed a “Class imbalance” problem. To solve this scenario in our data set, we use oversampling and undersampling.
Oversampling is used when the amount of data collected is insufficient. A popular over-sampling technique is SMOTE (Synthetic Minority Over-sampling Technique), which creates synthetic samples by interpolating between a minority-class instance and its nearest minority-class neighbors.
Conversely, if a class of data is the overrepresented majority class, under-sampling may be used to balance it with the minority class. Under-sampling is used when the amount of collected data is sufficient. Common methods of under-sampling include cluster centroids and Tomek links, both of which target potential overlapping characteristics within the collected data sets to reduce the amount of majority data.
Ex: Let’s say that at a bank the majority of the customers are from a specific race and very few customers are from other races. If a model is trained on this data, it is likely to reject loans for the minority race.
So, what should we do about it?
For oversampling, we increase the number of records belonging to the “minority race” category by duplicating them, so that the difference between the numbers of records in the two classes narrows.
For under-sampling, we reduce the number of records belonging to the “majority race” category. The records for deletion are selected strictly through a random process and are not influenced by any constraints or bias.
To conclude, oversampling is preferable as undersampling can result in the loss of important data. Under-sampling is suggested when the amount of data collected is larger than ideal and can help data mining tools to stay within the limits of what they can effectively process.
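Both ideas in their plainest random form (duplication and random deletion, not SMOTE or Tomek links) can be sketched like this. The 90/10 split and the record format are invented for illustration, and the sketch assumes the majority class is at least as large as the minority class.

```python
import random


def random_oversample(majority, minority, seed=0):
    """Duplicate randomly chosen minority records until both classes are the same size."""
    rng = random.Random(seed)
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return majority, minority + extra


def random_undersample(majority, minority, seed=0):
    """Keep a random subset of the majority records, as large as the minority class."""
    rng = random.Random(seed)
    return rng.sample(majority, len(minority)), minority


# Invented imbalanced dataset: 90 majority records, 10 minority records.
majority = [("majority", i) for i in range(90)]
minority = [("minority", i) for i in range(10)]

over_major, over_minor = random_oversample(majority, minority)
under_major, under_minor = random_undersample(majority, minority)

print(len(over_major), len(over_minor))    # 90 90
print(len(under_major), len(under_minor))  # 10 10
```

Whichever you pick, resample only the training split. The validation and test data should keep the original class balance.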
Difference Between Bagging and Boosting?
Bagging and Boosting are two types of Ensemble Learning, which help to improve machine learning results by combining several models. This approach allows the production of better predictive performance compared to a single model.
So, let’s understand the difference between Bagging and Boosting.
Bagging (bootstrap aggregation): It trains homogeneous weak learners independently of each other, in parallel, and combines them by averaging (or voting on) their predictions. Boosting: It also uses homogeneous weak learners, but here the learners are trained sequentially and adaptively, each one improving on the predictions of the previous ones.
If the classifier is unstable (high variance), then we need to apply bagging. If the classifier is steady and straightforward (high bias), then we need to apply boosting.
In bagging, different training data subsets are randomly drawn with replacement from the entire training dataset. In boosting, every new subset contains the elements that were misclassified by previous models.
Bagging is the simplest way of combining predictions that belong to the same type. Boosting is a way of combining predictions that belong to different types.
Each model is built independently for bagging. While in the case of boosting, new models are influenced by the performance of previously built models.
Bagging attempts to tackle the overfitting issue. Boosting tries to reduce bias.
Example: The Random Forest model uses bagging, while AdaBoost uses boosting.
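To show the bagging half in code, here is a toy sketch: each weak learner is a one-feature decision stump trained on its own bootstrap sample, and the ensemble predicts by majority vote. The data, the stump learner and the number of models are all assumptions for illustration. This is not how Random Forest is implemented.

```python
import random
from collections import Counter


def fit_stump(sample):
    """Pick the threshold that misclassifies the fewest points in the sample.
    The stump predicts class 1 when x > threshold, else class 0."""
    best_threshold, best_errors = None, None
    for threshold, _ in sample:
        errors = sum((x > threshold) != bool(y) for x, y in sample)
        if best_errors is None or errors < best_errors:
            best_threshold, best_errors = threshold, errors
    return best_threshold


def bagging_fit(data, n_models=25, seed=0):
    """Train each stump on a bootstrap sample: same size as data, drawn with replacement."""
    rng = random.Random(seed)
    return [fit_stump([rng.choice(data) for _ in data]) for _ in range(n_models)]


def bagging_predict(thresholds, x):
    """Combine the independently trained stumps by majority vote."""
    votes = Counter(int(x > t) for t in thresholds)
    return votes.most_common(1)[0][0]


# Invented 1-D data: (feature, label) pairs, class 1 for larger feature values.
data = [(0.5, 0), (1.0, 0), (1.5, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1), (4.5, 1)]
models = bagging_fit(data)

print(bagging_predict(models, 0.2), bagging_predict(models, 5.0))  # 0 1
```

A boosting version would instead train the stumps one after another and reweight the misclassified points before fitting the next stump.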
If the probability of seeing a shooting star in 15 minutes is 20%, what is the probability of seeing at least one shooting star in one hour?
Here it means, 20% probability = 20/100 = 1/5
Probability of Seeing a Star in 15 minutes = 1/5
Probability of not seeing a Star in 15 minutes = 1 - 1/5 = 4/5
The probability that you see at least one shooting star in the period of an hour
= 1 - Probability of not seeing any Star in 60 minutes
= 1 - Probability of not seeing any Star in 15 * 4 minutes
= 1 - (4/5)⁴
= 1 - 0.4096
= 0.5904
So, assuming the four 15-minute intervals are independent, the probability of seeing at least one shooting star in a period of an hour is 0.5904, or about 59%.
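The same arithmetic as a quick Python check, under the assumption that the four 15-minute intervals are independent:

```python
p_quarter = 0.20            # probability of a shooting star in one 15-minute interval
n_intervals = 60 // 15      # four such intervals in an hour

p_none = (1 - p_quarter) ** n_intervals   # no star in any of the four intervals
p_at_least_one = 1 - p_none

print(round(p_none, 4))          # 0.4096
print(round(p_at_least_one, 4))  # 0.5904
```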
What is statistical power (sensitivity)?
The statistical power of a study (sometimes called sensitivity) is how likely the study is to distinguish an actual effect from one of chance.
It’s the likelihood that the test correctly rejects the null hypothesis when the null hypothesis is false. For example, a study that has 80% power has an 80% chance of producing a significant result when a real effect exists.
High statistical power means that the test is likely to detect a real effect. As the power increases, the probability of making a Type II error decreases.
Low statistical power means that the test may miss a real effect, so non-significant results are inconclusive.
Statistical power helps you to determine if your sample size is large enough.
It is possible to perform a hypothesis test without calculating the statistical power. However, if your sample size is too small, your results may be inconclusive when they might have been conclusive with a large enough sample.
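To see how power grows with the sample size, here is a sketch for one specific and simple case: a one-sided, one-sample z-test with known standard deviation. The effect size of 0.5 and sigma of 1.0 are assumed values. Other tests need their own power formulas.

```python
from statistics import NormalDist


def z_test_power(effect, sigma, n, alpha=0.05):
    """Power of a one-sided one-sample z-test when the true mean shift is `effect`."""
    z = NormalDist()
    critical = z.inv_cdf(1 - alpha)     # rejection cut-off under the null hypothesis
    shift = effect * n ** 0.5 / sigma   # how far the true effect moves the test statistic
    return 1 - z.cdf(critical - shift)


# Assumed effect size and sigma; power rises as the sample size grows.
for n in (10, 30, 100):
    print(n, round(z_test_power(effect=0.5, sigma=1.0, n=n), 3))
```

When the true effect is zero, the function returns alpha, i.e. the Type I error rate, which is a handy sanity check.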
What is Bayes’ Theorem? How is it useful in a machine learning context?
Bayes' theorem, also known as Bayes' rule or Bayes' law, is a theorem in statistics that describes the probability of one event or condition as it relates to another known event or condition.
Mathematically, P(A|B) = P(B|A) · P(A) / P(B). In a testing setting, this is the true positive rate of a condition sample divided by the sum of the false positive rate of the population and the true positive rate of the condition sample.
Let’s understand this theorem with an example:
For instance, say a flu test comes back positive for 60% of the people who actually have the flu, but it also comes back (falsely) positive for 50% of the people who do not have the flu, and the overall population only has a 5% chance of having the flu. Would you actually have a 60% chance of having the flu after a positive test?
Bayes’ Theorem says no. It says that you have a (0.6 × 0.05) (true positive rate of a condition sample) / [(0.6 × 0.05) (true positive rate of a condition sample) + (0.5 × 0.95) (false positive rate of a population)] = 0.03 / 0.505 = 0.0594, or a 5.94% chance of actually having the flu.
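The flu calculation can be reproduced in a few lines. The function and parameter names are my own. Here the 0.6 is read as the test's true positive rate and the 0.5 as its false positive rate.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_positives = sensitivity * prior                   # P(positive and condition)
    false_positives = false_positive_rate * (1 - prior)    # P(positive and no condition)
    return true_positives / (true_positives + false_positives)


p_flu = posterior(prior=0.05, sensitivity=0.6, false_positive_rate=0.5)
print(round(p_flu, 4))  # 0.0594
```

The low prior (5%) dominates: even after a positive result, most positives come from the much larger group of people without the flu.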
Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. It is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class.
Naive Bayes uses a similar method to predict the probability of different classes based on various attributes. This algorithm is mostly used in text classification and in problems with multiple classes.
What are Loss Functions and Cost Functions? Explain the key Difference Between them?
When we calculate the error for only a single data point, we use the term loss function.
Whereas, when we calculate the sum (or average) of the errors over multiple data points, we use the term cost function.
In other words, the loss function is to capture the difference between the actual and predicted values for a single record whereas cost functions aggregate the difference for the entire training dataset.
The most commonly used loss functions are mean-squared error and hinge loss.
Mean-Squared Error (MSE): In simple words, it measures how far our model’s predicted values are from the actual values, as the average of the squared differences.
Hinge loss: It is used to train machine learning classifiers such as support vector machines, and is defined as
L(y) = max(0, 1 − t·y)
where t = −1 or 1 indicates the two classes and y represents the raw output of the classifier. A cost function then aggregates such per-record losses over the training set, typically as their sum or average.
There are many cost functions in machine learning and each has its use cases depending on whether it is a regression problem or classification problem.
Regression cost function:
Regression models are used to forecast a continuous variable, such as an employee's pay, the cost of a car, the likelihood of obtaining a loan, and so on. Their cost functions are built on the distance-based error:
Error = y − y′
where y is the actual output and y′ is the predicted output.
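The loss versus cost distinction is easy to see in code: the loss scores one record, and the cost averages that loss over a dataset. The targets and predictions below are made-up numbers.

```python
def squared_error(y_true, y_pred):
    """Loss for a single regression record."""
    return (y_true - y_pred) ** 2


def hinge_loss(label, score):
    """Loss for a single classification record; label is -1 or +1,
    score is the raw (unthresholded) classifier output."""
    return max(0.0, 1.0 - label * score)


def cost(loss_fn, targets, predictions):
    """Cost: the mean of the per-record losses over the whole dataset."""
    losses = [loss_fn(t, p) for t, p in zip(targets, predictions)]
    return sum(losses) / len(losses)


# Made-up targets and predictions.
mse = cost(squared_error, [3.0, 5.0, 2.0], [2.0, 5.0, 4.0])   # (1 + 0 + 4) / 3
avg_hinge = cost(hinge_loss, [1, -1, 1], [2.0, -0.5, -1.0])   # (0 + 0.5 + 2) / 3

print(round(mse, 4), round(avg_hinge, 4))
```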
How would you evaluate a logistic regression model?
Model evaluation is a very important part of any analysis, as it answers the following questions:
How well does the model fit the data? Which predictors are most important? Are the predictions accurate?
So, the following are the criteria to assess the model performance:
1. Akaike Information Criteria (AIC): In simple terms, AIC estimates the relative amount of information lost by a given model. So, the less information lost the higher the quality of the model. Therefore, we always prefer models with minimum AIC.
2. Receiver operating characteristic (ROC) curve: The ROC curve illustrates the diagnostic ability of a binary classifier. It is created by plotting the true positive rate against the false positive rate at various threshold settings. The performance metric of the ROC curve is AUC (area under the curve). The higher the area under the curve, the better the predictive power of the model.
3. Confusion Matrix: In order to find out how well the model does in predicting the target variable, we use a confusion matrix/ classification rate. It is nothing but a tabular representation of actual Vs predicted values which helps us to find the accuracy of the model.
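Here is a small sketch of the confusion matrix and AUC computed from predicted probabilities, such as those a logistic regression model would output. The labels and probabilities are invented. The AUC is computed through its rank interpretation (the chance that a random positive is scored above a random negative) instead of by integrating the curve.

```python
def confusion_matrix(y_true, y_prob, threshold=0.5):
    """Return (tp, fp, tn, fn) after thresholding the predicted probabilities."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted = 1 if p >= threshold else 0
        if predicted == 1 and y == 1:
            tp += 1
        elif predicted == 1 and y == 0:
            fp += 1
        elif predicted == 0 and y == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def roc_auc(y_true, y_prob):
    """AUC as the probability that a random positive outranks a random negative
    (ties count as half)."""
    pos = [p for y, p in zip(y_true, y_prob) if y == 1]
    neg = [p for y, p in zip(y_true, y_prob) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


# Invented labels and predicted probabilities.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_prob = [0.9, 0.8, 0.4, 0.3, 0.6, 0.2, 0.7, 0.1]

tp, fp, tn, fn = confusion_matrix(y_true, y_prob)
accuracy = (tp + tn) / len(y_true)
print((tp, fp, tn, fn), accuracy, roc_auc(y_true, y_prob))
```

Accuracy depends on the 0.5 threshold chosen here, while AUC summarises the ranking quality across all thresholds, which is why the two are usually reported together.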
These are less a list of questions than a set of important concepts: we need to understand how they are used rather than just going through the theory and the coding part of data science. Hope it’s informative.