Data Science Interview Questions For Beginners
1. Are the expected value and mean value different?
Answer: They are not different, but the terms are used in different contexts. Mean is generally referred to when talking about a probability distribution or sample population, whereas expected value is generally referred to in a random variable context.
For sampling data:
The mean value is the single value computed from one sample of the data.
The expected value is the mean of all the sample means, i.e. the value that is built from multiple samples. The expected value is the population mean.
Mean value and Expected value are the same irrespective of the distribution, under the condition that the distribution is in the same population.
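As a minimal sketch of this distinction, the following simulation rolls a fair six-sided die. The die, the sample sizes, and the seed are assumptions chosen only for illustration. Individual sample means scatter around the expected value, while the mean of many sample means lands very close to it.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed, only for reproducibility

# Expected value of a fair die: sum of each face times its probability
faces = np.arange(1, 7)
expected_value = float(np.sum(faces * (1 / 6)))

# Draw 2000 independent samples of 50 rolls each and take each sample's mean
rolls = rng.integers(low=1, high=7, size=(2000, 50))
sample_means = rolls.mean(axis=1)

first_sample_mean = float(sample_means[0])   # one sample: close to, but not exactly, 3.5
mean_of_means = float(sample_means.mean())   # many samples: very close to 3.5

print(expected_value, first_sample_mean, mean_of_means)
```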
2. What is the difference between Supervised Learning and Unsupervised Learning?
Answer: If an algorithm learns something from labeled training data so that the knowledge can be applied to the test data, then it is referred to as supervised learning. Classification is an example of supervised learning. If the algorithm does not learn anything beforehand because there is no response variable or labeled training data, then it is referred to as unsupervised learning. Clustering is an example of unsupervised learning.
3. Explain the Box-Cox transformation in regression models?
Answer: For one reason or another, the response variable in a regression analysis might not satisfy one or more assumptions of ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique for transforming non-normal dependent variables into a normal shape. Many statistical techniques assume normality, so if the given data is not normal, applying a Box-Cox transformation means that you can run a broader number of tests.
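The effect can be sketched with scipy.stats.boxcox, which estimates the transformation parameter lambda by maximum likelihood. The log-normal data below is an assumption made for illustration. Note that Box-Cox requires strictly positive values.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=42)

# Hypothetical response variable: strictly positive and right-skewed
y = rng.lognormal(mean=0.0, sigma=1.0, size=1000)

# boxcox returns the transformed data and the fitted lambda
y_transformed, fitted_lambda = stats.boxcox(y)

skew_before = float(stats.skew(y))
skew_after = float(stats.skew(y_transformed))

print("lambda:", round(float(fitted_lambda), 3))
print("skewness before:", round(skew_before, 3), "after:", round(skew_after, 3))
```

A skewness near zero after the transformation suggests the response is much closer to a symmetric, normal-like shape.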
4. How will you define the number of clusters in a clustering algorithm?
Answer: Though the Clustering Algorithm is not specified, this question will mostly be asked in reference to K-Means clustering where “K” defines the number of clusters. The objective of clustering is to group similar entities in a way that the entities within a group are similar to each other but the groups are different from each other.
For example, the following image shows three different groups.
[Figure: K-Means clustering machine learning algorithm]
Within Sum of Squares (WSS) is generally used to explain the homogeneity within a cluster. If you plot WSS against the number of clusters, you get a graph generally known as the Elbow Curve. The number of clusters is usually chosen at the bend (the elbow), beyond which adding clusters no longer reduces WSS sharply.
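A minimal sketch of computing WSS for a range of cluster counts, assuming scikit-learn is available. The three synthetic blobs and their centers are hypothetical. KMeans exposes WSS as the inertia_ attribute.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Hypothetical data: three well-separated groups in two dimensions
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [6, 6], [0, 6]],
    cluster_std=0.8,
    random_state=0,
)

# Within Sum of Squares for k = 1..6
wss = {}
for k in range(1, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss[k] = model.inertia_

for k, value in wss.items():
    print(k, round(value, 1))
```

Plotting wss against k for this data would show a sharp drop up to k = 3 and only small gains afterwards, which is the elbow.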
5. Why does L1 regularization cause parameter sparsity whereas L2 regularization does not?
Answer: Regularization in statistics and in machine learning is used to include some extra information in order to solve a problem in a better way. L1 and L2 regularization are generally used to add constraints to optimization problems.
[Figure: L1 and L2 regularization]
In the figure, H0 is a hypothesis. With L1, the solution is highly likely to hit one of the corners of the constraint region, where some parameters are exactly zero, while with L2 it is not. This is why L1 results in sparsity.
In other words, L2 penalizes the square of each parameter, so it shrinks large parameters strongly but has little incentive to push small ones all the way to zero.
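The difference can be observed with a small sketch, assuming scikit-learn's Lasso (L1 penalty) and Ridge (L2 penalty). The synthetic data, in which only 3 of 20 features matter, and the alpha values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(seed=0)

# Hypothetical data: 20 features, only the first 3 influence the response
n_samples, n_features = 200, 20
X = rng.normal(size=(n_samples, n_features))
true_coef = np.zeros(n_features)
true_coef[:3] = [4.0, -3.0, 2.0]
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

# Count coefficients that were driven exactly to zero
lasso_zeros = int(np.sum(lasso.coef_ == 0.0))
ridge_zeros = int(np.sum(ridge.coef_ == 0.0))

print("zero coefficients with L1:", lasso_zeros)
print("zero coefficients with L2:", ridge_zeros)
```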
6. What are the feature vectors?
Answer: A feature vector is an n-dimensional vector of numerical features that represent some object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics, called features, of an object in a mathematical, easily analyzable way.
7. Differentiate between univariate, bivariate and multivariate analysis?
Answer: These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.
If the analysis attempts to understand the relationship between two variables at a time, as in a scatterplot, then it is referred to as bivariate analysis. For example, analyzing the volume of sales against spending can be considered an example of bivariate analysis.
Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.
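8. Can you cite some examples where a false negative is more important than a false positive?
Answer: Example 1: Assume there is an airport ‘A’ which has received high-security threats, and based on certain characteristics they identify whether a particular passenger can be a threat or not. Due to a shortage of staff, they decide to scan only the passengers predicted as risk positives by their predictive model. If the model flags a passenger who is a true threat as a non-threat (a false negative), that passenger is never scanned, which is far more costly than scanning a harmless passenger by mistake (a false positive).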
9. What is the difference between extrapolation and interpolation?
Answer: You have a list of values, and when you want to estimate a value that lies between two known values from the list, it’s called interpolation. When you extend known sets of facts or values to approximate a value beyond their range, it’s called extrapolation.
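A minimal NumPy sketch of both ideas, using hypothetical known points that lie on a straight line. np.interp performs linear interpolation between known points; it does not extrapolate (it clamps to the end values), so the extrapolation here uses a fitted line instead.

```python
import numpy as np

# Hypothetical known values
x_known = np.array([1.0, 2.0, 3.0, 4.0])
y_known = np.array([10.0, 20.0, 30.0, 40.0])

# Interpolation: estimate y at x = 2.5, which lies between known points
y_interp = float(np.interp(2.5, x_known, y_known))

# Extrapolation: fit a straight line and evaluate it at x = 6.0,
# which lies beyond the range of the known points
slope, intercept = np.polyfit(x_known, y_known, deg=1)
y_extrap = float(slope * 6.0 + intercept)

print(y_interp, round(y_extrap, 6))
```

Extrapolated values are generally less reliable than interpolated ones, because they assume the fitted trend continues outside the observed range.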
10. What is the purpose of A/B testing?
Answer: The purpose of A/B testing is to generate crucial insights by testing two variants (A and B) of a purpose-driven campaign. The purpose is to identify which variant performs better than the other and achieves a set goal. This paves the way for informed decisions.
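One common way to decide whether variant B really outperforms variant A is a two-proportion z-test on conversion rates. The sketch below is one possible approach, not the only one. The function name and the visitor and conversion counts are hypothetical.

```python
import math
from scipy.stats import norm


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * norm.sf(abs(z))
    return z, p_value


# Hypothetical campaign: 5000 visitors per variant
z, p_value = two_proportion_z_test(conv_a=200, n_a=5000, conv_b=260, n_b=5000)
print(round(z, 3), round(p_value, 4))
```

A small p-value suggests the observed difference between the variants is unlikely to be due to chance alone.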
11. How is a mean value different from an expected value?
Answer: Mean and expected values are similar but are used in different contexts. While expected values are usually referred to in a random variable context, mean values are referred to in the context of a sample population or probability distribution.
12. If you had to choose between the programming languages R and Python, which one would you use for text analytics?
13. What is meant by feature vectors?
Answer: A feature vector is an n-dimensional vector of numerical features that represents some object. In machine learning, feature vectors are used to represent objects in a numerical form that algorithms can process.
14. Explain Data Science Vs Machine Learning?
Answer: Machine learning and statistics are both part of data science. The term machine learning itself implies that the algorithms depend on some data, which is used as a training set to fine-tune model or algorithm parameters.
In particular, data science also covers:
automating machine learning.
dashboards and BI.
deployment in production mode.
automated, data-driven decisions.
15. What does Machine Learning mean for the Future of Data Science?
Answer: “Data science includes machine learning.”
Basically, machine learning is the ability of a machine to generalize knowledge from data (call it learning). Without data, there is little that machines can learn.
The increasing use of machine learning in different industries will act as a catalyst that pushes data science to greater relevance. Machine learning is only as good as the data it is given and the ability of algorithms to consume it. My expectation is that, moving forward, a basic level of machine learning will become a standard requirement for data scientists.
16. What is meant by logistic regression?
Answer: It is a method for fitting a regression curve, y = f(x) when y is a categorical variable.
It is a classification algorithm. We use it to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent the binary/categorical outcome, we use dummy variables.
It is a regression model in which the response variable has categorical values such as True/False or 0/1. Thus, it actually measures the probability of a binary response.
To perform logistic regression in R, use the command:
glm(response ~ explanatory_variables, family = binomial)
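For readers working in Python, a roughly comparable sketch is shown below, assuming scikit-learn. The simulated data is hypothetical. Unlike R's glm, scikit-learn's LogisticRegression applies L2 regularization by default, so the fitted coefficients will not match glm output exactly.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(seed=1)

# Hypothetical data: one explanatory variable and a binary (0/1) response
x = rng.normal(size=(500, 1))
true_prob = 1 / (1 + np.exp(-(0.5 + 2.0 * x[:, 0])))
y = rng.binomial(n=1, p=true_prob)

model = LogisticRegression().fit(x, y)

# Predicted probability of the outcome 1 for two new observations
predicted_prob = model.predict_proba(np.array([[-2.0], [2.0]]))[:, 1]
print(predicted_prob)
```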
17. What is meant by Poisson regression?
Answer: Data is often collected as counts, so many discrete response variables have counts as possible outcomes. Binomial counts are the number of successes in a fixed number of trials, n.
Poisson counts are the number of occurrences of some event in a certain interval of time (or space). Unlike binomial counts, which only take values between 0 and n, Poisson counts have no upper bound.
To perform Poisson regression in R, we use the command:
glm(response ~ explanatory_variables, family = poisson)
18. What are the Principal components?
Answer: A principal component is a normalized linear combination of the original predictors in a data set. We can write the first principal component in the following way:
$Z^1 = \Phi^{11}X^1 + \Phi^{21}X^2 + \Phi^{31}X^3 + \dots + \Phi^{p1}X^p$
where:
$Z^1$ is the first principal component.
$\Phi^{p1}$ is the loading vector comprising the loadings ($\Phi^{11}, \Phi^{21}, \dots$) of the first principal component. The loadings are constrained so that their sum of squares equals 1, because arbitrarily large loadings would otherwise lead to arbitrarily large variance. The loading vector also defines the direction of the principal component ($Z^1$) along which the data varies the most. It results in a line in p-dimensional space which is closest to the n observations, where closeness is measured using average squared Euclidean distance.
$X^1, \dots, X^p$ are the normalized predictors. Normalized predictors have a mean of zero and a standard deviation of one.
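The definition can be illustrated with a small NumPy sketch. The three synthetic predictors are assumptions for illustration. The loadings of the first principal component are taken as the eigenvector of the correlation matrix with the largest eigenvalue.

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Hypothetical predictors: x2 is strongly related to x1, x3 is independent
n = 500
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + 0.2 * rng.normal(size=n)
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])

# Normalize predictors: mean zero, standard deviation one
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

# eigh returns eigenvalues in ascending order, so the last column
# holds the loadings of the first principal component
corr = np.corrcoef(X_norm, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(corr)
loadings = eigenvectors[:, -1]

# First principal component: linear combination of normalized predictors
z1 = X_norm @ loadings

print("sum of squared loadings:", round(float(np.sum(loadings**2)), 6))
print("variance of Z1:", round(float(z1.var()), 4))
```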
19. Name the methods for principal component analysis and explain them?
Answer: There are two methods:
a. Spectral decomposition: examines the covariances/correlations between variables.
b. Singular value decomposition: examines the covariances/correlations between individuals.
We use the function princomp() for the spectral approach, and the functions prcomp() and PCA() (from the FactoMineR package) for the singular value decomposition.
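As a sketch of how the two approaches relate, the following NumPy example (on hypothetical random data) computes the component variances both ways and shows that they agree. It illustrates the underlying linear algebra, not the internals of the R functions.

```python
import numpy as np

rng = np.random.default_rng(seed=3)

# Hypothetical data: 100 individuals, 4 correlated variables
X = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Xc = X - X.mean(axis=0)  # center each variable

# Spectral approach: eigenvalues of the covariance matrix of the variables
cov = np.cov(Xc, rowvar=False)
eig_vals = np.linalg.eigvalsh(cov)[::-1]  # sort descending

# SVD approach: singular values of the centered data matrix
singular_values = np.linalg.svd(Xc, compute_uv=False)
svd_vals = singular_values**2 / (Xc.shape[0] - 1)

print(np.round(eig_vals, 4))
print(np.round(svd_vals, 4))
```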
20. Explain types of statistical data?
Answer: Whenever we are working with statistics it’s very important to recognize the different types of data:
1. numerical (discrete and continuous), and
2. categorical.
Data are the actual pieces of information that you collect through your study.
Most data fall into one of two groups: numerical or categorical.
21. What is the correlation in R?
Answer: Correlation is a technique for investigating the relationship between two quantitative, continuous variables. In R it can be computed with the cor() function.
Positive correlation – both variables increase or decrease together.
Negative correlation – as one variable increases, the other decreases.
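A small Python sketch of both cases, using NumPy's corrcoef on hypothetical simulated variables:

```python
import numpy as np

rng = np.random.default_rng(seed=7)

# Hypothetical continuous variables
x = rng.normal(size=300)
noise = rng.normal(scale=0.5, size=300)
y_pos = 2.0 * x + noise    # increases with x
y_neg = -2.0 * x + noise   # decreases as x increases

# Pearson correlation coefficient is the off-diagonal entry
r_pos = float(np.corrcoef(x, y_pos)[0, 1])
r_neg = float(np.corrcoef(x, y_neg)[0, 1])

print(round(r_pos, 3), round(r_neg, 3))
```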
22. What is meant by lattice graphs?
Answer: The lattice package, written by Deepayan Sarkar, attempts to improve on base R graphics by providing better defaults and the ability to easily display multivariate relationships. In particular, the package supports the creation of trellis graphs: graphs that display a variable, or the relationship between variables, conditioned on one or more other variables.
The typical format is:
graph_type(formula, data=)
Here graph_type is one of the lattice plotting functions (for example xyplot or bwplot), and formula specifies the variable(s) to display and any conditioning variables.
23. Compare SAS, R, Python, Perl?
Answer: a) SAS is commercial software. It is expensive and still beyond reach for most professionals (in an individual capacity). However, it holds the highest market share in private organizations. So, unless you are in an organization that has invested in SAS, it might be difficult to access. R and Python, on the other hand, are free and can be downloaded by anyone.
b) SAS is easy to learn and provides an easy option (PROC SQL) for people who already know SQL. Even otherwise, it has a good, stable GUI. In terms of resources, there are tutorials available on the websites of various universities, and SAS has comprehensive documentation. There are certifications from SAS training institutes, but they again come at a cost. R has the steepest learning curve among the 3 languages listed here. It requires you to learn and understand coding, and because it works at a lower level than SAS procedures, simple procedures can take longer code. Python is known for its simplicity in the programming world. This remains true for data analysis as well.
c) SAS has decent functional graphical capabilities. However, they are just functional. Any customization of plots is difficult and requires you to understand the intricacies of the SAS Graph package. R has the most advanced graphical capabilities among the three. There are numerous packages which provide advanced graphical capabilities. Python's capabilities lie somewhere in between, with options to use native libraries (matplotlib) or derived libraries (allowing calls to R functions).
d) All 3 ecosystems have all the basic and most needed functions available. The difference only matters if you are working on the latest technologies and algorithms. Due to their open nature, R and Python get the latest features quickly (R more so compared to Python). SAS, on the other hand, updates its capabilities in new version roll-outs. Since R has been used widely in academia in the past, the development of new techniques in it is fast.
24. How do Data Scientists Code in R?
Answer: R is a popular open source programming environment for statistics and data mining. The good news is that it is easily integrated into ML Studio. I have a lot of friends using functional languages for machine learning, such as F#. It’s pretty clear, however, that R is dominant in this space. Polls and surveys of data miners show that R’s popularity has increased substantially in recent years. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which John Chambers is a member. R is named partly after the first names of the first two R authors. R is a GNU project and is written primarily in C, Fortran, and R.
25. What are the most important machine learning techniques?
Answer: ... C will also happen).
In Clustering, computers learn how to partition observations into various subsets so that each partition is made up of similar observations according to some well-defined metric. Algorithms like K-Means and DBSCAN belong to this class.
In Density estimation, computers learn how to find statistical values that describe data. Algorithms like Expectation-Maximization belong to this class.
26. Can you provide an example of features extraction?
Answer: Let’s suppose that we want to perform machine learning on textual files. The first step is to extract meaningful feature vectors from a text. A typical representation is the so-called bag of words where:
1. Each word w in the text collection is associated with a unique integer wordId(w) assigned to it.
2. For each document i, the number of occurrences of each word w is computed and this value is stored in a matrix M(i, j), where j = wordId(w). Please note that M is typically a sparse matrix because when a word is not present in a document, its count will be zero.
Numpy, Scikit-Learn, and Spark all support sparse vectors. Let’s see an example where we start to load a dataset made up of Usenet articles, where the alt.atheism category is considered and the collection of text documents is converted into a matrix of token counts. We then print the wordId(‘man’).
from sklearn.datasets import fetch_20newsgroups
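Only the import line of the 20 newsgroups listing appears in this text, so the following is a minimal sketch of the same bag-of-words idea on a tiny hypothetical corpus rather than the Usenet data. It assumes scikit-learn's CountVectorizer, whose vocabulary_ attribute plays the role of wordId(w).

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-document corpus (stands in for the Usenet articles)
documents = [
    "the man saw the dog",
    "the dog saw the cat",
    "a man and a cat",
]

vectorizer = CountVectorizer()
M = vectorizer.fit_transform(documents)  # sparse matrix of token counts

# Unique integer assigned to the word "man"
word_id = vectorizer.vocabulary_["man"]

# Column word_id of M holds the count of "man" in each document
man_counts = M[:, word_id].toarray().ravel().tolist()

print("wordId('man') =", word_id)
print("counts per document:", man_counts)
```

With its default settings CountVectorizer ignores single-character tokens such as "a", so they do not appear in the vocabulary.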
Two additional observations can be highlighted here: first, the training set, the validation set, and the test set are all sampled from the same gold set, but those samples are independent. Second, it has been assumed that the learned model can be described by means of two different functions f and h combined by using a set of hyper-parameters A.
Unsupervised machine learning consists of test and application phases only, because there is no model to be learned a priori. In fact, unsupervised algorithms adapt dynamically to the observed data.
27. Why are vectors and norms used in machine learning?
Answer: Objects such as movies, songs, and documents are typically represented by means of vectors of features. Those features are a synthetic summary of the most salient and discriminative characteristics of the object. Given a collection of vectors (the so-called vector space) V, a norm on V is a function $p: V \to \mathbb{R}$ satisfying the following properties. For all complex numbers a and all $u, v \in V$:
1. $p(av) = |a| \, p(v)$
2. $p(u + v) \le p(u) + p(v)$
3. If $p(v) = 0$ then v is the zero vector
The intuitive notion of length for a vector is captured by
$\|x\|_2 = \sqrt{x_1^2 + \dots + x_n^2}$
More generally we have
$\|x\|_p = \left( \sum_{i=1}^{n} |x_i|^p \right)^{1/p}$
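These definitions can be checked numerically with numpy.linalg.norm. The vectors below are arbitrary examples chosen for illustration.

```python
import numpy as np

x = np.array([3.0, -4.0])

l1 = float(np.linalg.norm(x, ord=1))         # sum of absolute values
l2 = float(np.linalg.norm(x))                # Euclidean length (default)
linf = float(np.linalg.norm(x, ord=np.inf))  # largest absolute value

# Check the first two norm properties for the Euclidean norm
u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])
homogeneity_holds = bool(np.isclose(np.linalg.norm(-2.0 * x), 2.0 * np.linalg.norm(x)))
triangle_holds = bool(np.linalg.norm(u + v) <= np.linalg.norm(u) + np.linalg.norm(v))

print(l1, l2, linf, homogeneity_holds, triangle_holds)
```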
28. What are Numpy, Scipy, and Spark essential datatypes?
Answer: Numpy provides efficient support for storing vectors and matrices and for linear algebra operations. For instance: dot(a, b[, out]) is the dot product of two vectors, while inner(a, b) and outer(a, b[, out]) are respectively the inner and outer products.
Scipy provides support for sparse matrices and vectors with multiple storage strategies in order to save space when dealing with zero entries. In particular, the COOrdinate format specifies the non-zero value v for the coordinates (i, j), while the Compressed Sparse Column matrix (CSC) satisfies the relationship M[row_ind[k], col_ind[k]] = data[k].
Spark has many native datatypes for local and distributed computations. The primary data abstraction is a distributed collection of items called “Resilient Distributed Dataset (RDD)”. RDDs can be created from Hadoop InputFormats or by transforming other RDDs. Numpy arrays, Python lists and Scipy CSC sparse matrices are all supported. In addition, MLlib, the Spark library for machine learning, supports SparseVectors and LabeledPoint, i.e. local vectors, either dense or sparse, associated with a label/response.
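import numpy as np
from scipy.sparse import csr_matrix
# 3x3 sparse matrix in Compressed Sparse Row format
M = csr_matrix([[4, 1, 0], [4, 0, 3], [0, 0, 1]])
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
label = 0.0
# sparse vector of size 3 with non-zero entries at indices 0 and 2
point = LabeledPoint(label, SparseVector(3, [0, 2], [1.0, 3.0]))
# sc is assumed to be an existing SparkContext (for example, the one in the pyspark shell)
textRDD = sc.textFile("README.md")
print(textRDD.count())  # count items in RDD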
29. Can you provide examples for other computations in Spark?
Answer: The first code fragment is an example of map-reduce, where we want to find the line with the most words in a text. First, each line is mapped into the number of words it contains. Then those numbers are reduced and the maximum is taken.
Pretty simple: a single line of code does something that requires hundreds of lines in other parallel paradigms such as Hadoop.
Spark supports two types of operations: transformations, which create a new RDD dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. All transformations in Spark are lazy because they postpone computation as much as possible until the results are really needed by the program. This allows Spark to run efficiently. For example, Spark can realize that an RDD created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset. Intermediate results can be persisted and cached.
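The fragment itself is not reproduced in this text. As a sketch of the same map-then-reduce pattern, the example below uses plain Python on a hypothetical list of lines so that it runs without a Spark cluster. The comment shows how the same pattern is commonly written in PySpark, assuming an RDD named textRDD.

```python
from functools import reduce

# Hypothetical lines of text (in Spark these would be the items of an RDD)
lines = [
    "spark is fast",
    "resilient distributed datasets are the primary abstraction in spark",
    "hello world",
]

# Map: each line becomes the number of words it contains
word_counts = map(lambda line: len(line.split()), lines)

# Reduce: keep the larger of each pair of counts, leaving the maximum
max_words = reduce(lambda a, b: a if a > b else b, word_counts)

# Equivalent pattern in PySpark (assuming textRDD already exists):
# textRDD.map(lambda line: len(line.split())).reduce(lambda a, b: a if a > b else b)

print(max_words)
```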
30. How does data cleaning play a vital role in the analysis?
Answer: Data cleaning can help in the analysis because:
Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
Data Cleaning helps to increase the accuracy of the model in machine learning.
It is a cumbersome process because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data generated by these sources.
It might take up to 80% of the time for just cleaning data making it a critical part of the analysis task.
31. What is Cluster Sampling?
Answer: Cluster sampling is a technique used when it becomes difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample where each sampling unit is a collection or cluster of elements.
32. What are Eigenvectors and Eigenvalues?
Answer: Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.
Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
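A short NumPy sketch makes this concrete. The symmetric matrix below is a hypothetical stand-in for a covariance matrix. Applying the matrix to an eigenvector only rescales it by the corresponding eigenvalue.

```python
import numpy as np

# Hypothetical symmetric matrix, standing in for a covariance matrix
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# eigh is intended for symmetric matrices; eigenvalues come back ascending
eigenvalues, eigenvectors = np.linalg.eigh(A)

for i in range(len(eigenvalues)):
    v = eigenvectors[:, i]
    # A @ v points in the same direction as v, scaled by the eigenvalue
    print(eigenvalues[i], v, A @ v)
```

Here the direction (1, 1) is stretched by a factor of 3, while the direction (1, -1) is left at its original length (factor 1).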
33. Can you explain the difference between a Validation Set and a Test Set?
Answer: A validation set can be considered part of the training set, as it is used for parameter selection and to avoid overfitting of the model being built.
On the other hand, a test set is used for testing or evaluating the performance of a trained machine learning model.
In simple terms, the differences can be summarized as follows: the training set is used to fit the parameters (i.e. weights), the validation set is used to tune them and guard against overfitting, and the test set is used to assess the performance of the model (i.e. its predictive power and generalization).
34. Explain cross-validation?
Answer: Cross-validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.
The goal of cross-validation is to define a data set to test the model in the training phase (i.e. a validation data set) in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
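As a minimal sketch, 5-fold cross-validation can be run with scikit-learn's cross_val_score. The synthetic classification data and the choice of logistic regression as the model are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical data set: 300 observations, 5 features, binary target
X, y = make_classification(
    n_samples=300,
    n_features=5,
    n_informative=3,
    n_redundant=0,
    n_clusters_per_class=1,
    class_sep=2.0,
    random_state=0,
)

# Each of the 5 folds is held out once as validation data
# while the model is trained on the remaining 4 folds
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
mean_accuracy = float(scores.mean())

print(scores, round(mean_accuracy, 3))
```

The spread of the per-fold scores gives a rough sense of how stable the estimate is.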
35. A certain couple tells you that they have two children, at least one of whom is a girl. What is the probability that they have two girls?
Answer: In the case of two children, there are 4 equally likely possibilities
BB, BG, GB, and GG;
where B = Boy and G = Girl and the first letter denotes the first child.
From the question, we can exclude the first case of BB. Thus from the remaining 3 possibilities of BG, GB & GG, we have to find the probability of the case with two girls.
Thus, P(Having two girls given one girl) = 1/3
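The same reasoning can be checked by enumerating the equally likely outcomes in Python:

```python
from fractions import Fraction
from itertools import product

# All equally likely (first child, second child) combinations: BB, BG, GB, GG
outcomes = list(product("BG", repeat=2))

# Condition on "at least one girl", which removes BB
at_least_one_girl = [o for o in outcomes if "G" in o]

# Among the remaining outcomes, count those with two girls
two_girls = [o for o in at_least_one_girl if o == ("G", "G")]

probability = Fraction(len(two_girls), len(at_least_one_girl))
print(probability)
```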
36. Why is it mandatory to clean a data set?
Answer: Cleaning data puts it into a format that allows data scientists to work on it. This is crucial because if data sets are not cleaned, they may lead to biased information that can alter business decisions. Data scientists spend over 80% of their time cleaning data.
37. What are the steps involved in analytics projects?
Any analytics problem involves the following steps:
Understanding a business problem
Data preparation for modeling
Running the model and analysis of results
Model validation using new data sets
Model implementation and tracking of results for a set period of time.
38. For linear regression, what are some of the assumptions a data scientist is most likely to make?
Answer: Some of the assumptions include the following:
No or little multicollinearity
39. What is the difference between skewed and uniform distribution?
Answer: When the observations in a dataset are spread equally across the range of the distribution, it is referred to as a uniform distribution. There are no clear peaks in a uniform distribution. Distributions that have more observations on one side of the graph than the other are referred to as skewed distributions. Distributions with fewer observations on the left (towards lower values) are said to be skewed left, and distributions with fewer observations on the right (towards higher values) are said to be skewed right.
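A quick way to see the difference is to compute the sample skewness with scipy.stats.skew. The simulated samples below are assumptions for illustration: a uniform sample, an exponential sample (long tail to the right), and its mirror image (long tail to the left).

```python
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(seed=5)

uniform_sample = rng.uniform(0, 1, size=5000)        # spread evenly, no peak
right_skewed = rng.exponential(scale=1.0, size=5000)  # few observations at high values
left_skewed = -right_skewed                           # few observations at low values

skew_uniform = float(skew(uniform_sample))
skew_right = float(skew(right_skewed))
skew_left = float(skew(left_skewed))

print(round(skew_uniform, 3), round(skew_right, 3), round(skew_left, 3))
```

A skewness near zero indicates a symmetric distribution, a positive value indicates skew to the right, and a negative value indicates skew to the left.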