1. Are expected value and mean value different ? (Data Science Interview Questions For Beginners)
They are not different, but the terms are used in different contexts. Mean is generally referred to when talking about a probability distribution or a sample population, whereas expected value is generally referred to in the context of a random variable.
For Sampling Data
The mean value is the only value that comes from the sampled data itself.
The expected value is the mean of all the sample means, i.e. the value that is built from multiple samples; the expected value is the population mean.
The mean value and the expected value are the same irrespective of the distribution, under the condition that the samples come from the same population.
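The distinction can be illustrated with a small simulation (a sketch using only the standard library; the fair-die example is illustrative, not from the original):

```python
import random

random.seed(0)

# Expected value of a fair six-sided die: sum over x of x * P(x)
expected = sum(x * (1 / 6) for x in range(1, 7))  # 3.5

# The mean of a single sample varies around the expected value
rolls = [random.randint(1, 6) for _ in range(10)]
sample_mean = sum(rolls) / len(rolls)

# The mean of many sample means converges to the expected value
sample_means = []
for _ in range(5000):
    sample = [random.randint(1, 6) for _ in range(10)]
    sample_means.append(sum(sample) / len(sample))
mean_of_means = sum(sample_means) / len(sample_means)
```

Any single `sample_mean` can land anywhere between 1 and 6, but `mean_of_means` settles very close to the expected value 3.5.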
2. What is the difference between Supervised Learning and Unsupervised Learning ?
If an algorithm learns something from the training data so that the knowledge can be applied to the test data, then it is referred to as Supervised Learning. Classification is an example for Supervised Learning. If the algorithm does not learn anything beforehand because there is no response variable or any training data, then it is referred to as unsupervised learning. Clustering is an example for unsupervised learning.
3. Explain about the box cox transformation in regression models ?
For some reason or the other, the response variable for a regression analysis might not satisfy one or more assumptions of an ordinary least squares regression. The residuals could either curve as the prediction increases or follow a skewed distribution. In such scenarios, it is necessary to transform the response variable so that the data meets the required assumptions. A Box-Cox transformation is a statistical technique that transforms a non-normal dependent variable into a normal shape. Since most statistical techniques assume normality, applying a Box-Cox transformation means that you can run a broader number of tests.
4. How will you define the number of clusters in a clustering algorithm ?
Though the Clustering Algorithm is not specified, this question will mostly be asked in reference to K-Means clustering where “K” defines the number of clusters. The objective of clustering is to group similar entities in a way that the entities within a group are similar to each other but the groups are different from each other.
For example, the following image shows three different groups.
[Image: K-Means clustering showing three groups]
Within-cluster sum of squares (WSS) is generally used to measure the homogeneity within a cluster. If you plot WSS for a range of cluster counts, you get a curve generally known as the Elbow Curve: WSS drops sharply at first and then flattens, and the bend (the "elbow") suggests a good value for K.
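As a sketch, WSS can be computed by hand for toy clusters (the points and the `wss` helper below are illustrative, not from any real dataset):

```python
# Hypothetical 1-D points already assigned to two clusters
clusters = {
    "A": [1.0, 1.2, 0.8],
    "B": [5.0, 5.1, 4.9],
}

def wss(groups):
    """Within-cluster sum of squares: squared distance of each
    point to its cluster centroid, summed over all clusters."""
    total = 0.0
    for points in groups.values():
        centroid = sum(points) / len(points)
        total += sum((p - centroid) ** 2 for p in points)
    return total

score = wss(clusters)  # small value -> homogeneous clusters
```

Repeating this for K = 1, 2, 3, … and plotting `score` against K produces the elbow curve described above.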
5. Why does L1 regularization cause parameter sparsity whereas L2 regularization does not ?
Regularization, in statistics or in the field of machine learning, is used to include some extra information in order to solve a problem in a better way. L1 and L2 regularization are generally used to add constraints to optimization problems.
[Image: L1 and L2 regularization constraint regions]
In the example shown above, H0 is a hypothesis. If you observe, in L1 the solution has a high likelihood of hitting the corners of the constraint region, where some coefficients are exactly zero, while in L2 it does not. So in L1 variables are penalized more as compared to L2, which results in sparsity.
In other words, coefficients are squared in the L2 penalty, so the model sees a rapidly growing cost for large coefficients and shrinks them smoothly toward zero rather than setting them exactly to zero.
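A minimal sketch of why this happens, using the standard closed-form update for each penalty applied to a single coefficient (the helper names `l1_prox` and `l2_shrink` are mine): the L1 step sets small coefficients exactly to zero, while the L2 step only scales them toward zero.

```python
def l1_prox(w, lam):
    """Soft-thresholding: the proximal step for an L1 penalty.
    Coefficients with magnitude below lam become exactly zero."""
    if w > lam:
        return w - lam
    if w < -lam:
        return w + lam
    return 0.0

def l2_shrink(w, lam):
    """Ridge-style shrinkage for an L2 penalty.
    Coefficients are scaled toward zero but never reach it."""
    return w / (1.0 + lam)

weights = [3.0, 0.2, -0.05]
l1 = [l1_prox(w, 0.5) for w in weights]   # small weights -> exactly 0
l2 = [l2_shrink(w, 0.5) for w in weights] # all weights stay non-zero
```

The two small coefficients survive L2 shrinkage but are zeroed out by the L1 step, which is exactly the sparsity the answer describes.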
6. What are feature vectors ?
A feature vector is an n-dimensional vector of numerical features that represent some object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics, called features, of an object in a mathematical, easily analyzable way.
7. Differentiate between univariate, bivariate and multivariate analysis ?
These are descriptive statistical analysis techniques which can be differentiated based on the number of variables involved at a given point of time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis.
If the analysis attempts to understand the difference between 2 variables at time as in a scatterplot, then it is referred to as bivariate analysis. For example, analysing the volume of sale and a spending can be considered as an example of bivariate analysis.
Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.
8. Can you cite some examples where a false negative is more important than a false positive ?
Example 1: Assume there is an airport 'A' which has received high-security threats, and based on certain characteristics the staff identify whether a particular passenger can be a threat or not. Due to a shortage of staff, they decide to scan only the passengers predicted as risk-positive by their predictive model. Here a false negative (a real threat predicted as safe) is far costlier than a false positive (a safe passenger flagged for an extra scan).
9. What is the difference between extrapolation and interpolation ?
You have a list of values and when you want to estimate a value from two known values from the list, it’s called interpolation. When you extend known sets of facts or values to approximate a value, it’s called extrapolation.
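A small sketch with a hypothetical `linear_fit` helper: the same line through two known points gives interpolation between them and extrapolation beyond them.

```python
def linear_fit(x1, y1, x2, y2, x):
    """Value of the line through (x1, y1) and (x2, y2) at x.
    If x lies between x1 and x2 this is interpolation;
    outside that range it is extrapolation."""
    slope = (y2 - y1) / (x2 - x1)
    return y1 + slope * (x - x1)

# Known (hypothetical) values: f(2) = 4 and f(6) = 12
inside = linear_fit(2, 4, 6, 12, 4)    # interpolation at x = 4
outside = linear_fit(2, 4, 6, 12, 10)  # extrapolation at x = 10
```

Extrapolation is riskier in practice, since it assumes the trend continues outside the observed range.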
10. What is the purpose of A/B testing ?
The purpose of A/B testing is to generate crucial insights by testing two variables (A and B) of a purpose-driven campaign. The purpose is to identify which variable performs better than the other and achieves a set goal. This paves way for informed decisions.
11. How different is a mean value different from expected value ?
Mean and expected values are similar concepts but are used in different contexts. While expected value is usually referred to in a random variable context, mean value is referred to in the context of a sample population or a probability distribution.
12. If you had to choose between the programming languages R and Python, Which one would you use for text analytics ?
Personally, I would choose Python for text analytics as it offers solid data analysis tools and simple data structures, thanks to its pandas library.
13. What is meant by feature vectors ?
Basically, it is an n-dimensional vector of numerical features that represents some object. In machine learning, we use it to describe an object's characteristics in a form that algorithms can process.
14. Explain Data Science Vs Machine Learning ?
Both machine learning and statistics are part of data science. Machine learning itself implies that the algorithms depend on some data, which we can use as a training set to fine-tune model or algorithm parameters.
In particular, data science also covers:
automating machine learning.
dashboards and BI.
deployment in production mode.
automated, data-driven decisions.
15. What does Machine Learning mean for the Future of Data Science ?
“Data science includes machine learning.”
Basically, machine learning is the ability of a machine to generalize knowledge from data; call it learning. Without data, there is little machines can learn.
To push data science toward greater relevance, a catalyst is needed, and machine learning is that catalyst: its usage is increasing across different industries because it can turn data into decisions through algorithms. My expectation is that, going forward, a basic level of machine learning will become a standard requirement for data scientists.
16. What is meant by logistic regression ?
It is a method for fitting a regression curve, y = f(x) when y is a categorical variable.
It is a classification algorithm. We use it to predict a binary outcome (1 / 0, Yes / No, True / False) given a set of independent variables. To represent binary / categorical outcome, we use dummy variables.
It is a regression model in which the response variable takes categorical values such as True/False or 0/1. Thus, it actually models the probability of a binary response.
To perform logistic regression in R, use the command:
glm( response ~ explanatory_variables , family=binomial)
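The same idea can be sketched in Python with hypothetical, hand-picked coefficients (not fitted to any real data): the model maps a linear combination of the inputs through the sigmoid to a probability.

```python
import math

def sigmoid(z):
    """Logistic function: maps any real number into (0, 1)."""
    return 1 / (1 + math.exp(-z))

def predict_proba(x, b0=-4.0, b1=2.0):
    """Probability of the positive class for input x.
    b0 and b1 are hypothetical coefficients; in practice they
    come from maximum-likelihood fitting (as glm does in R)."""
    return sigmoid(b0 + b1 * x)

p_low = predict_proba(1.0)   # b0 + b1   = -2 -> low probability
p_high = predict_proba(3.0)  # b0 + 3*b1 =  2 -> high probability
```

Thresholding the probability at 0.5 turns the regression output into a binary classification.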
17. What is meant by Poisson regression ?
Data is often collected in counts, so many discrete response variables have counts as possible outcomes. Binomial counts are the number of successes in a fixed number of trials, n, whereas Poisson counts are the number of occurrences of some event in a certain interval of time (or space). Poisson counts have no upper bound, while binomial counts only take values between 0 and n.
To perform Poisson regression in R, we use the command:
glm( response ~ explanatory_variables , family=poisson)
Learn more about Poisson regression and Binomial regression in R.
18. What are Principal components ?
It is a normalized linear combination of the original predictors in a data set. We can write the first principal component in the following way:
Z¹ = Φ¹¹X¹ + Φ²¹X² + Φ³¹X³ + … + Φp¹Xp
where Z¹ is the first principal component and Φ¹¹, …, Φp¹ is the loading vector comprising the loadings of the first principal component. The loadings are constrained so that their sum of squares equals 1, because large loadings could otherwise lead to arbitrarily large variance. The loading vector defines the direction of the principal component (Z¹) along which the data varies the most; it results in the line in p-dimensional space that is closest to the n observations, where closeness is measured by average squared Euclidean distance.
X¹, …, Xp are the normalized predictors, each with mean equal to zero and standard deviation equal to one.
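A sketch of computing the first principal component with NumPy (the synthetic data and variable names are illustrative): the leading eigenvector of the correlation matrix of the normalized predictors is the loading vector, and its squared loadings sum to one.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 100 observations of 3 correlated predictors
X = rng.normal(size=(100, 3))
X[:, 1] += 0.8 * X[:, 0]

# Normalize predictors: mean zero, standard deviation one
Xn = (X - X.mean(axis=0)) / X.std(axis=0)

# Loading vector of the first principal component = leading
# eigenvector of the correlation matrix of the predictors
corr = np.cov(Xn, rowvar=False)
vals, vecs = np.linalg.eigh(corr)   # eigenvalues in ascending order
phi1 = vecs[:, -1]                  # loadings of the first component

Z1 = Xn @ phi1                      # scores of the first component
```

The unit-norm constraint on the loadings falls out automatically, since `eigh` returns normalized eigenvectors.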
19. Name the methods for Principal component analysis and explain them ?
There are two methods :
a. Spectral decomposition: examines the covariances/correlations between variables.
b. Singular value decomposition: examines the covariances/correlations between individuals. In R, we use the function princomp() for the spectral approach, and the functions prcomp() and PCA() for the singular value decomposition approach.
20. Explain types of statistical data ?
Whenever we are working with statistics it’s very important to recognize the different types of data:
1. numerical (discrete and continuous), and
2. categorical.
Data are the actual pieces of information that you collect through your study. Most data fall into one of these two groups: numerical or categorical.
21. What is correlation in R ?
Basically, it is a type of technique. Also, we use it for investigating the relationship between two quantitative, continuous variables.
Positive correlation – In this, both variables increase or decrease together.
Negative correlation – In this as one variable increases, so the other decreases.
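A minimal pure-Python sketch of the Pearson correlation coefficient (the `pearson` helper is mine; R computes the same quantity with cor()): positive when the variables move together, negative when one rises as the other falls.

```python
def pearson(x, y):
    """Pearson correlation coefficient between two numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

up = pearson([1, 2, 3, 4], [2, 4, 6, 8])     # positive correlation
down = pearson([1, 2, 3, 4], [8, 6, 4, 2])   # negative correlation
```

Perfectly linear data gives the extreme values +1 and -1; real data falls somewhere in between.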
22. What is meant by lattice graphs ?
The lattice package, written by Deepayan Sarkar, provides better defaults than base R graphics and adds the ability to display multivariate relationships. The package supports the creation of trellis graphs: graphs that display a variable, or the relationship between variables, conditioned on one or more other variables.
The typical format is graph_type(formula, data), where graph_type is one of the lattice plotting functions (such as xyplot or histogram) and the formula specifies the variable(s) to display and any conditioning variables.
23. Compare SAS, R, Python, Perl ?
a) SAS is commercial software. It is expensive and still beyond reach for most professionals (in an individual capacity). However, it holds the highest market share in private organizations, so unless you are in an organization which has invested in SAS, it might be difficult to access. R and Python, on the other hand, are free and can be downloaded by anyone.
b) SAS is easy to learn and provides an easy option (PROC SQL) for people who already know SQL. Even otherwise, it has a good, stable GUI interface in its repository. In terms of resources, there are tutorials available on the websites of various universities, and SAS has comprehensive documentation. There are certifications from SAS training institutes, but they again come at a cost. R has the steepest learning curve among the three languages listed here: it requires you to learn and understand coding, and simple procedures can take longer code. Python is known for its simplicity in the programming world, and this remains true for data analysis as well.
c) SAS has decent functional graphical capabilities. However, it is just functional. Any customization on plots are difficult and requires you to understand intricacies of SAS Graph package. R has the most advanced graphical capabilities among the three. There are numerous packages which provide you advanced graphical capabilities. Python capabilities will lie somewhere in between, with options to use native libraries (matplotlib) or derived libraries (allowing calling R functions).
d) All three ecosystems have the basic and most needed functions available. This feature only matters if you are working on the latest technologies and algorithms. Due to their open nature, R and Python get the latest features quickly (R more so compared to Python); SAS, on the other hand, updates its capabilities in new version roll-outs. Since R has been widely used in academia in the past, development of new techniques there is fast.
24. How do Data Scientists Code in R ?
R is a popular open-source programming environment for statistics and data mining. The good news is that it is easily integrated into ML Studio. I have a lot of friends using functional languages for machine learning, such as F#. It's pretty clear, however, that R is dominant in this space: polls and surveys of data miners show that R's popularity has increased substantially in recent years. R was created by Ross Ihaka and Robert Gentleman at the University of Auckland, New Zealand, and is currently developed by the R Development Core Team, of which Chambers is a member. R is named partly after the first names of its first two authors. R is a GNU project and is written primarily in C and Fortran.
25. What are the most important machine learning techniques ?
In Association rule learning computers learn how to find relations among observations (e.g. if A and B happen, then C will also happen).
In Clustering, computers learn how to partition observations into subsets, so that each partition is made up of similar observations according to some well-defined metric. Algorithms like K-Means and DBSCAN belong to this class.
In Density estimation, computers learn how to find statistical values that describe data. Algorithms like Expectation Maximization belong to this class.
26. Can you provide an example of features extraction ?
Let’s suppose that we want to perform machine learning on textual files. The first step is to extract meaningful feature vectors from a text. A typical representation is the so called bag of words where:
1. Each word w in the text collection is associated with a unique integer wordId(w) assigned to it.
2. For each document i, the number of occurrences of each word w is computed and this value is stored in a matrix M(i, w). Please note that M is typically a sparse matrix, because when a word is not present in a document, its count will be zero.
Numpy, Scikit-learn and Spark all support sparse vectors. Let's see an example where we load a dataset made up of Usenet articles, where the alt.atheism category is considered and the collection of text documents is converted into a matrix of token counts. We then print wordId('man').
from sklearn.datasets import fetch_20newsgroups
Two additional observations can be highlighted here: first, the training set, the validation set and the test set are all sampled from the same gold set, but those samples are independent. Second, it has been assumed that the learned model can be described by means of two different functions f and h combined using a set of hyper-parameters.
Unsupervised machine learning consists of test and application phases only, because there is no model to be learned a priori. In fact, unsupervised algorithms adapt dynamically to the observed data.
27. Why are vectors and norms used in machine learning ?
Objects such as movies, songs and documents are typically represented by means of vectors of features. Those features are a synthetic summary of the most salient and discriminative characteristics of the objects. Given a collection of vectors (the so-called vector space) V, a norm on V is a function p: V → ℝ satisfying the following properties: for all complex numbers a and all u, v ∈ V,
1. p(av) = |a| p(v)
2. p(u + v) ≤ p(u) + p(v)
3. if p(v) = 0 then v is the zero vector
The intuitive notion of length for a vector is captured by the Euclidean norm
||x||₂ = √(x₁² + … + xₙ²)
More generally, we have the p-norm
||x||p = (|x₁|^p + … + |xₙ|^p)^(1/p)
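A small sketch of these norms in plain Python (the `p_norm` helper is mine): p = 2 gives the familiar Euclidean length, p = 1 the Manhattan length.

```python
def p_norm(x, p=2):
    """The p-norm of a vector: (sum |x_i|^p)^(1/p)."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)

v = [3.0, 4.0]
euclidean = p_norm(v, 2)   # the 3-4-5 right triangle
manhattan = p_norm(v, 1)   # sum of absolute components
```

In machine learning these norms measure distances between feature vectors and appear as the L2 and L1 penalties discussed earlier.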
28. What are Numpy, Scipy and Spark essential datatypes ?
Numpy provides efficient support for storing vectors and matrices and for linear algebra operations. For instance, dot(a, b[, out]) is the dot product of two vectors, while inner(a, b) and outer(a, b[, out]) are respectively the inner and outer products.
Scipy provides support for sparse matrices and vectors with multiple storage strategies in order to save space when dealing with zero entries. In particular, the COOrdinate (COO) format stores each non-zero value together with its coordinates, while the Compressed Sparse Column (CSC) format satisfies the relationship M[row_ind[k], col_ind[k]] = data[k].
Spark has many native datatypes for local and distributed computations. The primary data abstraction is a distributed collection of items called a Resilient Distributed Dataset (RDD). RDDs can be created from Hadoop InputFormats or by transforming other RDDs. Numpy arrays, Python lists and Scipy CSC sparse matrices are all supported. In addition, MLlib, the Spark library for machine learning, supports SparseVector and LabeledPoint, i.e. local vectors, either dense or sparse, associated with a label/response.
import numpy as np
from scipy.sparse import csr_matrix
M = csr_matrix([[4, 1, 0], [4, 0, 3], [0, 0, 1]])
from pyspark.mllib.linalg import SparseVector
from pyspark.mllib.regression import LabeledPoint
label = 0.0
point = LabeledPoint(label, SparseVector(3, [0, 2], [1.0, 3.0]))
textRDD = sc.textFile("README.md")
print(textRDD.count())  # count items in the RDD
29. Can you provide examples for other computations in Spark ?
The first code fragment is an example of map-reduce, where we want to find the line with the most words in a text. First each line is mapped into the number of words it contains; then those numbers are reduced and the maximum is taken. Pretty simple: a single line of code stands here for something that requires hundreds of lines in other parallel paradigms such as Hadoop.
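A sketch of that fragment, assuming an active SparkContext `sc`; the equivalent local computation is included so the example also runs without Spark.

```python
# Lines of text; locally a plain list stands in for the RDD
lines = ["one", "two words", "three little words"]

# Spark version (requires an active SparkContext `sc`):
# longest = sc.parallelize(lines) \
#             .map(lambda line: len(line.split())) \
#             .reduce(lambda a, b: max(a, b))

# Equivalent local map/reduce over the same data
longest = max(len(line.split()) for line in lines)
```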
Spark supports two types of operations: transformations, which create a new RDD dataset from an existing one, and actions, which return a value to the driver program after running a computation on the dataset. All transformations in Spark are lazy because they postpone computation as much as possible until the results are really needed by the program. This allows Spark to run efficiently — for example the compiler can realize that an RDD created through map will be used in a reduce and return only the result of the reduce to the driver, rather than the larger mapped dataset. Intermediate results can be persisted and cached.
30. How does data cleaning play a vital role in analysis ?
Data cleaning can help in analysis because:
Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
Data Cleaning helps to increase the accuracy of the model in machine learning.
It is a cumbersome process because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data generated by these sources.
It might take up to 80% of the total time just to clean the data, making it a critical part of the analysis task.
31. What is Cluster Sampling ?
Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
32. What are Eigenvectors and Eigenvalues ?
Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.
Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
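A small NumPy sketch (the matrix is a made-up example): applying the transformation to an eigenvector only scales it by the matching eigenvalue, without rotating it.

```python
import numpy as np

# A symmetric 2x2 transformation (hypothetical example)
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# Eigenvalues are the stretch factors; eigenvectors are the
# directions the transformation only scales, never rotates
vals, vecs = np.linalg.eig(A)

v = vecs[:, 0]    # first eigenvector
lam = vals[0]     # matching eigenvalue
```

For a correlation or covariance matrix, as in PCA, these directions are the axes of greatest variance in the data.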
33. Can you explain the difference between a Validation Set and a Test Set ?
Validation set can be considered as a part of the training set as it is used for parameter selection and to avoid overfitting of the model being built.
On the other hand, a test set is used for testing or evaluating the performance of a trained machine learning model.
In simple terms, the differences can be summarized as; training set is to fit the parameters i.e. weights and test set is to assess the performance of the model i.e. evaluating the predictive power and generalization.
34. Explain cross-validation ?
Cross validation is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. Mainly used in backgrounds where the objective is forecast and one wants to estimate how accurately a model will accomplish in practice.
The goal of cross-validation is to term a data set to test the model in the training phase (i.e. validation data set) in order to limit problems like overfitting and get an insight on how the model will generalize to an independent data set.
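A minimal pure-Python sketch of the k-fold split behind cross-validation (the `k_fold_indices` helper is mine): each observation serves exactly once in a validation fold while the rest form the training set.

```python
def k_fold_indices(n, k):
    """Split indices 0..n-1 into k folds; each fold serves once
    as the validation set, the remaining indices as training."""
    folds = [list(range(i, n, k)) for i in range(k)]
    splits = []
    for i in range(k):
        valid = folds[i]
        train = [j for j in range(n) if j not in valid]
        splits.append((train, valid))
    return splits

splits = k_fold_indices(6, 3)  # 3 (train, validation) pairs
```

Averaging a model's score over the k validation folds gives the cross-validated estimate of how it will generalize.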
35. A certain couple tells you that they have two children, at least one of which is a girl. What is the probability that they have two girls ?
In the case of two children, there are 4 equally likely possibilities
BB, BG, GB and GG;
where B = Boy and G = Girl and the first letter denotes the first child.
From the question, we can exclude the first case of BB. Thus, from the remaining 3 possibilities of BG, GB & GG, we have to find the probability of the case with two girls.
Thus, P(Having two girls given one girl) = 1 / 3
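The answer can be checked with a quick simulation (standard library only; the family-generation scheme is illustrative):

```python
import random

random.seed(1)

# Each child is independently a boy or a girl with equal chance
families = [(random.choice("BG"), random.choice("BG"))
            for _ in range(100000)]

# Condition on "at least one girl", then count two-girl families
at_least_one_girl = [f for f in families if "G" in f]
both_girls = [f for f in at_least_one_girl if f == ("G", "G")]
estimate = len(both_girls) / len(at_least_one_girl)
```

The estimate lands close to 1/3, matching the enumeration above.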
36. Why is it mandatory to clean a data set ?
Cleaning data makes it into a format that allows data scientists to work on it. This is crucial because if data sets are not cleaned, it may lead to biased information that can alter business decisions. Over 80% of the time is spent by data scientists to clean data.
37. What are the steps involved in analytics projects ?
Any analytics problem involves the following steps:
Understanding a business problem
Data preparation for modeling
Running the model and analysis of results
Model validation using new data sets
Model implementation and tracking of results for a set period of time
What do you understand by the term recommender systems?
Recommender systems are part of an information filtering system that is used to predict and anticipate the ratings or preferences a user is most likely to give to a product or service. You can see recommender systems at work on eCommerce websites, movie websites, research articles, music apps, news and more.
38. For linear regression, what are some of the assumptions a data scientist is most likely to make ?
Some of the assumptions include the following:
A linear relationship between the predictors and the response
Independence of the errors (no autocorrelation)
Homoscedasticity (constant variance of the residuals)
Normally distributed residuals
No or little multicollinearity
How do you find the correlation between a categorical variable and a continuous variable?
It is possible to find the correlation between a categorical variable and a continuous variable using the analysis of covariance technique.
39. What is the difference between skewed and uniform distribution ?
When the observations in a dataset are spread equally across the range of the distribution, it is referred to as a uniform distribution; there are no clear peaks in a uniform distribution. Distributions that have more observations on one side of the graph than the other are referred to as skewed distributions. Distributions with fewer observations on the left (towards lower values) are said to be skewed left, and distributions with fewer observations on the right (towards higher values) are said to be skewed right.