2020 Latest Data Science Interview Questions And Answers
Data Science Interview Questions And Answers
What Is Data Science?
Answer: Data Science is a new area of specialization being developed by the Department of Mathematics and Statistics at Bowling Green State University. This field integrates math, statistics, and computer science to prepare students for the rapidly expanding demand for data scientists. Students seeking to pursue studies in Data Science should declare a mathematics major as entering freshmen, in anticipation of completing the specialization in years three and four.
What are Interpolation and Extrapolation?
Answer: The terms interpolation and extrapolation are extremely important in any statistical analysis. Extrapolation is the determination or estimation of a value by extending a known set of values or facts beyond the region that is already known. In other words, it is the technique of inferring something using the data that is available.
Interpolation, on the other hand, is the method of determining a value that falls within a known set or sequence of values. This is especially useful when you have data at the two extremities of a region but not enough data points at a specific location in between; interpolation lets you estimate the value that you need.
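As a quick illustration, the following is a minimal sketch (assuming a small made-up set of sample points) that uses NumPy to interpolate inside the observed range and to extrapolate beyond it with a simple polynomial fit.
Code
import numpy as np

# Known samples of some signal y measured at points x
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([0.0, 0.8, 0.9, 0.1, -0.8])

# Interpolation: estimate y at a point inside the observed range
y_interp = np.interp(2.5, x, y)      # linear interpolation between x=2 and x=3

# Extrapolation: fit a simple model and evaluate it outside the observed range
coeffs = np.polyfit(x, y, deg=2)     # quadratic fit to the known data
y_extrap = np.polyval(coeffs, 6.0)   # predict at x=6, beyond the data

print(y_interp, y_extrap)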
Do gradient descent methods always converge to the same point?
Answer: No, they do not, because in some cases the algorithm reaches a local minimum or local optimum rather than the global optimum. The result depends on the data and the starting conditions.
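A minimal sketch (on a made-up non-convex function) showing how gradient descent can end up at different minima depending on the starting point:
Code
import numpy as np

def f(x):
    # A simple non-convex function with two local minima
    return x**4 - 3 * x**2 + x

def grad(x):
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=1000):
    x = x0
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Different starting points converge to different minima
print(gradient_descent(x0=-2.0))  # ends near the left minimum
print(gradient_descent(x0=2.0))   # ends near the right minimum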
What is meant by lattice?
Answer: It is a powerful and elegant high-level data visualization system for R, inspired by Trellis graphics. It is designed with an emphasis on multivariate data and allows easy conditioning to produce “small multiple” plots.
Briefly explain the methods used to calculate distance?
Answer:
- a. Euclidean distance – This is the classical method for computing the distance between two objects A and B in Euclidean space. In this geometry, the distance between two points is found by traveling along the straight line connecting them; the calculation is, in essence, the Pythagorean Theorem.
- b. Taxicab or Manhattan distance – It is similar to Euclidean distance, with one difference: the distance is computed by traversing the vertical and horizontal lines of a grid-based system.
- c. Minkowski distance – This distance is a metric on Euclidean space and can be considered a generalization of both the Euclidean and Manhattan distances.
- d. Cosine similarity – It is a measure that calculates the cosine of the angle between two vectors. This metric measures orientation rather than magnitude, so it can be used, for example, to compare documents by the angle between their term vectors.
- e. Mahalanobis distance – It measures the distance between a point and a distribution (or between two groups of objects), taking the covariance of the data into account. It can also be represented graphically, which helps understanding, and it is useful for classification and clustering. (A short sketch of these distance computations follows below.)
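As a quick reference, here is a minimal sketch (using SciPy, with made-up vectors) of how these distances can be computed:
Code
import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 0.0, 3.0])

print(distance.euclidean(a, b))        # straight-line distance
print(distance.cityblock(a, b))        # Manhattan / taxicab distance
print(distance.minkowski(a, b, p=3))   # Minkowski distance of order 3
print(distance.cosine(a, b))           # cosine distance = 1 - cosine similarity

# Mahalanobis distance needs the inverse covariance matrix of the data
data = np.random.default_rng(0).normal(size=(100, 3))
inv_cov = np.linalg.inv(np.cov(data, rowvar=False))
print(distance.mahalanobis(a, b, inv_cov))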
How can outlier values be treated?
Answer: Outlier values can be identified by using univariate or other graphical analysis methods. If there are only a few outliers, they can be assessed individually, but for a large number of outliers the values can be substituted with either the 99th or the 1st percentile values. Note that not all extreme values are outliers. The most common ways to treat outlier values are:
1) To cap the value and bring it within a range
2) To simply remove the value.
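A minimal sketch (assuming a made-up numeric series) of the capping approach, clipping values to the 1st and 99th percentiles with NumPy:
Code
import numpy as np

rng = np.random.default_rng(42)
values = np.append(rng.normal(loc=50, scale=5, size=1000), [250.0, -90.0])  # two artificial outliers

low, high = np.percentile(values, [1, 99])
capped = np.clip(values, low, high)                 # treatment 1: bring outliers within a range
kept = values[(values >= low) & (values <= high)]   # treatment 2: drop them entirely

print(values.max(), capped.max(), kept.max())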
What are an Eigenvalue and Eigenvector?
Answer: Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
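A minimal sketch (with a made-up 2×2 matrix) of computing eigenvalues and eigenvectors with NumPy:
Code
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigenvalues, eigenvectors = np.linalg.eig(A)
print(eigenvalues)    # strength of the transformation along each direction
print(eigenvectors)   # each column is an eigenvector (a direction that is only scaled by A)

# Check the defining property A v = lambda v for the first pair
v = eigenvectors[:, 0]
print(np.allclose(A @ v, eigenvalues[0] * v))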
What are distance measures in R statistics?
Answer: Distance Measures (Similarity, dissimilarity, correlation).
Basically, these are mathematical approaches for measuring the distance between objects. Computing a distance lets us compare the objects, and the comparison can be expressed from three different standpoints:
- 1. Similarity
- 2. Dissimilarity
- 3. Correlation
Similarity – a measure that ranges from 0 to 1, i.e. [0, 1]
Dissimilarity – a measure that ranges from 0 to infinity, i.e. [0, ∞)
Correlation – a measure that ranges from -1 to +1, i.e. [-1, +1]
What is cross-validation and what is overfitting?
Answer: Learning a model on a set of examples and testing it on the same set is a logical mistake: the model would have no errors on that test set but would almost certainly perform poorly on real application data. This problem is called overfitting, and it is the reason why the labeled data is typically split into independent sets for training, validation, and testing. An example of a random split is reported in the code section below, where a toy dataset with diabetes data is randomly split into two parts: the training set and the test set.
As discussed in the previous question, given a family of learned models, the validation set is used for estimating the best hyper-parameters. However, with this strategy there is still the risk that the hyper-parameters overfit a particular validation set.
The solution to this problem is called cross-validation. The idea is simple: the training data is split into k smaller sets called folds, the model is learned on k - 1 folds, and the remaining fold is used for validation. This process is repeated in a loop and the metrics achieved on each iteration are averaged. An example of cross-validation is reported in the code below, where our toy dataset is classified via SVM and accuracy is computed via cross-validation. SVM is a classification technique and accuracy is a quality measure, both of which are discussed later.
Stratified k-fold is a variation of k-fold in which each fold contains approximately the same percentage of samples of each target class as the complete set.
Code
import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import train_test_split, cross_val_score

diabetes = datasets.load_diabetes()
X_train, X_test, y_train, y_test = train_test_split(
    diabetes.data, diabetes.target, test_size=0.2, random_state=0)
print(X_train.shape, y_train.shape)  # training set: 80% of the samples
print(X_test.shape, y_test.shape)    # test set: 20% of the samples

# The diabetes target is continuous; binarize it (above/below the median)
# so that SVC has class labels to work with and accuracy is meaningful.
y_class = (diabetes.target > np.median(diabetes.target)).astype(int)

clf = svm.SVC(kernel='linear', C=1)
scores = cross_val_score(clf, diabetes.data, y_class, cv=4)  # 4-fold cross-validation
print(scores)
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std()))
What do you understand by the job description of a data scientist?
Answer: A Data Scientist has to use statistical methods, including mix modeling and predictive response modeling, as well as optimization techniques, to meet client business needs.
Furthermore, they have to develop and install statistical tools and help build predictive models; these models support clients in customer marketing and demand-generation initiatives.
Moreover, a Data Scientist collaborates with internal consulting teams to set analytic objectives, approaches, and work plans, provides programming and analytic support to internal consulting, and applies statistical procedures using SAS and Microsoft Office.
Frequently Asked Data Science Interview Questions
These are the most frequently asked Data Science Interview Questions that you will most probably face in any Data Science Interview.
What is the need for principal component analysis?
Answer: The main aim of principal component analysis is to reveal hidden structure in a data set. In doing so, we may be able to:
a. Identify how different variables work together.
b. Reduce the dimensionality of the data.
c. Decrease redundancy in the data.
d. Filter some of the noise in the data.
e. Compress the data.
f. Prepare the data for further analysis using other techniques. (A short sketch follows below.)
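A minimal sketch (on made-up correlated data) of dimensionality reduction with scikit-learn's PCA:
Code
import numpy as np
from sklearn.decomposition import PCA

# Made-up data: 100 samples, 5 correlated features
rng = np.random.default_rng(0)
base = rng.normal(size=(100, 2))
X = np.hstack([base, base @ rng.normal(size=(2, 3))])  # 3 extra features built from the first 2

pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)       # dimensionality reduced from 5 to 2

print(X.shape, X_reduced.shape)
print(pca.explained_variance_ratio_)   # how much variance each component captures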
Can you cite some examples where a false positive is more important than a false negative?
Answer: Let us first understand what false positives and false negatives are. False positives are the cases where you wrongly classified a non-event as an event a.k.a Type I error. False negatives are the cases where you wrongly classify events as non-events, a.k.a Type II error.
Example 1: In the medical field, assume you have to give chemotherapy to patients. Assume a patient comes to that hospital and he is tested positive for cancer, based on the lab prediction but he actually doesn’t have cancer. This is a case of false positive. Here it is of the utmost danger to start chemotherapy on this patient when he actually does not have cancer. In the absence of cancerous cell, chemotherapy will do certain damage to his normal healthy cells and might lead to severe diseases, even cancer.
Example 2: Let’s say an e-commerce company decided to give a $1000 gift voucher to customers whom they expect to purchase at least $10,000 worth of items. They send the free voucher directly to 100 customers without any minimum purchase condition because they assume a profit of at least 20% on sold items above $10,000. Now the issue arises if the $1000 gift vouchers are sent to customers who will not actually purchase anything but were predicted to make $10,000 worth of purchases.
What are the requirements of a Data Science program?
Answer: The Data Science specialization requires three semesters of calculus (MATH 1310, MATH 2320, and MATH 2330 or MATH 2350), linear algebra (MATH 3320), introduction to programming (CS 2010), probability and statistics I (MATH 4410), and regression analysis (STAT 4020). In addition, the requirements include:
a) MATH 2950 Introduction to Data Science. This one-hour seminar would introduce freshmen students to a variety of data-science applications and give them an introduction to programming.
b) MATH 3430 Computing with Data. This course will focus on the data wrangling and data exploration computational skills in the context of a modern computing language such as Python or R.
c) MATH 3440 Statistical Programming. This course will focus on writing scripts and functions using a modern statistical language such as R.
d) MATH 4440 Statistical Learning. This course deals with modern methods for modeling data including a variety of supervised and unsupervised methods.
In addition, the student will be asked to choose two of the following seven courses:
a) MATH 4320 Linear Algebra with Applications
b) MATH 4420 Probability and Statistics II
c) MATH 4470 Exploratory Data Analysis
d) CS 2020 Object-Oriented Programming
e) STAT 4440 Data Mining in Business Analytics
f) CS 4400 Optimization Techniques
g) CS 4620 Database Management Systems
What do you understand by the term Normal Distribution?
Answer: It is the distribution of a continuous variable spread across a normal curve, i.e. in the shape of a bell curve. It is a continuous probability distribution that is very useful in statistics. It is the most common distribution curve, and when variables follow it, analyzing them and their relationships becomes much easier.
The normal distribution curve is symmetrical. By the Central Limit Theorem, the distribution of sample means from a non-normal distribution approaches the normal distribution as the sample size increases, which makes the theorem easy to apply. This helps to make sense of random data by imposing an order and interpreting the results with a bell-shaped graph.
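A minimal sketch (with a made-up exponential population) illustrating the Central Limit Theorem mentioned above: the means of repeated samples are approximately normally distributed even though the population itself is not.
Code
import numpy as np

rng = np.random.default_rng(0)

# Draw from a clearly non-normal (exponential) population
population = rng.exponential(scale=2.0, size=100_000)

# Means of many samples of size 50 become approximately normally distributed
sample_means = np.array([rng.choice(population, size=50).mean() for _ in range(2000)])

print(population.mean(), sample_means.mean())  # both close to 2.0
print(sample_means.std())                      # roughly scale / sqrt(50)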
What is a Linear Regression?
Answer: Linear regression is a statistical technique where the score of a variable Y is predicted from the score of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
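A minimal sketch (on made-up data) of fitting a linear regression with scikit-learn, predicting Y from a single predictor X:
Code
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: Y depends roughly linearly on the predictor X
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(100, 1))                          # predictor variable
y = 3.0 * X.ravel() + 5.0 + rng.normal(scale=2.0, size=100)    # criterion variable with noise

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)   # should be close to 3.0 and 5.0
print(model.predict([[4.0]]))          # predicted Y for X = 4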
How will you define the number of clusters in a clustering algorithm?
Answer: Though the Clustering Algorithm is not specified, this question is mostly in reference to K-Means clustering where “K” defines the number of clusters. The objective of clustering is to group similar entities in a way that the entities within a group are similar to each other but the groups are different from each other.
For example, an image typically shown here depicts three distinct groups (clusters).
Within Sum of Squares (WSS) is generally used to measure the homogeneity within a cluster. If you plot WSS against a range of values for the number of clusters, the curve drops sharply at first and then flattens; the “elbow” point where this bend occurs is a good choice for the number of clusters.
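A minimal sketch (with made-up blob data) of computing WSS (scikit-learn's inertia_) for a range of K values in order to locate the elbow:
Code
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Made-up data with 3 real groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# Within-cluster sum of squares (inertia) for K = 1..8
for k in range(1, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    print(k, round(km.inertia_, 1))   # look for the 'elbow' where the drop flattens out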
Why does data cleaning play a vital role in analysis?
Answer: Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process because, as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data they generate. Cleaning can take up to 80% of the total time, making it a critical part of the analysis task.
What is logistic regression? Or State an example when you have used logistic regression recently?
Answer: Logistic Regression often referred to as the logit model is a technique to predict the binary outcome from a linear combination of predictor variables. For example, if you want to predict whether a particular political leader will win the election or not. In this case, the outcome of prediction is binary i.e. 0 or 1 (Win/Lose). The predictor variables here would be the amount of money spent on election campaigning of a particular candidate, the amount of time spent in campaigning, etc.
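A minimal sketch (with made-up campaign figures) of the election example, fitting a logistic regression in scikit-learn:
Code
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up campaign data: [money spent, hours campaigned]
X = np.array([[10, 200], [50, 500], [5, 100], [80, 700], [30, 350], [60, 450]])
y = np.array([0, 1, 0, 1, 0, 1])   # binary outcome: 1 = win, 0 = lose

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict([[40, 400]]))          # predicted class for a new candidate
print(model.predict_proba([[40, 400]]))    # estimated probabilities for lose/win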
Name Methods to Calculate Distance Measure?
Answer: Euclidean distance
Taxicab or Manhattan distance
Minkowski
Cosine similarity
Mahalanobis distance
Pearson’s Correlation Coefficient (discussed in a later question)
What is Collaborative filtering?
Answer: Collaborative filtering is the process of filtering used by most recommender systems to find patterns or information by combining viewpoints, various data sources, and multiple agents.
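A minimal sketch (with a made-up user-item rating matrix) of user-based collaborative filtering: items are recommended by weighting other users' ratings by how similar those users are to the target user.
Code
import numpy as np

# Made-up user-item rating matrix (rows = users, columns = items, 0 = not rated)
ratings = np.array([
    [5, 4, 0, 1],
    [4, 5, 1, 0],
    [1, 0, 5, 4],
    [0, 1, 4, 5],
], dtype=float)

def cosine_sim(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

target_user = 0
# Similarity of the target user to every other user
sims = np.array([cosine_sim(ratings[target_user], ratings[u]) for u in range(len(ratings))])
sims[target_user] = 0  # ignore self-similarity

# Predict scores for items, weighting every user's ratings by their similarity
predicted = sims @ ratings / sims.sum()
unrated = ratings[target_user] == 0
print(np.where(unrated)[0], predicted[unrated])   # candidate items and their predicted scores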
How machine learning is deployed in real-world scenarios?
Answer: Here are some of the scenarios in which machine learning finds applications in the real world:
Ecommerce: Understanding the customer churn, deploying targeted advertising, remarketing
Search engine: Ranking pages depending on the personal preferences of the searcher
Finance: Evaluating investment opportunities & risks, detecting fraudulent transactions
Medicine: Designing drugs depending on the patient’s history and needs
Robotics: Machine learning for handling situations that are out of the ordinary
Social media: Understanding relationships and recommending connections
Information extraction: framing questions to get answers from databases over the web
Can you cite some examples where both false positive and false negatives are equally important?
Answer: In the banking industry, giving loans is the primary source of making money, but at the same time, if your repayment rate is not good, you will not make any profit; rather, you will risk huge losses.
Banks don’t want to lose good customers and at the same point in time, they don’t want to acquire bad customers. In this scenario, both the false positives and false negatives become very important to measure.
How can you generate a random number between 1 – 7 with only a die?
Answer: Any die has six sides from 1-6. There is no way to get seven equal outcomes from a single rolling of a die. If we roll the die twice and consider the event of two rolls, we now have 36 different outcomes.
To get our 7 equal outcomes we have to reduce this 36 to a number divisible by 7. We can thus consider only 35 outcomes and exclude the other one.
A simple scenario can be to exclude the combination (6,6), i.e., to roll the die again if 6 appears twice.
All the remaining combinations from (1,1) till (6,5) can be divided into 7 parts of 5 each. This way all the seven sets of outcomes are equally likely.
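A minimal sketch (in Python) of the rejection scheme described above: re-roll on (6,6) and map the remaining 35 equally likely outcomes into 7 groups of 5.
Code
import random

def roll_die():
    return random.randint(1, 6)

def uniform_1_to_7():
    while True:
        a, b = roll_die(), roll_die()
        if (a, b) == (6, 6):
            continue                   # reject this one combination and roll again
        index = (a - 1) * 6 + (b - 1)  # 0..34, i.e. 35 equally likely outcomes
        return index % 7 + 1           # 35 outcomes split into 7 groups of 5

# Quick check that the outcomes are roughly uniform
counts = [0] * 7
for _ in range(70_000):
    counts[uniform_1_to_7() - 1] += 1
print(counts)   # each count should be close to 10,000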
What are the essential skills and training needed in data Science?
Answer:
- Communication and storytelling
- Statistics, machine learning, and optimization
- Big data and cloud computing
- Business and domain knowledge
- Visualization and familiarity with the relevant toolboxes
- Programming and CS fundamentals
What is K-means? How can you select K for K-means?
Answer: K-means clustering can be termed the most basic unsupervised learning algorithm. It is a method of partitioning data into a certain number of clusters, K, and is used for grouping data in order to find similarity within it.
It works by defining K centers, one for each cluster, with K being predefined. The K points are selected at random as cluster centers, and each object is assigned to its nearest cluster center. The objects within a cluster are as similar to one another as possible and as different as possible from the objects in other clusters. K-means clustering works very well for large data sets. To select K, the elbow method described earlier is commonly used: plot the within-cluster sum of squares for a range of K values and choose the point where the curve bends.
What is Survival Analysis?
Answer: a. Models time to an event (especially failure); used in medicine, biology, actuarial science, finance, engineering, sociology, etc.
b. Able to account for censoring
c. Able to compare between 2+ groups
d. Able to assess the relationship between covariates and survival time
Install-Package
install.packages("survival")
Syntax
Surv(time,event)
survfit(formula)
Description of the parameters −
1. Time is the follow-up time until the event occurs.
2. Generally, the event indicates the status of occurrence of the expected event.
3. The formula is the relationship between the response (the Surv object) and the predictor variables.
What does the future hold for data scientists?
Answer: Over the next 5 years, data scientists will develop the ability to utilize all sorts of data in real time. This is what the future demands, and it will spark the emergence of new data science paradigms.
Moreover, more data will be used to drive key business decisions. Innovations like deep learning will enable accurate predictions and decision making. Modern applications have also brought new statistical paradigms to the fore.
What is Unsupervised learning?
Answer: Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses.
Algorithms: Clustering, Anomaly Detection, Neural Networks, and Latent Variable Models
E.g., given a basket of unlabeled fruits, a clustering algorithm might group them into “fruits with soft skin and lots of dimples”, “fruits with shiny hard skin”, and “elongated yellow fruits”.
During analysis, how do you treat missing values?
Answer: After identifying the variables with missing values, the extent of the missing values is assessed. If any patterns are identified, the analyst should concentrate on them, as they could lead to interesting and meaningful business insights.
If no patterns are identified, the missing values can be substituted with mean or median values (imputation) or simply ignored. Another option is assigning a default value, which can be the mean, minimum, or maximum value; understanding the data is important before choosing.
If it is a categorical variable, a default value is assigned to the missing entries. If the data follows a normal distribution, the mean value is a reasonable replacement.
If 80% of the values for a variable are missing, then you can answer that you would drop the variable instead of treating the missing values.
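A minimal sketch (with a made-up pandas DataFrame) of the treatments described above: median imputation for a numeric column, a default value for a categorical one, and dropping columns that are mostly missing.
Code
import numpy as np
import pandas as pd

# Made-up data with missing values
df = pd.DataFrame({
    "age": [25, np.nan, 35, 40, np.nan, 30],
    "city": ["Delhi", "Mumbai", None, "Delhi", "Pune", None],
})

df["age"] = df["age"].fillna(df["age"].median())   # numeric: impute with the median
df["city"] = df["city"].fillna("Unknown")          # categorical: assign a default value

# Drop any column where more than 80% of the values are missing
df = df.dropna(axis=1, thresh=int(np.ceil(0.2 * len(df))))
print(df)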
What is Pearson’s Correlation Coefficient?
Answer: This is a statistical technique that provides a number telling how strongly or weakly two objects are related. It is not a measure of distance but a measure of the association between the objects. The correlation value is represented by the small letter ‘r’, and ‘r’ can range from -1 to +1.
If r is close to 0, it means there is no relationship between the objects.
If r is positive, it means that as one object gets larger the other gets larger.
If r is negative that means one gets larger, the other gets smaller.
Interpretation of the r value:
+.70 or higher: Very strong positive relationship
+.40 to +.69: Strong positive relationship
+.30 to +.39: Moderate positive relationship
+.20 to +.29: Weak positive relationship
+.01 to +.19: No or negligible relationship
0: No relationship
-.01 to -.19: No or negligible relationship
-.20 to -.29: Weak negative relationship
-.30 to -.39: Moderate negative relationship
-.40 to -.69: Strong negative relationship
-.70 or lower: Very strong negative relationship
In today’s world, there are several methods for computing the correlation measure ‘r’; of these, Pearson’s correlation coefficient is the most commonly used.
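A minimal sketch (with made-up paired data) of computing Pearson's r with NumPy, both via corrcoef and via its defining formula:
Code
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])   # roughly y = 2x, so r should be close to +1

r = np.corrcoef(x, y)[0, 1]                # Pearson's correlation coefficient
print(round(r, 3))

# Equivalent explicit formula: covariance divided by the product of standard deviations
r_manual = np.cov(x, y)[0, 1] / (np.std(x, ddof=1) * np.std(y, ddof=1))
print(round(r_manual, 3))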