# Data Science Interview Questions And Answers

1. What is Data Science?

Answer: Data science is a multidisciplinary field that extracts meaningful insights from different types of data using scientific methods, processes, and algorithms. It helps solve analytically complex problems in a simplified way, and it turns raw data into business value.

2. What is systematic sampling?

Answer: Systematic sampling is a statistical technique that, as the name suggests, selects samples in a systematic way from an ordered sampling frame. The list is treated as circular: selection starts at one point, proceeds to the end of the list, and then continues again from the top. The equal-probability method, in which every k-th element is selected after a random start, is the most common example of systematic sampling.
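As a minimal sketch of the idea in Python, the snippet below picks every k-th element after a random start and wraps around the end of the list. The function name `systematic_sample` and the 100-unit frame are illustrative assumptions, not part of any library.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick n items from an ordered frame at a fixed interval, wrapping around."""
    if not 0 < n <= len(frame):
        raise ValueError("n must be between 1 and len(frame)")
    rng = random.Random(seed)
    k = len(frame) // n                 # sampling interval
    start = rng.randrange(len(frame))   # random starting position
    # Treat the list as circular: indices past the end wrap back to the top.
    return [frame[(start + i * k) % len(frame)] for i in range(n)]

population = list(range(1, 101))        # hypothetical ordered frame of 100 units
sample = systematic_sample(population, n=10, seed=42)
```

Because the start is random and the interval is fixed, every unit has the same chance of being selected.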

3. What is a Boltzmann machine?

Answer: Boltzmann machines have a simple learning algorithm that allows them to discover the important features that represent complex regularities in the data. These machines are generally used to optimize the weights and quantities for a given problem. The learning algorithm is very slow in networks with many layers of feature detectors. Restricted Boltzmann Machines have a single layer of feature detectors, which makes learning faster than in the general case.

4. What is the difference between Supervised Learning and Unsupervised Learning?

Answer: If an algorithm learns from labeled training data so that the knowledge can be applied to the test data, it is referred to as supervised learning. Classification is an example of supervised learning. If the algorithm has no response variable or labeled training data to learn from, it is referred to as unsupervised learning. Clustering is an example of unsupervised learning.

5. Explain cross-validation?

Answer: It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. It is mainly used in settings where the objective is forecasting and one wants to estimate how accurately a model will perform in practice.

The goal of cross-validation is to set aside a data set to test the model during the training phase (i.e. a validation data set) in order to limit problems like overfitting and gain insight into how the model will generalize to an independent data set.
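One common variant is k-fold cross-validation, where each fold takes a turn as the held-out validation set. The sketch below uses only the standard library; the helper names and the baseline "predict the mean" model are assumptions made for illustration.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle the indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=5):
    """Return the mean squared error measured on each held-out fold."""
    scores = []
    for held_out in k_fold_indices(len(xs), k):
        held = set(held_out)
        train = [i for i in range(len(xs)) if i not in held]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in held_out]
        scores.append(mean(errors))
    return scores

# Hypothetical baseline model: always predict the mean of the training targets.
def fit_mean(xs, ys):
    return mean(ys)

def predict_mean(model, x):
    return model

xs = list(range(20))
ys = [2 * x + 1 for x in xs]
fold_errors = cross_validate(xs, ys, fit_mean, predict_mean, k=5)
```

Averaging `fold_errors` gives a single estimate of how the model is likely to perform on data it has not seen.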

6. Explain Star Schema?

Answer: It is a traditional database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and can be connected to the central fact table using the ID fields; these tables are known as lookup tables, and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.

7. Why do you want to work at our company?

Answer: This question should be answered as a product of your research and reflection. What makes the company you’re interviewing with unique? How does your education prepare you for this specific position? Why would you choose this company over others, and why would you fit in there? These are all questions to ask yourself before going.

8. Why do you want to work as a data scientist?

Answer: This question plays off of your definition of data science. However, now recruiters are looking to understand what you’ll contribute and what you’ll gain from this field. Focus on what makes your path to becoming a data scientist unique – whether it be a mentor or a preferred method of data extraction.

9. What is the difference between Cluster and Systematic Sampling?

Answer: Cluster sampling is a technique used when it is difficult to study a target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique in which elements are selected from an ordered sampling frame. The list is traversed circularly, so once you reach the end of the list, selection continues from the top again. The best-known example of systematic sampling is the equal-probability method.

10. How can outlier values be treated?

Answer: Outlier values can be identified by using univariate or any other graphical analysis method. If there are only a few outlier values, they can be assessed individually, but for a large number of outliers the values can be substituted with either the 99th or the 1st percentile values. Not all extreme values are outlier values. The most common ways to treat outlier values are to change the value so that it falls within a range, or to simply remove it.
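The percentile substitution described above can be sketched as follows. The function name `cap_outliers` and the sample data are illustrative assumptions; `statistics.quantiles` is the standard-library call used to obtain the percentile cut points.

```python
from statistics import quantiles

def cap_outliers(values, lower_pct=1, upper_pct=99):
    """Replace values beyond the given percentiles with the percentile values."""
    cuts = quantiles(values, n=100)   # 99 cut points; cuts[0] is the 1st percentile
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]

data = list(range(1, 200)) + [10_000]   # hypothetical data with one extreme value
capped = cap_outliers(data)
```

Capping keeps the row in the data set while limiting its influence, whereas removal discards the observation entirely.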

11. What are the feature vectors?

Answer: A feature vector is an n-dimensional vector of numerical features that represent some object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics, called features, of an object in a mathematical, easily analyzable way.

12. What is the goal of A/B Testing?

Answer: It is statistical hypothesis testing for a randomized experiment with two variants, A and B. The goal of A/B testing is to identify which changes to a web page maximize or increase an outcome of interest. An example of this could be comparing the click-through rate of two versions of a banner ad.
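One way to run the hypothesis test for a click-through comparison is a two-proportion z-test. The sketch below is one possible approach, not the only one, and the click and view counts are made-up numbers for illustration.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Return the z statistic and two-sided p-value for rate B versus rate A."""
    p_a, p_b = clicks_a / views_a, clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (p_b - p_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: banner A got 200 clicks, banner B got 260,
# each out of 10,000 views.
z, p_value = two_proportion_z_test(200, 10_000, 260, 10_000)
```

A small `p_value` suggests the difference in click-through rate is unlikely to be due to chance alone.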

13. What is Supervised Learning?

Answer: Supervised learning is the machine learning task of inferring a function from labeled training data. The training data consist of a set of training examples.

Algorithms: Support Vector Machines, Regression, Naive Bayes, Decision Trees, K-nearest Neighbor Algorithm and Neural Networks

For example, if you built a fruit classifier, the labels would be “this is an orange”, “this is an apple”, and “this is a banana”, and the classifier would learn from being shown labeled examples of apples, oranges, and bananas.
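A fruit classifier along these lines can be sketched with the k-nearest neighbor algorithm from the list above. The features (weight in grams, length-to-width ratio) and all the measurements are hypothetical values chosen only to make the example run.

```python
from collections import Counter
from math import dist

# Hypothetical labeled training data: (weight in g, length/width ratio) -> label.
training_data = [
    ((150, 1.0), "apple"), ((160, 1.1), "apple"), ((145, 1.0), "apple"),
    ((250, 1.0), "orange"), ((240, 1.0), "orange"), ((260, 1.1), "orange"),
    ((120, 4.0), "banana"), ((125, 4.5), "banana"), ((115, 4.2), "banana"),
]

def knn_predict(training_data, features, k=3):
    """Label a new example by majority vote among its k nearest neighbors."""
    nearest = sorted(training_data, key=lambda item: dist(item[0], features))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

prediction = knn_predict(training_data, (155, 1.0))
```

The labels attached to the training examples are what make this supervised: the algorithm never has to guess what the categories are.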

14. What are recommender systems?

Answer: Recommender systems are a subclass of information filtering systems that predict the preference or rating a user would give to a product. They are widely used in areas such as news, movies, social tags, music, and products.

Familiar examples include the movie recommenders on Netflix, IMDb, and BookMyShow, product recommenders on e-commerce sites like eBay, Amazon, and Flipkart, YouTube video recommendations, and game recommendations.

15. What is Reinforcement learning?

Answer: Reinforcement learning is about learning what to do, that is, how to map situations to actions, so as to maximize a numerical reward signal. The learner is not told which action to take next but must instead discover which actions yield the maximum reward. Reinforcement learning is inspired by the way human beings learn, and it works on a reward/penalty mechanism.

16. What is logistic regression? State an example when you have used logistic regression recently?

Answer: Logistic regression, often referred to as the logit model, is a technique for predicting a binary outcome from a linear combination of predictor variables.

For example, suppose you want to predict whether a particular political leader will win an election. In this case, the outcome of the prediction is binary, i.e. 1 or 0 (win/lose). The predictor variables would be the amount of money spent on the candidate's election campaign, the amount of time spent campaigning, and so on.
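To make the election example concrete, here is a minimal sketch that fits a one-feature logistic model with plain gradient descent. The spending figures and outcomes are invented for illustration, and the learning rate and epoch count are assumptions rather than tuned values.

```python
from math import exp

def sigmoid(z):
    """Squash a linear score into a probability between 0 and 1."""
    return 1 / (1 + exp(-z))

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit weight w and bias b by gradient descent on the log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        residuals = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(r * x for r, x in zip(residuals, xs)) / n
        grad_b = sum(residuals) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical data: campaign spend (in millions) and outcome (1 = win, 0 = lose).
spend = [1, 2, 3, 4, 5, 6, 7, 8]
won = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = fit_logistic(spend, won)
p_win = sigmoid(w * 6.5 + b)   # estimated win probability at a spend of 6.5
```

The linear combination `w * x + b` is the logit; passing it through the sigmoid turns it into a probability that can be thresholded into win or lose.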

17. What are Artificial Neural Networks?

Answer: Artificial neural networks are among the main elements that have made machine learning popular. They are modeled on the functioning of the human brain. Artificial neural networks are trained to learn from examples and experience without being explicitly programmed. They are built from connected nodes called artificial neurons. Each connection acts like a synapse in the human brain, transmitting signals between the artificial neurons.

18. What is an Auto-Encoder?

Answer: Auto-encoders are learning networks that transform inputs into outputs with minimal error, which means the output must be very close to the input. We add a few layers between the input and the output, and these layers are smaller than the input layer. The auto-encoder receives unlabelled input, encodes it, and then reconstructs the input from that encoding.

19. Why does data cleaning play a vital role in the analysis?

Answer: Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process because, as the number of data sources increases, the time taken to clean the data grows rapidly with the number of sources and the volume of data they generate. Cleaning can take up to 80% of the total time, making it a critical part of the analysis task.

20. How can you assess a good logistic model?

Answer: There are various methods to assess the results of a logistic regression analysis:

Using the classification matrix to look at the true negatives and false positives.

Concordance, which helps identify the ability of the logistic model to differentiate between the event happening and not happening.

Lift, which helps assess the logistic model by comparing it with random selection.
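The three measures above can be computed by hand as in the sketch below. The function names, the 0.5 threshold, the top-20% cut for lift, and the example scores are all assumptions made for illustration.

```python
def confusion_matrix(actual, scores, threshold=0.5):
    """Count true/false positives and negatives at a probability threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, p in zip(actual, scores):
        predicted = 1 if p >= threshold else 0
        if predicted == 1:
            counts["tp" if y == 1 else "fp"] += 1
        else:
            counts["tn" if y == 0 else "fn"] += 1
    return counts

def concordance(actual, scores):
    """Share of (event, non-event) pairs where the event has the higher score."""
    events = [p for y, p in zip(actual, scores) if y == 1]
    non_events = [p for y, p in zip(actual, scores) if y == 0]
    concordant = sum(1 for e in events for n in non_events if e > n)
    return concordant / (len(events) * len(non_events))

def lift_at_top(actual, scores, fraction=0.2):
    """Event rate among the top-scored fraction divided by the overall rate."""
    ranked = sorted(zip(scores, actual), reverse=True)
    top = ranked[: max(1, int(len(ranked) * fraction))]
    top_rate = sum(y for _, y in top) / len(top)
    return top_rate / (sum(actual) / len(actual))

# Hypothetical outcomes and predicted probabilities from a fitted model.
actual = [1, 0, 1, 1, 0, 0, 1, 0, 0, 0]
scores = [0.9, 0.2, 0.8, 0.4, 0.3, 0.6, 0.7, 0.1, 0.2, 0.5]
matrix = confusion_matrix(actual, scores)
```

A lift above 1 means that targeting the highest-scored cases finds more events than picking cases at random.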

21. How often should an algorithm be updated?

Answer: This quasi-trick question has no specific time-based answer. This is because an algorithm should be updated whenever the underlying data is changing or when you want the model to evolve. Understanding the outcomes of dynamic algorithms is key to answering this question with confidence.

23. During analysis, how do you treat missing values?

Answer: The extent of the missing values is assessed after identifying the variables that have missing values. If any patterns are identified, the analyst has to concentrate on them, as they could lead to interesting and meaningful business insights. If no patterns are identified, the missing values can be substituted with mean or median values (imputation) or simply ignored. There are various factors to consider when answering this question:

Understand the problem statement and the data before giving an answer. A missing value can be assigned a default value, which can be the mean, minimum, or maximum value, so getting into the data is important.

If it is a categorical variable, the missing value is assigned a default value.

If the data follow a known distribution, use it: for a normal distribution, fill in the mean value.

Whether missing values should be treated at all is another important point to consider. If 80% of the values for a variable are missing, you can answer that you would drop the variable instead of treating the missing values.
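These rules can be sketched as a small helper. The function name `treat_missing`, the use of `None` as the missing marker, the choice of the median for numeric data, and the choice of the most frequent category as the default value are assumptions for illustration.

```python
from statistics import median, mode

def treat_missing(column, max_missing_share=0.8):
    """Return an imputed copy of the column, or None if it should be dropped."""
    present = [v for v in column if v is not None]
    missing_share = 1 - len(present) / len(column)
    if missing_share >= max_missing_share:
        return None                    # too sparse: drop the variable instead
    if all(isinstance(v, (int, float)) for v in present):
        fill = median(present)         # numeric: the median is robust to skew
    else:
        fill = mode(present)           # categorical: assumed default is the mode
    return [fill if v is None else v for v in column]

ages = treat_missing([31, None, 45, 38])
cities = treat_missing(["Pune", None, "Pune", "Delhi"])
sparse = treat_missing([None] * 9 + [1])
```

Whether the mean, the median, or another default is appropriate depends on the distribution of the variable, so the fill rule is worth revisiting for each data set.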

24. What is the Central Limit Theorem and why is it important?

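Answer: Suppose that we are interested in estimating the average height among all people. Collecting data for every person in the world is impossible. While we can’t obtain a height measurement from everyone in the population, we can still sample some people. The question now becomes: what can we say about the average height of the entire population given a single sample? The Central Limit Theorem addresses this question exactly. It states that, for a sufficiently large sample size, the distribution of the sample mean is approximately normal and centered on the population mean, whatever the shape of the population distribution (provided its variance is finite). This is what makes it possible to attach confidence intervals to an estimate from one sample.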
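A quick simulation can illustrate the theorem. The sketch below draws repeated samples from a deliberately skewed (exponential) population; the population size, sample size, and number of samples are arbitrary choices for illustration.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)
# Hypothetical skewed population: exponential with mean 1, far from bell-shaped.
population = [rng.expovariate(1.0) for _ in range(20_000)]

def sample_means(population, sample_size, n_samples, rng):
    """Draw repeated random samples and record the mean of each one."""
    return [mean(rng.sample(population, sample_size)) for _ in range(n_samples)]

means = sample_means(population, sample_size=50, n_samples=2_000, rng=rng)

center = mean(means)   # should sit close to the population mean
spread = stdev(means)  # should be close to stdev(population) / sqrt(50)
```

Plotting a histogram of `means` would show a roughly bell-shaped curve even though the population itself is strongly skewed.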

25. Do gradient descent methods always converge to the same point?

Answer: No, they do not. In some cases they reach a local minimum (a local optimum) rather than the global optimum. Where they end up is governed by the data and the starting conditions.
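The dependence on starting conditions is easy to see on a function with two valleys. The function below is an arbitrary example chosen for illustration, as are the learning rate and step count.

```python
def f(x):
    """A non-convex function with one local and one global minimum."""
    return x**4 - 3 * x**2 + x

def grad_f(x):
    """Derivative of f."""
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x0, lr=0.01, steps=2000):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x -= lr * grad_f(x)
    return x

left = gradient_descent(-2.0)    # settles in the valley on the negative side
right = gradient_descent(2.0)    # settles in the valley on the positive side
```

Both runs stop where the gradient is zero, but they reach different points, and only the valley on the negative side is the global minimum.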

26. What is the Law of Large Numbers?

Answer: It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It says that the sample mean, the sample variance, and the sample standard deviation converge to the quantities they are trying to estimate.

27. What are confounding variables?

Answer: These are extraneous variables in a statistical model that correlate, directly or inversely, with both the dependent and the independent variable. When the model fails to account for a confounding factor, the estimated relationship between the two is biased.

28. What are the differences between overfitting and underfitting?

Answer: In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data so that it can make reliable predictions on unseen data.

In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfitted has poor predictive performance, as it overreacts to minor fluctuations in the training data.

Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model too would have poor predictive performance.
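The contrast can be sketched on noisy quadratic data. Here a straight line stands in for an underfitting model, and a model that simply memorizes the training points stands in for an overfitting one; both models and the synthetic data are assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(1)

def make_data(n):
    """Noisy samples from a non-linear relationship, y = x^2 + noise."""
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    return xs, [x**2 + rng.gauss(0, 1) for x in xs]

def mse(predict, xs, ys):
    """Mean squared error of a prediction function on a data set."""
    return mean([(predict(x) - y) ** 2 for x, y in zip(xs, ys)])

def fit_line(xs, ys):
    """Underfit: least-squares straight line through non-linear data."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return lambda x: slope * x + (my - slope * mx)

def fit_memorizer(xs, ys):
    """Overfit: return the target of the closest training point, noise included."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

train_x, train_y = make_data(30)
test_x, test_y = make_data(30)
line, memorizer = fit_line(train_x, train_y), fit_memorizer(train_x, train_y)
```

The memorizer scores a perfect zero error on the training data but not on the test data, while the line has a large error even on the data it was fitted to.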

29. What do you mean by Deep Learning and Why has it become popular now?

Answer: Deep Learning is a paradigm of machine learning that has shown incredible promise in recent years, in part because of its analogy with the functioning of the human brain.

Now although Deep Learning has been around for many years, the major breakthroughs from these techniques came just in recent years. This is because of two main reasons:

The increase in the amount of data generated through various sources

The growth in hardware resources required to run these models

GPUs are multiple times faster and they help us build bigger and deeper deep learning models in comparatively less time than we required previously.

30. What is the role of the Activation Function?

Answer: An activation function is a function in an artificial neuron that computes the neuron's output from its inputs. It is used to introduce non-linearity into the neural network, helping it learn more complex functions. Without it, the neural network could only learn a linear function, that is, a linear combination of its input data.
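A tiny one-dimensional example shows why. Without an activation, two stacked linear layers collapse into a single linear layer; inserting a ReLU between them breaks that. The weights and biases below are arbitrary values picked for illustration.

```python
def linear(w, b):
    """Return a one-input linear layer x -> w * x + b."""
    return lambda x: w * x + b

def relu(z):
    """Rectified linear unit: pass positives through, clamp negatives to 0."""
    return max(0.0, z)

layer1, layer2 = linear(2.0, -1.0), linear(-3.0, 0.5)

def stacked_linear(x):
    """Two layers with no activation in between."""
    return layer2(layer1(x))

def stacked_relu(x):
    """The same two layers with a ReLU activation in between."""
    return layer2(relu(layer1(x)))

# Algebraically, -3 * (2x - 1) + 0.5 = -6x + 3.5: still a single straight line.
collapsed = linear(-6.0, 3.5)
```

`stacked_linear` matches `collapsed` for every input, while `stacked_relu` is flat for small inputs and sloped for larger ones, which no single linear layer can reproduce.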

31. Could you draw a comparison between overfitting and underfitting?

Answer: To make reliable predictions on unseen data in machine learning and statistics, a model has to be fitted to a set of training data. Overfitting and underfitting are two of the most common modeling errors that occur while doing so.

Following are the differences between overfitting and underfitting:

Definition – A statistical model suffering from overfitting describes some random error or noise in place of the underlying relationship. When underfitting occurs, a statistical model or machine learning algorithm fails in capturing the underlying trend of the data.

Occurrence – When a statistical model or machine learning algorithm is excessively complex, it can result in overfitting. An example of a complex model is one having too many parameters when compared to the total number of observations. Underfitting occurs when trying to fit a linear model to non-linear data.

Poor Predictive Performance – Although both overfitting and underfitting yield poor predictive performance, how each one of them does so is different. While the overfitted model overreacts to minor fluctuations in the training data, the underfit model under-reacts to even bigger fluctuations.

32. Can you compare the validation set with the test set?

Answer: A validation set is part of the training set used for parameter selection as well as for avoiding overfitting of the machine learning model being developed. On the contrary, a test set is meant for evaluating or testing the performance of a trained machine learning model.

33. What are outlier values and how do you treat them?

Answer: Outlier values, or simply outliers, are data points in statistics that don’t belong to a certain population. An outlier value is an abnormal observation that is very much different from other values belonging to the set.

Identification of outlier values can be done by using univariate or some other graphical analysis method. A few outlier values can be assessed individually, but a large set of outlier values calls for substituting them with either the 99th or the 1st percentile values.

There are two popular ways of treating outlier values:

To change the value so that it can be brought within a range

To simply remove the value

Note: Not all extreme values are outlier values.

34. What do you understand by Deep Learning?

Answer: Deep Learning is a paradigm of machine learning that displays a great degree of analogy with the functioning of the human brain. It is based on artificial neural networks with many layers; convolutional neural networks (CNNs) are one widely used architecture.

Deep learning has a wide array of uses, ranging from social network filtering to medical image analysis and speech recognition. Although Deep Learning has been present for a long time, it’s only recently that it has gained worldwide acclaim. This is mainly due to:

An increase in the amount of data generation via various sources

The growth in hardware resources required for running Deep Learning models

Caffe, Chainer, Keras, Microsoft Cognitive Toolkit, PyTorch, and TensorFlow are some of the most popular Deep Learning frameworks as of today.

35. How to understand the problems faced during data analysis?

Answer: Most of the problems faced during hands-on analysis or data science come from a poor understanding of the problem at hand and from concentrating more on tools, results, and other aspects of the project.

Breaking the problem down to a granular level and understanding it takes a lot of time and practice to master. Going back to square one in data science projects happens in a lot of companies, and even in your own projects or Kaggle problems.

36. How can I achieve accuracy in the first model that I built?

Answer: Building machine learning models involves a lot of interesting steps. Models with 90% accuracy don’t come on the very first attempt. You have to try many feature selection techniques to get to that point, which means a lot of trial and error. The process will help you learn new concepts in statistics, math, and probability.

37. What is the difference between machine learning and data mining?

Answer: Data mining is about working through large volumes of data to extract patterns that were previously unusual or unknown. Machine learning is the study, design, and development of algorithms that give computers the capacity to learn.

38. Why is data cleaning essential in Data Science?

Answer: Data cleaning is important in Data Science because the results of data analysis come from the existing data, and useless or unimportant data needs to be cleaned out periodically when it is no longer required. This ensures data reliability and accuracy, and it also frees up memory.

Data cleaning reduces data redundancy and gives better results in data analysis, especially where large volumes of customer information exist. Businesses such as e-commerce and retail, as well as government organizations, hold large amounts of customer transaction information that becomes outdated and needs to be cleaned.

Depending on the amount or size of data, suitable tools or methods should be used to clean the data in the database or big data environment. A data source can contain different types of data, such as dirty data, clean data, mixed clean and dirty data, and sample clean data.

Modern data science applications rely on machine learning models in which the learner learns from the existing data. The existing data should therefore always be clean and well maintained to get good outcomes when the system is optimized.

39. What is a Linear Regression in Data Science?

Answer: Linear regression is a supervised machine learning technique used in Data Science for predictive analysis.

Predictive analytics is an area within Statistical Sciences where the existing information will be extracted and processed to predict the trends and outcomes pattern. The core of the subject lies in the analysis of existing context to predict an unknown event.

Linear regression predicts a variable, called the target variable, by finding the best-fitting linear relationship between the dependent variable and an independent variable. Here the dependent variable is the outcome or response variable, whereas the independent variable is the predictor or explanatory variable.

For example, in real life, the expenses incurred in this financial year, or in recent months, can be used to predict the approximate expenses for the upcoming months or financial year.

Linear regression can be implemented in Python, and it is one of the most important methods in machine learning within the area of Data Science.
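A minimal sketch of such an implementation, using ordinary least squares with a single predictor, is shown below. The monthly expense figures are invented for illustration, and the function name is an assumption rather than a library API.

```python
from statistics import mean

def fit_simple_linear_regression(xs, ys):
    """Least-squares slope and intercept for one predictor and one target."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx

# Hypothetical monthly expenses for months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
expenses = [100, 110, 125, 130, 145, 150]

slope, intercept = fit_simple_linear_regression(months, expenses)
forecast_month_7 = slope * 7 + intercept   # predicted expense for next month
```

Here the month is the independent (predictor) variable and the expense is the dependent (target) variable.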

Linear regression is a form of regression analysis, which comes under the area of Statistical Sciences.

40. What is A/B testing in Data Science?

Answer: A/B testing is also called Bucket Testing or Split Testing. It is a method of comparing two versions of a system or application against each other to determine which version performs better. This is important in cases where multiple versions are shown to customers or end-users to achieve a goal.

In Data Science, A/B testing is used to find out which of the two variants better optimizes or increases the outcome of interest. A/B testing is a simple form of designed experiment, and it helps establish a cause-and-effect relationship between the independent and dependent variables.

A/B testing combines experimental design with statistical inference. Significance, randomization, and multiple comparisons are its key elements.

Significance refers to the statistical significance of the tests conducted. Randomization is the core component of the experimental design, ensuring that other variables are balanced across the groups. Multiple comparisons arise when more than two variants are compared, for example across many customer interests on an e-commerce site; this produces more false positives, so the confidence level has to be corrected.

41. Why Is Data Cleansing Important In Data Analysis?

Answer: With data coming in from multiple sources it is important to ensure that data is good enough for analysis. This is where data cleansing becomes extremely vital. Data cleansing extensively deals with the process of detecting and correcting data records, ensuring that data is complete and accurate and the components of data that are irrelevant are deleted or modified as per the needs. This process can be deployed in concurrence with data wrangling or batch processing.

Once the data is cleaned, it conforms to the rules of the data sets in the system. Data cleansing is an essential part of data science because data can be prone to error due to human negligence and corruption during transmission or storage, among other things. Data cleansing takes a huge chunk of a Data Scientist's time and effort because of the multiple sources from which data emanates and the speed at which it arrives.

42. What Do You Understand By The Term Normal Distribution?

Answer: It describes a continuous variable whose values are spread in the shape of a bell curve. It is a continuous probability distribution and is widely used in statistics. It is the most common distribution curve, and it is very useful for analyzing variables and their relationships.

The normal distribution curve is symmetrical. By the Central Limit Theorem, the distribution of sample means approaches the normal distribution as the size of the samples increases, even when the underlying data are not normally distributed. This helps make sense of random data by creating an order and interpreting the results using a bell-shaped graph.

43. What is Root cause Analysis?

Answer: The Root cause analysis was initially used to analyze industrial accidents, but now it has been extended into many areas. It is a technique that is being used to identify the root cause of a particular problem.

44. Explain the difference between overfitting and underfitting?

Answer: In machine learning as well as in statistics, a common task is to fit a model to a set of training data so that it can make reliable predictions on unseen data.

In overfitting, a statistical model describes random noise or errors instead of the underlying relationship. Overfitting arises when the model is too complex, meaning it has too many parameters relative to the number of observations. An overfitted model has poor predictive performance and overreacts to minor fluctuations in the training data.

Underfitting happens when a machine learning algorithm or statistical model is unable to capture the underlying trend of the data, for example when you try to fit a linear model to non-linear data. This kind of model also results in poor predictive performance.

45. What is Cluster Sampling?

Answer: Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.

For example, a researcher wants to survey the academic performance of high school students in Japan. The researcher can divide the entire population of Japan into different clusters (cities), and then select several clusters through simple or systematic random sampling, depending on the research.
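The survey example can be sketched as follows, with cities as clusters chosen by simple random sampling. The city names, student identifiers, and function name are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, seed=None):
    """Randomly pick whole clusters and keep every unit inside them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

# Hypothetical sampling frame: students grouped by city.
students_by_city = {
    "Tokyo": ["t1", "t2", "t3"],
    "Osaka": ["o1", "o2"],
    "Kyoto": ["k1", "k2", "k3"],
    "Sapporo": ["s1"],
}

surveyed = cluster_sample(students_by_city, n_clusters=2, seed=7)
```

The random choice is made among cities, not among individual students, which is what distinguishes cluster sampling from simple random sampling.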

46. What Are Interpolation And Extrapolation?

Answer: The terms interpolation and extrapolation are extremely important in any statistical analysis. Extrapolation is the estimation of a value by extending a known set of values or facts into an area or region that is unknown. It is the technique of inferring something beyond the range of the data that is available.

Interpolation, on the other hand, is the method of determining a certain value that falls between a certain set of values or the sequence of values.

This is especially useful when you have data at the two extremities of a certain region but you don’t have enough data points at a specific point. This is when you deploy interpolation to determine the value that you need.
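With two known points, both ideas reduce to the same straight-line formula; the only difference is whether the query lies inside or outside the known range. The function names and example points below are illustrative assumptions.

```python
def line_value(x, x0, y0, x1, y1):
    """Value at x on the straight line through (x0, y0) and (x1, y1)."""
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0)

def interpolate(x, x0, y0, x1, y1):
    """Estimate y at an x that lies between the two known points."""
    if not min(x0, x1) <= x <= max(x0, x1):
        raise ValueError("x is outside the known range; that is extrapolation")
    return line_value(x, x0, y0, x1, y1)

def extrapolate(x, x0, y0, x1, y1):
    """Estimate y at an x beyond the known points by extending the line."""
    return line_value(x, x0, y0, x1, y1)

inside = interpolate(2, 1, 10, 3, 30)    # between the known points
outside = extrapolate(4, 1, 10, 3, 30)   # beyond the known points
```

Extrapolated values deserve more caution, because nothing in the data confirms that the same trend continues outside the observed range.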

47. What Is K-means? How Can You Select K For K-means?

Answer: K-means clustering is a basic unsupervised learning algorithm. It partitions data into a predefined number of clusters, K, and is used to group data by similarity.

It begins by defining K centers, one for each cluster. The K points are selected at random as cluster centers, and each object is assigned to its nearest cluster center. The objects within a cluster are as closely related to one another as possible and differ as much as possible from the objects in other clusters. K-means clustering works very well for large sets of data. A common way to select K is the elbow method: run K-means for a range of K values, plot the within-cluster sum of squares against K, and pick the K at which further increases bring little improvement.
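The assign-and-update loop can be sketched in a few lines of standard-library Python. The sample points, the fixed iteration count, and the function names are assumptions for illustration; `inertia` computes the within-cluster sum of squares that is often compared across values of K.

```python
import random
from math import dist

def k_means(points, k, iterations=20, seed=0):
    """Cluster points by alternating assignment and center-update steps."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)      # K points picked at random as centers
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                 # assign each point to its nearest center
            nearest = min(range(k), key=lambda i: dist(p, centers[i]))
            clusters[nearest].append(p)
        centers = [                      # move each center to its cluster's mean
            tuple(sum(c) / len(cluster) for c in zip(*cluster)) if cluster else centers[i]
            for i, cluster in enumerate(clusters)
        ]
    return centers, clusters

def inertia(centers, clusters):
    """Within-cluster sum of squared distances; lower means tighter clusters."""
    return sum(dist(p, c) ** 2 for c, cluster in zip(centers, clusters) for p in cluster)

# Hypothetical data: two well-separated groups of points.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (9, 9), (9, 10), (10, 9), (10, 10)]
centers, clusters = k_means(points, k=2)
```

Computing `inertia` for K = 1, 2, 3, and so on and looking for the point where it stops dropping sharply is one practical way to choose K.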

48. How Is Machine Learning Deployed In Real World Scenarios?

Answer: Here are some of the scenarios in which machine learning finds applications in the real world:

Ecommerce: Understanding customer churn, deploying targeted advertising, remarketing.

Search engine: Ranking pages depending on the personal preferences of the searcher

Finance: Evaluating investment opportunities & risks, detecting fraudulent transactions

Healthcare: Designing drugs depending on the patient’s history and needs

Robotics: Machine learning for handling situations that are out of the ordinary

Social media: Understanding relationships and recommending connections

Extraction of information: framing questions for getting answers from databases over the web.

49. Now companies are heavily investing their money and time to make the dashboards. Why?

Answer: To make stakeholders more aware of the business through data. Working on visualization projects helps you develop one of the key skills every data scientist should possess, i.e. thinking from the perspective of the end-user.

If you’re learning any visualization tool, download a dataset from Kaggle. Building charts and graphs for the dashboard should be the last step. Research more about the domain and think about the KPIs you would like to see in the dashboard if you were the end-user. Then start building the dashboard piece by piece.

50. Why does SAS stand out among other data analytics tools?

Answer: Ease of understanding: SAS is remarkably easy to learn, and it is the most suitable option for those who already know SQL. R, on the other hand, comes with a steep learning curve and a low-level programming style.

Data handling capacities: SAS is on par with the leading tools, including R and Python. When it comes to handling huge data, it is the best platform to engage.

Graphical capacities: SAS comes with functional graphical capabilities, and with a little learning it is possible to customize the plots.

Better tool management: SAS releases updates under controlled conditions, which is the main reason why it is well tested. R and Python, by contrast, are open to contribution, so the risk of errors in the current development is higher.