Data Science Interview Questions Pdf
1. What is selection bias and why does it matter?
Answer: Selection bias is a product of inadequately or improperly randomized data leading to data sets that are not representative of the whole. In an interview, you should express the importance of this in terms of its effect on your solution. If your data is not representative, your solutions likely are not either.
2. What is Data Science?
Answer: Data science is defined as a multidisciplinary subject used to extract meaningful insights out of different types of data by employing various scientific methods such as scientific processes and algorithms. Data science helps in solving the analytically complex problems in a simplified way. It acts as a stream where you can utilize raw data to generate business value.
3. Explain the difference between overfitting and underfitting?
Answer: In machine learning as well as in statistics, the common task to undergo is to fit a model to a set of training data. It helps us in making reliable predictions using general untrained data.
In overfitting, a statistical model will help us in letting know the random noise or errors instead of the underlying relationship. Overfitting comes into light when the data is associated with too much complexity, which means it is associated with so many parameters relative to the number of observations. A model that is overfitted is always performed poor in predictive performance and acts overly to the minor fluctuations in the training data.
Unnderfittinng happens when a machine learning algorithm or statistical model is unable to focus on the underlying insights of the data. The case when you are trying to fix a linear model to a nonlinear one. This kind of model would result in poor predictive performance.
4. What are Artificial Neural Networks?
Answer: Artificial neural networks are the main elements which have made the machine learning popular. These neural networks are developed based on the functionality of a human brain. The Artificial neural networks are trained to learn from the examples and experiences without being programmed explicitly. Artificial neural networks work based on nodes called artificial neurons that are connected to one another. Each connection acts similar to synapses in the human brain that helps in transmitting the signals between the artificial neurons.
5. What is a Random Forest?
Answer: Random forest is a versatile method in machine learning that performs both classification and regression tasks. It also helps in areas like treats missing values, dimensionality reduction, and outlier values. It is like gathering the various weak modules comes together to form a robust model.
6. What is Reinforcement learning?
Answer: Reinforcement learning maps the situations to what to do and how to map actions. The end result of this Reinforcement learning is to maximize the numerical reward signal. The learner is not defined with what action to do next but instead must discover which actions will give the maximum reward. Reinforcement learning is developed from the learning process of human beings. It works based on the reward/penalty mechanism.
7. What is the difference between Cluster and Systematic Sampling?
Answer: Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements. Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner so once you reach the end of the list, it is progressed from the top again. The best example of systematic sampling is equal probability method.
8. What is the difference between Supervised Learning an Unsupervised Learning?
Answer: If an algorithm learns something from the training data so that the knowledge can be applied to the test data, then it is referred to as Supervised Learning. Classification is an example for Supervised Learning. If the algorithm does not learn anything beforehand because there is no response variable or any training data, then it is referred to as unsupervised learning. Clustering is an example of unsupervised learning. (E learning portal)
9. What is the difference between Bayesian Estimate and Maximum Likelihood Estimation (MLE)?
Answer: In the bayesian estimate, we have some knowledge about the data/problem (prior). There may be several values of the parameters which explain data and hence we can look for multiple parameters like 5 gammas and 5 lambdas that do this. As a result of Bayesian Estimate, we get multiple models for making multiple predictions i.e. one for each pair of parameters but with the same prior. So, if a new example needs to be predicted than computing the weighted sum of these predictions serves the purpose.
Maximum likelihood does not take prior into consideration (ignores the prior) so it is like being a Bayesian while using some kind of a flat prior.
10. What are the feature vectors?
Answer: A feature vector is an n-dimensional vector of numerical features that represent some object. In machine learning, feature vectors are used to represent numeric or symbolic characteristics, called features, of an object in a mathematical, easily analyzable way.
11. What is the Law of Large Numbers?
Answer: It is a theorem that describes the result of performing the same experiment a large number of times. This theorem forms the basis of frequency-style thinking. It says that the sample means, the sample variance and the sample standard deviation converge to what they are trying to estimate.
12. What is the role of the Activation Function?
Answer: The Activation function is used to introduce non-linearity into the neural network helping it to learn more complex function. Without which the neural network would be only able to learn linear function which is a linear combination of its input data. An activation function is a function in an artificial neuron that delivers an output based on inputs.
13. Could you draw a comparison between overfitting and underfitting?
Answer: In order to make reliable predictions on general untrained data in machine learning and statistics, it is required to fit a (machine learning) model to a set of training data. Overfitting and underfitting are two of the most common modeling errors that occur while doing so.
Following are the various differences between overfitting and underfitting:
Definition: A statistical model suffering from overfitting describes some random error or noise in place of the underlying relationship. When underfitting occurs, a statistical model or machine learning algorithm fails in capturing the underlying trend of the data.
Occurrence: When a statistical model or machine learning algorithm is excessively complex, it can result in overfitting. Example of a complex model is one having too many parameters when compared to the total number of observations. Underfitting occurs when trying to fit a linear model to non-linear data.
Poor Predictive Performance – Although both overfitting and underfitting yield poor predictive performance, the way in which each one of them does so is different. While the overfitted model overreacts to minor fluctuations in the training data, the underfit model under-reacts to even bigger fluctuations.
14. What do you mean by cluster sampling and systematic sampling?
Answer: When studying the target population spread throughout a wide area becomes difficult and applying simple random sampling becomes ineffective, the technique of cluster sampling is used. A cluster sample is a probability sample, in which each of the sampling units is a collection or cluster of elements.
Following the technique of systematic sampling, elements are chosen from an ordered sampling frame. The list is advanced in a circular fashion. This is done in such a way so that once the end of the list is reached, the same is progressed from the start, or top, again.
15. How and by what methods data visualizations can be effectively used?
Answer: In addition to giving insights in a very effective and efficient manner, data visualization can also be used in such a way that it is not only restricted to bar, line or some stereotypic graphs. Data can be represented in a much more visually pleasing manner.
One thing has to be taken care of is to convey the intended insight or finding correctly to the audience. Once the baseline is set. Innovative and creative part can help you come up with better looking and functional dashboards. There is a fine line between the simple insightful dashboard and awesome looking 0 fruitful insight dashboards.
16. What is the best Programming Language to use in Data Science?
Answers: Data Science can be handled by using programming languages like Python or R programming language. These two are the two most popular languages being used by the Data Scientists or Data Analysts. R and Python are open source and are free to use and came into existence during the 1990s.
Python and R have different advantages depending on the applications and required a business goal. Python is better to be used in the cases of repeated tasks or jobs and for data manipulations whereas R programming can be used for querying or retrieving datasets and customized data analysis.
Mostly Python is preferred for all types of data science applications where some time R programming is preferred in the cases of high or complex data applications. Python is easier to learn and has less learning curve whereas R has a deep learning curve.
Python is mostly preferred in all the cases which is a general-purpose programming language and can be found in many applications other than Data Science too. R is mostly seen in Data Science area only where it is used for data analysis in standalone servers or computing separately.
17. What Is A Recommender System?
Answer: A recommender system is a today widely deployed in multiple fields like movie recommendations, music preferences, social tags, research articles, search queries and so on. The recommender systems work as per collaborative and content-based filtering or by deploying a personality-based approach. This type of system works based on a person’s past behavior in order to build a model for the future. This will predict future product buying, movie viewing or book reading by people. It also creates a filtering approach using the discrete characteristics of items while recommending additional items.
18. How Do Data Scientists Use Statistics?
Answer: Statistics help Data Scientists to look into the data for patterns, hidden insights and convert Big Data into Big insights. It helps to get a better idea of what the customers are expecting. Data Scientists can learn about consumer behavior, interest, engagement, retention and finally conversion all through the power of insightful statistics. It helps them to build powerful data models in order to validate certain inferences and predictions. All this can be converted into a powerful business proposition by giving users what they want at precisely when they want it.
19. Why Data Cleansing Is Important In Data Analysis?
Answer: With data coming in from multiple sources it is important to ensure that data is good enough for analysis. This is where data cleansing becomes extremely vital. Data cleansing extensively deals with the process of detecting and correcting data records, ensuring that data is complete and accurate and the components of data that are irrelevant are deleted or modified as per the needs. This process can be deployed in concurrence with data wrangling or batch processing.
Once the data is cleaned it confirms with the rules of the data sets in the system. Data cleansing is an essential part of the data science because the data can be prone to error due to human negligence, corruption during transmission or storage among other things. Data cleansing takes a huge chunk of time and effort of a Data Scientist because of the multiple sources from which data emanates and the speed at which it comes.
20. How Is Data Modeling Different From Database Design?
Answer: Data Modeling: It can be considered as the first step towards the design of a database. Data modeling creates a conceptual model based on the relationship between various data models. The process involves moving from the conceptual stage to the logical model to the physical schema. It involves the systematic method of applying data modeling techniques.
Database Design: This is the process of designing the database. The database design creates an output which is a detailed data model of the database. Strictly speaking, database design includes the detailed logical model of a database but it can also include physical design choices and storage parameters.
21. Explain Star Schema.?
Answer: It is a traditional database schema with a central table. Satellite tables map ID’s to physical name or description and can be connected to the central fact table using the ID fields; these tables are known as lookup tables, and are principally useful in real-time applications, as they save a lot of memory. Sometimes star schemas involve several layers of summarization to recover information faster.
22. What do you mean by word Data Science?
Answer: Data Science is the extraction of knowledge from large volumes of data that are structured or unstructured, which is a continuation of the field data mining and predictive analytics, It is also known as knowledge discovery and data mining.
23. What do you mean by word Data Science?
Answer: Data Science is the extraction of knowledge from large volumes of data that are structured or unstructured, which is a continuation of the field data mining and predictive analytics, It is also known as knowledge discovery and data mining.
24. Why data cleaning plays a vital role in the analysis?
Answer: Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process because – as the number of data sources increases, the time take to clean the data increases exponentially due to the number of sources and the volume of data generated in these sources. It might take up to 80% of the time for just cleaning data making it a critical part of the analysis task.
25. What are an Eigenvalue and Eigenvector?
Answer: Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching. Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
26. How do you define data science?
Answer: This question allows you to show your interviewer who you are. For example, what’s your favorite part of the process, or what’s the most impactful project you’ve worked on? Focus first on what data science is to everyone – a means of extracting insights from numbers – then explain what makes it personal.
27. How have you overcome a barrier to finding a solution?
Answer: This question directly asks you to draw upon your experiences and your ability to problem-solve. Data scientists are, after all, numbers-based problem-solvers, so, it’s important to determine an example of a problem you’ve solved ahead of time. Whether it’s through re-cleaning data or using a different program, you should be able to explain your process to the recruiter.
28. Do you prefer Python or R for text analytics?
Answer: Here, you’re being asked to insert your own opinion. However, most data scientists agree that the right opinion is Python. This is because Python has Pandas library which has strong data analysis tools and an easy-to-use structure. What’s more, Python is typically faster for text analytics.
29. How can you assess a good logistic model?
Answer: There are various methods to assess the results of logistic regression analysis-
Using Classification Matrix to look at the true negatives and false positives.
Concordance that helps identify the ability of the logistic model to differentiate between the event happening and not happening.
Lift helps assess the logistic model by comparing it with random selection.
30. Why is data cleaning important for analysis?
Answer: This is a knowledge-based question with a relatively simple answer. So much of a data scientist’s time goes into cleaning data – and as the data gets bigger, so does the time it takes to clean. Cleaning it right is the foundation of analysis, and the time it takes to clean data, alone, makes it important.
31. How often should an algorithm be updated?
Answer: This quasi-trick question has no specific time-based answer. This is because an algorithm should be updated whenever the underlying data is changing or when you want the model to evolve over time. Understanding the outcomes of dynamic algorithms is key to answering this question with confidence.
32. What are Eigenvectors and Eigenvalues?
Answer: Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors for a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing or stretching.
Eigenvalue can be referred to as the strength of the transformation in the direction of eigenvector or the factor by which the compression occurs.
33. What do you mean by Deep Learning and Why has it become popular now?
Answer: Deep Learning is nothing but a paradigm of machine learning which has shown incredible promise in recent years. This is because of the fact that Deep Learning shows a great analogy with the functioning of the human brain.
Now although Deep Learning has been around for many years, the major breakthroughs from these techniques came just in recent years. This is because of two main reasons:
The increase in the amount of data generated through various sources
The growth in hardware resources required to run these models
GPUs are multiple times faster and they help us build bigger and deeper deep learning models in comparatively less time than we required previously.
34. Can you enumerate the various differences between Supervised and Unsupervised Learning?
Answer: Supervised learning is a type of machine learning where a function is inferred from labeled training data. The training data contains a set of training examples.
Unsupervised learning, on the other hand, is a type of machine learning where inferences are drawn from datasets containing input data without labeled responses. Following are the various other differences between the two types of machine learning:
Algorithms Used: Supervised learning makes use of Decision Trees, K-nearest Neighbor algorithm, Neural Networks, Regression, and Support Vector Machines. Unsupervised learning uses Anomaly Detection, Clustering, Latent Variable Models, and Neural Networks.
Enables – Supervised learning enables classification and regression, whereas unsupervised learning enables classification, dimension reduction, and density estimation
Use: While supervised learning is used for prediction, unsupervised learning finds use in analysis
35. How does data cleaning plays a vital role in the analysis?
Answer: Data cleaning can help in the analysis because:
Cleaning data from multiple sources helps to transform it into a format that data analysts or data scientists can work with.
Data Cleaning helps to increase the accuracy of the model in machine learning.
It is a cumbersome process because as the number of data sources increases, the time taken to clean the data increases exponentially due to the number of sources and the volume of data generated by these sources.
It might take up to 80% of the time for just cleaning data making it a critical part of the analysis task.
36. What is Cluster Sampling?
Answer: Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. Cluster Sample is a probability sample where each sampling unit is a collection or cluster of elements.
For eg., A researcher wants to survey the academic performance of high school students in Japan. He can divide the entire population of Japan into different clusters (cities). Then the researcher selects a number of clusters depending on his research through simple or systematic random sampling.
37. Can you explain the difference between a Validation Set and a Test Set?
Answer: A Validation set can be considered as a part of the training set as it is used for parameter selection and to avoid overfitting of the model being built.
On the other hand, a Test Set is used for testing or evaluating the performance of a trained machine learning model.
In simple terms, the differences can be summarized as; training set is to fit the parameters i.e. weights and test set is to assess the performance of the model i.e. evaluating the predictive power and generalization.
38. What do you understand by the Selection Bias? What are its various types?
Answer: Selection bias is typically associated with research that doesn’t have a random selection of participants. It is a type of error that occurs when a researcher decides who is going to be studied. On some occasions, selection bias is also referred to as the selection effect.
In other words, selection bias is a distortion of statistical analysis that results from the sample collecting method. When selection bias is not taken into account, some conclusions made by a research study might not be accurate. Following are the various types of selection bias:
Sampling Bias – A systematic error resulting due to a non-random sample of a populace causing certain members of the same to be less likely included than others that results in a biased sample.
Time Interval – A trial might be ended at an extreme value, usually due to ethical reasons, but the extreme value is most likely to be reached by the variable with the most variance, even though all variables have a similar mean.
Data – Results when specific data subsets are selected for supporting a conclusion or rejection of bad data arbitrarily.
Attrition – Caused due to attrition, i.e. loss of participants, discounting trial subjects or tests that didn’t run to completion.
39. What do you understand by linear regression and logistic regression?
Answer: Linear regression is a form of statistical technique in which the score of some variable Y is predicted on the basis of the score of a second variable X, referred to as the predictor variable. The Y variable is known as the criterion variable.
Also known as the logit model, logistic regression is a statistical technique for predicting the binary outcome from a linear combination of predictor variables.
40. Could you explain how to define the number of clusters in a clustering algorithm?
Answer: The primary objective of clustering is to group together similar identities in such a way that while entities within a group are similar to each other, the groups remain different from one another.
Generally, Within Sum of Squares is used for explaining the homogeneity within a cluster. For defining the number of clusters in a clustering algorithm, WSS is plotted for a range pertaining to a number of clusters. The resultant graph is known as the Elbow Curve.
The Elbow Curve graph contains a point that represents the point post which there aren’t any decrements in the WSS. This is known as the bending point and represents K in K–Means.
Although the aforementioned is the widely-used approach, another important approach is the Hierarchical clustering. In this approach, dendrograms are created first and then distinct groups are identified from there.
41. How to understand the problems faced during data analysis?
Answer: Most of the problem faced during hands-on analysis or data science is because of poor understanding of the problem in hand and concentrating more on tools, end results and other aspects of the project.
Breaking the problem down to a granular level and understanding takes a lot of time and practice to master. Coming back to square one in data science projects can be seen in a lot of companies and even in your own project or kaggle problems.
42. What is the common perception of visualization?
Answer: People think visualization as just charts and summary information. But they are beyond that and drive business with a lot of underlying principles. Learning design principles can help anyone build effective and efficient visualizations and this Tableau prep tool can drastically increase our time on focusing more important part. The only issue with Tableau is, it is paid and companies need to pay for leveraging that awesome tool.
43. What is the basic responsibility of a Data Scientist?
Answer: As a data scientist, we have the responsibility to make complex things simple enough that anyone without context should understand, what we are trying to convey.
The moment, we start explaining even the simple things the mission of making the complex simple goes away. This happens a lot when we are doing data visualization.
Less is more. Rather than pushing too much information on to readers brain, we need to figure out how easily we can help them consume a dashboard or a chart.
The process is simple to say but difficult to implement. You must bring the complex business value out of a self-explanatory chart. It’s a skill every data scientist should strive towards and good to have in their arsenal.
44. What does SAS stand out to be the best over other data analytics tools?
Answer: Ease to understand: The provisions included in SAS are remarkably easy to learn. Further, it offers the most suitable option for those who already are aware of the SQL. On the other hand, R comes with a steep training cover which is supposed to be a low-level programming style.
Data Handling Capacities: it is at par the most leading tool which also includes the R& Python.
If it advances before handling the huge data, it is the best platform to engage Graphical Capacities: it comes with functional graphical capacities and has a limited knowledge field.
It is useful to customize the plots Better tool management: It benefits in a release the updates with regards to the controlled conditions.
This is the main reason why it is well tested. Whereas if you considered R& Python, it has open contribution also the risk of errors in the current development is also high.
45. Why is data cleaning essential in Data Science?
Answer: Data cleaning is more important in Data Science because the end results or the outcomes of the data analysis come from the existing data where useless or unimportant need to be cleaned periodically as of when not required. This ensures the data reliability & accuracy and also memory is freed up.
Data cleaning reduces the data redundancy and gives good results in data analysis where some large customer information exists and that should be cleaned periodically. In businesses like e-commerce, retail, government organizations contain large customer transaction information which is outdated and needs to be cleaned.
Depending on the amount or size of data, suitable tools or methods should be used to clean the data from the database or big data environment. There are different types of data existing in a data source such as dirty data, clean data, mixed clean and dirty data and sample clean data.
Modern data science applications rely on machine learning model where the learner learns from the existing data. So, the existing data should always be cleanly and well maintained to get sophisticated and good outcomes during the optimization of the system.
46. What is A/B testing in Data Science?
Answer: A/B testing is also called Bucket Testing or Split Testing. This is the method of comparing and testing two versions of systems or applications against each other to determine which version of application performs better. This is important in cases where multiple versions are shown to the customers or end-users in order to achieve the goals.
In the area of Data Science, this A/B testing is used to know which variable out of the existing two variables in order to optimize or increase the outcome of the goal. A/B testing is also called Design of Experiment. This testing helps in establishing a cause and effect relationship between the independent and dependent variables.
This testing is also simply a combination of design experimentation or statistical inference. Significance, Randomization and Multiple Comparisons are the key elements of the A/B testing.
The significance is the term for the significance of statistical tests conducted. Randomization is the core component of the experimental design where the variables will be balanced. Multiple comparisons are the way of comparing more variables in the case of customer interests that causes more false positives resulting in the requirement of correction in the confidence level of a seller in the area of e-commerce.
47. Describe Univariate, Bivariate And Multivariate Analysis?
Answer: As the name suggests these are analysis methodologies having a single, double or multiple variables.
So a univariate analysis will have one variable and due to this, there are no relationships, causes. The major aspect of the univariate analysis is to summarize the data and find the patterns within it to make actionable decisions.
A Bivariate analysis deals with the relationship between two sets of data. These sets of paired data come from related sources, or samples. There are various tools to analyze such data including the chi-squared tests and t-tests when the data are having a correlation.
If the data can be quantified then it can be analyzed using a graph plot or a scatterplot. The strength of the correlation between the two data sets will be tested in a Bivariate analysis.
48. What Are Interpolation And Extrapolation?
Answer: The terms of interpolation and extrapolation are extremely important in any statistical analysis. Extrapolation is the determination or estimation using a known set of values or facts by extending it and taking it to an area or region that is unknown. It is the technique of inferring something using data that is available.
Interpolation, on the other hand, is the method of determining a certain value which falls between a certain set of values or the sequence of values.
This is especially useful when you have data at the two extremities of a certain region but you don’t have enough data points at a specific point. This is when you deploy interpolation to determine the value that you need.
49. Differentiate between Data modeling and Database design?
Answer: Data Modeling – Data modeling (or modeling) in software engineering is the process of creating a data model for an information system by applying formal data modeling techniques.
Database Design- Database design is the system of producing a detailed data model of a database. The term database design can be used to describe many different parts of the design of an overall database system.
50. What do you understand by term hash table collisions?
Answer: Hash table (hash map) is a kind of data structure used to implement an associative array, a structure that can map keys to values. Ideally, the hash function will assign each key to a unique bucket, but sometimes it is possible that two keys will generate an identical hash causing both keys to point to the same bucket. It is known as hash collisions.
51. Explain Cross-validation?
Answer: It is a model validation technique for evaluating how the outcomes of a statistical analysis will generalize to an independent data set. Mainly used in backgrounds where the objective is forecast and one wants to estimate how accurately a model will accomplish in practice.
The goal of cross-validation is to term a data set to test the model in the training phase (i.e. validation data set) in order to limit problems like overfitting and get an insight on how the model will generalize to an independent data set.