1. What is the difference between Cluster and Systematic Sampling?
Answer: Cluster sampling is a technique used when the target population is spread across a wide area and simple random sampling is impractical. A cluster sample is a probability sample in which each sampling unit is a collection, or cluster, of elements. Systematic sampling is a statistical technique in which elements are selected at a fixed interval from an ordered sampling frame. The list is traversed in a circular manner, so once you reach the end of the list you continue from the top. The best-known example of systematic sampling is the equal-probability method.
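The two techniques can be contrasted in a short sketch. The population, the cluster names, and the function names below are hypothetical, chosen only for illustration; the systematic sampler wraps around the end of the frame as described above.

```python
import random

def cluster_sample(clusters, n_clusters, rng):
    """Pick whole clusters at random and keep every element inside them."""
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [unit for name in chosen for unit in clusters[name]]

def systematic_sample(frame, n, rng):
    """Take every k-th element of an ordered frame, wrapping around circularly."""
    k = max(1, len(frame) // n)          # sampling interval
    start = rng.randrange(len(frame))    # random starting position
    return [frame[(start + i * k) % len(frame)] for i in range(n)]

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical population: households grouped by city (the clusters).
clusters = {
    "city_a": ["a1", "a2", "a3"],
    "city_b": ["b1", "b2"],
    "city_c": ["c1", "c2", "c3", "c4"],
}
print(cluster_sample(clusters, 2, rng))

frame = list(range(1, 21))  # ordered sampling frame of 20 units
print(systematic_sample(frame, 5, rng))
```

In the cluster sample every member of a chosen city is included, while the systematic sample spreads evenly across the whole frame.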
2. What does P-value signify about the statistical data?
Answer: The p-value is used to judge the significance of results after a hypothesis test. It is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the one observed, so it always lies between 0 and 1. With the conventional significance level of 0.05:
• P-value > 0.05 denotes weak evidence against the null hypothesis, which means the null hypothesis cannot be rejected.
• P-value < 0.05 denotes strong evidence against the null hypothesis, which means the null hypothesis can be rejected.
• P-value = 0.05 is the marginal value, indicating it could go either way.
3. A test has a true positive rate of 100% and a false-positive rate of 5%. There is a population with a 1/1000 rate of having the condition the test identifies. Considering a positive test, what is the probability of having that condition?
Answer: Suppose you are being tested for a disease. If you have the illness, the test will always say you have it. If you don’t have the illness, the test will wrongly say you have it 5% of the time and correctly say you don’t 95% of the time. So there is a 5% error rate among people who do not have the illness.
Out of 1000 people, the 1 person who has the disease will get a true positive result.
Out of the remaining 999 people, 5% will get a false positive result.
That is close to 50 people with a false positive result for the disease.
This means that out of 1000 people, about 51 will test positive even though only one person has the illness. The probability that you have the disease given a positive result is therefore about 1/51, or roughly 2%.
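The same arithmetic follows from Bayes' theorem. Here is a minimal sketch; the function and parameter names are chosen here for illustration.

```python
def posterior_given_positive(prevalence, true_positive_rate, false_positive_rate):
    """P(condition | positive test) via Bayes' theorem."""
    true_positives = prevalence * true_positive_rate
    false_positives = (1 - prevalence) * false_positive_rate
    return true_positives / (true_positives + false_positives)

p = posterior_given_positive(
    prevalence=1 / 1000,
    true_positive_rate=1.0,
    false_positive_rate=0.05,
)
print(f"{p:.4f}")  # prints 0.0196
```

The exact value is slightly below 2% because 5% of 999 people is 49.95 rather than 50.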
4. What is the goal of A/B Testing?
Answer: A/B testing is statistical hypothesis testing for a randomized experiment with two variants, A and B.
The goal of A/B testing is to identify which changes to a web page maximize or increase an outcome of interest.
An example is comparing the click-through rate of two versions of a banner ad.
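One common way to analyze such an experiment is a two-proportion z-test on the click-through rates. The sketch below is illustrative only: the source does not name a specific test, and the click and view counts are invented.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Two-sided z-test for a difference between two click-through rates."""
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    # Pooled rate under the null hypothesis that A and B perform the same.
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical counts, invented for illustration.
z, p_value = two_proportion_z_test(clicks_a=200, views_a=5000, clicks_b=260, views_b=5000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A p-value below the chosen significance level would suggest that variant B's click-through rate genuinely differs from variant A's.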
5. Python or R – Which one would you prefer for text analytics?
Answer: Python is usually preferred, for the following reasons:
- Python has the pandas library, which provides easy-to-use data structures and high-performance data analysis tools.
- R is better suited to machine learning than to text analysis alone.
- Python tends to perform faster for most types of text analytics.
6. What is Systematic Sampling?
Answer: Systematic sampling is a statistical technique in which elements are selected at a fixed interval from an ordered sampling frame. The list is traversed in a circular manner, so once you reach the end of the list you continue from the top. The best-known example of systematic sampling is the equal-probability method.
7. Which technique is used to predict categorical responses?
Answer: Classification techniques are used to predict categorical responses. They are widely used in data mining for classifying data sets.
8. What are Recommender Systems?
Answer: Recommender systems are a subclass of information filtering systems that predict the preference or rating a user would give to a product. They are widely used for movies, news, research articles, products, social tags, music, etc.
9. What is power analysis?
Answer: Power analysis is a vital part of experimental design. It is the process of determining the sample size needed to detect an effect of a given size with a given degree of confidence. Conversely, it lets you work out the probability of detecting an effect under a sample size constraint.
The techniques of statistical power analysis and sample size estimation are widely used to make sound statistical judgments and to evaluate the sample size an experiment needs in practice.
Power analysis helps you choose a sample size that is neither too high nor too low. With too small a sample, the results are not reliable enough to answer the question; with too large a sample, resources are wasted.
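As a rough illustration, the sample size for comparing two group means can be approximated with the normal distribution. This sketch assumes a two-sided test, equal group sizes, and a standardized effect size (Cohen's d); none of these choices come from the text above, and dedicated tools use more exact t-based calculations.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group n for a two-sided, two-sample comparison of means."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)

for d in (0.2, 0.5, 0.8):  # conventional small, medium, and large effects
    print(d, sample_size_per_group(d))
```

Smaller effects need far larger samples, which is why the expected effect size should be fixed before the experiment is run.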
10. How is Data modeling different from Database design?
Data Modeling: It can be considered the first step towards the design of a database. Data modeling creates a conceptual model based on the relationships between various data objects. The process moves from the conceptual stage to the logical model to the physical schema, applying data modeling techniques systematically.
Database Design: This is the process of designing the database. Its output is a detailed data model of the database. Strictly speaking, database design covers the detailed logical model of a database, but it can also include physical design choices and storage parameters.
11. What are the important skills to have in Python with regard to data analysis?
The following skills come in handy when performing data analysis using Python:
- Good understanding of the built-in data types, especially lists, dictionaries, tuples, and sets.
- Mastery of N-dimensional NumPy arrays.
- Mastery of pandas data frames.
- Ability to perform element-wise vector and matrix operations on NumPy arrays. This requires the biggest shift in mindset for someone coming from a traditional software development background who’s used to for loops.
- Knowing how to use the Anaconda distribution and the conda package manager.
- Familiarity with scikit-learn.
- Ability to write efficient list comprehensions instead of traditional for loops.
- Ability to write small, clean functions (important for any developer), preferably pure functions that don’t alter objects.
- Knowing how to profile the performance of a Python script and how to optimize bottlenecks.
These skills will help you tackle most problems in data analytics and machine learning.
12. What are the differences between overfitting and underfitting?
Answer: In statistics and machine learning, one of the most common tasks is to fit a model to a set of training data so that it can make reliable predictions on unseen data.
In overfitting, a statistical model describes random error or noise instead of the underlying relationship. Overfitting occurs when a model is excessively complex, such as having too many parameters relative to the number of observations. A model that has been overfit has poor predictive performance, as it overreacts to minor fluctuations in the training data.
Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data. Underfitting would occur, for example, when fitting a linear model to non-linear data. Such a model also has poor predictive performance.
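Both failure modes can be seen on synthetic data. In this sketch the quadratic trend, the noise level, and the two models are assumptions made for illustration: a straight line stands in for an underfitting model, and a 1-nearest-neighbour lookup that memorizes the training set stands in for an overfitting one.

```python
import random

rng = random.Random(42)  # fixed seed so the sketch is reproducible

def make_data(n):
    """Noisy samples from a quadratic trend y = x^2 + noise."""
    xs = [rng.uniform(-3, 3) for _ in range(n)]
    return [(x, x * x + rng.gauss(0, 1)) for x in xs]

def fit_line(data):
    """Least-squares straight line: too simple for a quadratic trend."""
    n = len(data)
    mean_x = sum(x for x, _ in data) / n
    mean_y = sum(y for _, y in data) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in data)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in data)
    slope = sxy / sxx
    return lambda x: mean_y + slope * (x - mean_x)

def fit_memorizer(data):
    """1-nearest-neighbour lookup: reproduces every training point, noise included."""
    return lambda x: min(data, key=lambda point: abs(point[0] - x))[1]

def mse(model, data):
    """Mean squared error of a model on a list of (x, y) pairs."""
    return sum((model(x) - y) ** 2 for x, y in data) / len(data)

train, test = make_data(30), make_data(200)
line, memorizer = fit_line(train), fit_memorizer(train)
print(f"line:      train MSE {mse(line, train):.2f}, test MSE {mse(line, test):.2f}")
print(f"memorizer: train MSE {mse(memorizer, train):.2f}, test MSE {mse(memorizer, test):.2f}")
```

The line has a large error even on the training data, while the memorizer scores a perfect zero on the training data but not on the held-out test data.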
13. What do you mean by the term linear regression?
Answer: Linear regression is a statistical technique for modeling the relationship between one or more explanatory variables, denoted x, and a scalar dependent variable, denoted y.
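For a single explanatory variable, the least-squares line can be computed directly. This is a minimal sketch with made-up data that lies exactly on y = 1 + 2x; the function name is chosen here for illustration.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]  # generated from y = 1 + 2x with no noise
intercept, slope = fit_simple_linear_regression(xs, ys)
print(intercept, slope)  # prints 1.0 2.0
```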
14. What prior subjects are required to become a data analyst?
- 1. Math skills
- 1.1 Probability
- 1.2 Statistics
- 1.3 Linear algebra
- 2. Programming skills
- 2.1 Development Environment
- 2.2 Data Analysis
- 2.3 Data Visualization
- 2.4 Machine Learning
15. What are the different types of data?
1. Numerical Data – It contains data that have meaning as a measurement, such as a person’s height, weight, IQ, or blood pressure.
Numerical data can be further broken into two types:
- a. Discrete data – It represents items that can be counted. The possible values can be listed out, and the list may be finite or may go on to infinity.
- b. Continuous data – It represents measurements. The possible values cannot be counted and can only be described using intervals on the real number line.
2. Categorical Data – We use it to represent characteristics, such as:
- a person’s gender,
- marital status.
Categorical data can take on numerical values, such as “1” indicating male and “2” indicating female, but those numbers don’t have mathematical meaning. You couldn’t add them together.
Qualitative data is another name for categorical data; a Yes/No variable is a common example. There is one more type, called ordinal data.
3. Ordinal Data – It mixes numerical and categorical data.
The data fall into categories, but the numbers placed on the categories have meaning.
For example, rating a restaurant on a scale from 0 to 4 stars gives ordinal data.
Ordinal data are often treated as categorical, although the groups must be kept in order when creating graphs and charts.
16. What are the independent graphics subsystems in R?
Answer: a. Traditional graphics: available in R from the beginning. A rich collection of tools, but not very flexible.
b. Grid graphics: more recent (2000). A low-level tool that is flexible.
c. Grid forms the basis of two high-level graphics systems:
- Lattice: based on Trellis graphics (Cleveland)
- ggplot2: inspired by “Grammar of Graphics” (Wilkinson)
17. What are the key features of the Teradata Database?
Answer: a) Parallel architecture: The Teradata Database provides exceptional performance by using parallelism to reach a single answer faster than a non-parallel system. Parallelism means multiple processors working together to accomplish a task quickly.
b) Single data store: The Teradata Database acts as a single data store. Instead of replicating a database for different purposes, you can store the data once and use it for all applications. The Teradata Database provides the same connectivity for all systems.
c) Scalability: As components are added to the system, performance increases linearly. Scalability enables the system to grow to support more users, data, queries, and query complexity without experiencing performance degradation.
18. What is Bias – Variance tradeoff?
Answer: Bias and variance are two sources of error in machine learning that prevent algorithms from generalizing beyond the training set.
a) Bias is the error that comes from missing relations between features and outputs. In machine learning, this phenomenon is called underfitting.
b) Variance is the error that comes from sensitivity to small fluctuations in the training data. In machine learning, this phenomenon is called overfitting.
A good learning algorithm should capture patterns in the training data (low bias), but it should also generalize well to unseen data. In general, a complex model can show low bias because it captures many relations in the training data and, at the same time, high variance because it will not necessarily generalize well. The opposite happens with models that have high bias and low variance. For many algorithms the expected error can be decomposed analytically into three components: bias, variance, and the irreducible error, which is a lower bound on the expected error for unseen data.
One way to reduce the variance is to get more data or to decrease the complexity of the model. One way to reduce the bias is to add more features or to make the model more complex, as adding more data will not help in this case. Finding the right balance between bias and variance is an art that every data scientist must be able to manage.
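For squared-error loss, the three-way decomposition mentioned above can be written out explicitly. The notation is an assumption made here, not taken from the source: the data follow y = f(x) + ε with noise of mean zero and variance σ², f̂ is the model fitted on a random training set, and the expectation is taken over training sets and noise.

```latex
\[
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
\]
```

The last term does not depend on the model, which is why it acts as a lower bound on the expected error.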
19. Why is data cleansing important in data analysis?
Answer: With data coming in from multiple sources, it is important to ensure that the data is good enough for analysis. This is where data cleansing becomes vital. Data cleansing is the process of detecting and correcting data records, ensuring that data is complete and accurate, and deleting or modifying irrelevant components as needed. It can be carried out alongside data wrangling or batch processing.
Once the data is cleaned, it conforms to the rules of the data sets in the system. Data cleansing is an essential part of data science because data is prone to error from human negligence and from corruption during transmission or storage, among other things. It takes up a large share of a data scientist’s time and effort because of the many sources data comes from and the speed at which it arrives.
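A few typical cleansing steps can be sketched on a small list of records. The field names, the validity rule, and the sample rows are hypothetical; real pipelines apply rules specific to their own data sets.

```python
def clean_records(records):
    """Trim whitespace, normalize case, and drop incomplete or duplicate rows."""
    seen_emails = set()
    cleaned = []
    for record in records:
        name = (record.get("name") or "").strip().title()
        email = (record.get("email") or "").strip().lower()
        if not name or "@" not in email:
            continue  # incomplete or malformed record
        if email in seen_emails:
            continue  # duplicate of a record already kept
        seen_emails.add(email)
        cleaned.append({"name": name, "email": email})
    return cleaned

raw = [
    {"name": "  alice smith ", "email": "Alice@Example.com"},
    {"name": "Alice Smith", "email": "alice@example.com "},  # duplicate after cleanup
    {"name": "Bob", "email": None},                          # missing email
    {"name": "", "email": "carol@example.com"},              # missing name
    {"name": "dave jones", "email": "dave@example.com"},
]
print(clean_records(raw))
```

Only the first Alice record and the Dave record survive, both in normalized form.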
20. Describe univariate, bivariate and multivariate analysis.
Answer: As the names suggest, these are analysis methodologies involving one, two, or multiple variables.
A univariate analysis has a single variable, so it does not deal with relationships or causes. Its main purpose is to summarize the data and find patterns within it to support decisions.
A bivariate analysis deals with the relationship between two sets of data. These sets of paired data come from related sources or samples. Tools for analyzing such data include chi-squared tests and t-tests. If the data can be quantified, it can be examined with a graph or a scatterplot, and the strength of the correlation between the two data sets can be tested.
A multivariate analysis extends this to three or more variables examined together.
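The difference between the first two can be shown with a small sketch: summary statistics for one variable, then a Pearson correlation between two. The paired data (study hours and test scores) are invented for illustration.

```python
from math import sqrt
from statistics import mean, median, stdev

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two paired samples."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical paired observations.
hours = [1, 2, 3, 4, 5, 6]
scores = [52, 55, 61, 64, 70, 74]

# Univariate: summarize a single variable.
print(mean(scores), median(scores), stdev(scores))

# Bivariate: measure the strength of the relationship between two variables.
print(pearson_r(hours, scores))
```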
21. How do Data Scientists use Statistics?
Answer: Statistics helps data scientists look into data for patterns and hidden insights and convert Big Data into big insights. It gives a better idea of what customers are expecting. Data scientists can learn about consumer behavior, interest, engagement, retention, and finally conversion through statistics. It helps them build data models to validate inferences and predictions. All of this can be turned into a strong business proposition by giving users what they want precisely when they want it.
22. Explain the various benefits of R language?
Answer: The R programming language comes with a software suite used for graphical representation, statistical computing, data manipulation, and calculation.
Some of the highlights of the R programming environment include the following:
- An extensive collection of tools for data analysis
- Operators for performing calculations on matrices and arrays
- Data analysis techniques for graphical representation
- A highly developed yet simple and effective programming language
- It extensively supports machine learning applications
- It acts as a connecting link between various software, tools, and datasets
- It creates high-quality, reproducible analyses that are flexible and powerful
- Provides a robust package ecosystem for diverse needs
- It is useful when you have to solve a data-oriented problem
23. Compare SAS, R, and Python programming?
Answer: SAS: It is one of the most widely used analytics tools, used by some of the biggest companies in the world. It has some of the best statistical functions and a graphical user interface, but it comes with a price tag and hence is not readily adopted by smaller enterprises.
R: The best part about R is that it is an open source tool and hence widely used by academia and the research community. It is a robust tool for statistical computation, graphical representation, and reporting. Because it is open source, it is updated frequently, and new features are readily available to everybody.
Python: Python is a powerful open source programming language that is easy to learn and works well with most other tools and technologies. The best part about Python is its large number of libraries and community-created modules, which make it very robust. It has functions for statistical operations, model building, and more.
R and Python are two of the most important programming languages for machine learning.
24. Which language is more suitable for text analytics? R or Python?
Answer: Python has a rich library called pandas, which gives analysts high-level data analysis tools and data structures. R is geared more toward statistical analysis. Hence Python is often considered more suitable for text analytics.
25. How do you build a random forest?
Answer: The underlying principle of this technique is that several weak learners combine to form a strong learner. The steps involved are:
- Build several decision trees on bootstrapped samples of the training data.
- On each tree, each time a split is considered, choose a random sample of m predictors as split candidates out of all p predictors.
- Combine the trees’ predictions, by majority vote for classification or by averaging for regression.
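The two randomization steps above, together with the usual aggregation of the trees' predictions by majority vote, can be sketched as follows. This is not a full random forest: tree growing is left out, and the row count, the values of p and m, and the function names are placeholders chosen for illustration.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) training rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def split_candidates(p, m, rng):
    """Pick m of the p predictor indices to consider at one split."""
    return sorted(rng.sample(range(p), m))

def majority_vote(tree_predictions):
    """Combine the class predicted by each tree into a single answer."""
    return Counter(tree_predictions).most_common(1)[0][0]

rng = random.Random(0)   # fixed seed so the sketch is reproducible
rows = list(range(10))   # stand-in for 10 training rows
p = 9                    # total number of predictors
m = 3                    # roughly sqrt(p), a common choice for classification

for tree in range(3):
    print("tree", tree, bootstrap_sample(rows, rng), split_candidates(p, m, rng))
print(majority_vote(["spam", "ham", "spam"]))
```

Because each tree sees different rows and different candidate predictors, the trees are decorrelated, which is what makes combining them worthwhile.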
26. Explain survivorship bias?
Answer: Survivorship bias is the logical error of focusing on the people or things that survived some process while overlooking those that did not because of their lack of visibility. It can lead to wrong conclusions in many different ways.
27. Why is resampling done?
Answer: Resampling is done in any of these cases:
- Estimating the accuracy of sample statistics by using subsets of the available data or by drawing randomly with replacement from a set of data points
- Exchanging labels on data points when performing significance tests
- Validating models by using random subsets (bootstrapping, cross-validation)
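The first case, drawing randomly with replacement to estimate the accuracy of a sample statistic, is the bootstrap. This is a minimal sketch that builds a percentile interval for a sample mean; the measurements and the number of resamples are invented for illustration.

```python
import random
from statistics import mean

def bootstrap_means(sample, n_resamples, rng):
    """Means of resamples drawn with replacement from the observed sample."""
    return [mean(rng.choices(sample, k=len(sample))) for _ in range(n_resamples)]

def percentile_interval(values, level=0.95):
    """Central interval covering roughly `level` of the resampled statistics."""
    ordered = sorted(values)
    low_index = int((1 - level) / 2 * len(ordered))
    high_index = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[low_index], ordered[high_index]

rng = random.Random(7)  # fixed seed so the sketch is reproducible
sample = [12.1, 9.8, 11.4, 10.3, 13.0, 9.5, 10.9, 11.7, 12.4, 10.0]
low, high = percentile_interval(bootstrap_means(sample, 2000, rng))
print(f"sample mean {mean(sample):.2f}, 95% bootstrap interval ({low:.2f}, {high:.2f})")
```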
28. What are the types of biases that can occur during sampling?
- Selection bias
- Undercoverage bias
- Survivorship bias
29. What are Eigenvalue and Eigenvector?
Answer: Eigenvectors are used for understanding linear transformations. In data analysis, we usually calculate the eigenvectors of a correlation or covariance matrix. Eigenvectors are the directions along which a particular linear transformation acts by flipping, compressing, or stretching. The eigenvalue is the factor by which the transformation scales its eigenvector.
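For a 2x2 symmetric matrix, such as a covariance matrix of two variables, the eigenvalues and eigenvectors have a closed form. The matrix below is a made-up example; in practice a numerical library would be used instead of this hand-rolled sketch.

```python
from math import hypot, sqrt

def eigen_symmetric_2x2(a, b, d):
    """Eigenvalues and unit eigenvectors of the symmetric matrix [[a, b], [b, d]]."""
    if b == 0:
        # Already diagonal: the coordinate axes are the eigenvectors.
        return [(a, (1.0, 0.0)), (d, (0.0, 1.0))]
    centre = (a + d) / 2
    radius = sqrt(((a - d) / 2) ** 2 + b ** 2)
    pairs = []
    for value in (centre + radius, centre - radius):
        # (b, value - a) solves (A - value * I) v = 0 when b != 0.
        vx, vy = b, value - a
        norm = hypot(vx, vy)
        pairs.append((value, (vx / norm, vy / norm)))
    return pairs

# Hypothetical covariance matrix [[2, 1], [1, 2]].
a, b, d = 2.0, 1.0, 2.0
for value, (vx, vy) in eigen_symmetric_2x2(a, b, d):
    # Applying the matrix to an eigenvector only rescales it by the eigenvalue.
    transformed = (a * vx + b * vy, b * vx + d * vy)
    print(value, (vx, vy), transformed)
```

Each transformed vector points in the same direction as its eigenvector, stretched by the eigenvalue.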
30. Do gradient descent methods always converge to the same point?
Answer: No, they do not. In some cases they reach a local minimum (a local optimum) rather than the global optimum. Where they end up is governed by the data and the starting conditions.
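The dependence on starting conditions is easy to demonstrate on a function with two minima. The function, learning rate, and starting points below are chosen purely for illustration.

```python
def gradient_descent(gradient, start, learning_rate=0.01, steps=2000):
    """Plain gradient descent on a one-dimensional function."""
    x = start
    for _ in range(steps):
        x -= learning_rate * gradient(x)
    return x

def f(x):
    """A quartic with one global and one local minimum."""
    return x ** 4 - 3 * x ** 2 + x

def f_prime(x):
    """Derivative of f."""
    return 4 * x ** 3 - 6 * x + 1

left = gradient_descent(f_prime, start=-2.0)
right = gradient_descent(f_prime, start=2.0)
print(f"start -2.0 -> x = {left:.4f}, f(x) = {f(left):.4f}")
print(f"start  2.0 -> x = {right:.4f}, f(x) = {f(right):.4f}")
```

Starting from the left, the method reaches the global minimum; starting from the right, it settles in the shallower local minimum.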
31. What are the drawbacks of the linear model?
Some drawbacks of the linear model are:
- It assumes linearity of the errors.
- It can’t be used for count outcomes or binary outcomes.
- There are overfitting problems that it can’t solve.