1. How do you approach data preparation?
Answer: How to Approach: Data preparation is one of the crucial steps in big data projects, and a big data interview is likely to include at least one question about it. When the interviewer asks this question, they want to know what steps and precautions you take during data preparation.
As you already know, data preparation is required to get the necessary data, which can then be used for modeling. You should convey this message to the interviewer. You should also mention the type of model you are going to use and the reasons for choosing it. Last but not least, discuss important data preparation terms such as transforming variables, outlier values, unstructured data, and identifying gaps.
2. How would you transform unstructured data into structured data?
Answer: How to Approach: Unstructured data is very common in big data. It should be transformed into structured data to ensure proper data analysis. You can start answering the question by briefly differentiating between the two. Once done, you can discuss the methods you use to transform one form into the other. You might also share a real-world situation where you did it. If you have recently graduated, you can share information from your academic projects.
By answering this question correctly, you are signaling that you understand the types of data, both structured and unstructured, and also have the practical experience to work with these. If you answer this question specifically, you will be able to crack the big data interview.
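As a concrete illustration of the kind of transformation you might describe, here is a minimal Python sketch that turns free-form log lines into structured records. The log layout, the `LOG_PATTERN` regular expression, and the sample lines are all assumptions made up for this example; real sources will need their own parsing rules.

```python
import re

# Hypothetical log layout, chosen only for illustration:
# "<date> <time> <LEVEL> <free-text message>"
LOG_PATTERN = re.compile(
    r"(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<message>.*)"
)


def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line.strip())
    return match.groupdict() if match else None


# Made-up sample input: two parseable lines and one that stays unstructured.
raw_lines = [
    "2021-03-01 12:00:05 ERROR Disk quota exceeded on node-7",
    "2021-03-01 12:00:09 INFO Job 42 finished",
    "free-form note with no recognizable fields",
]

# Keep only the lines that could be mapped onto the fixed schema.
records = [rec for rec in map(parse_line, raw_lines) if rec is not None]

for rec in records:
    print(rec["timestamp"], rec["level"], rec["message"])
```

Lines that do not fit the pattern are dropped here for brevity; in an interview answer you would probably mention routing them to a separate store for later inspection.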
3. Explain some important features of Hadoop?
Answer: Hadoop supports the storage and processing of big data. It is a widely used solution for handling big data challenges.
Some important features of Hadoop are –
- Open Source – Hadoop is an open source framework which means it is available free of cost. Also, the users are allowed to change the source code as per their requirements.
- Distributed Processing – Hadoop supports distributed processing of data i.e. faster processing. The data in Hadoop HDFS is stored in a distributed manner and MapReduce is responsible for the parallel processing of data.
- Fault Tolerance – Hadoop is highly fault-tolerant. It creates three replicas for each block at different nodes, by default. This number can be changed according to the requirement. So, we can recover the data from another node if one node fails. The detection of node failure and recovery of data is done automatically.
- Reliability – Hadoop stores data on the cluster in a reliable manner that is independent of machine. So, the data stored in a Hadoop environment is not affected by the failure of the machine.
- Scalability – Another important feature of Hadoop is scalability. It is compatible with a wide range of hardware, and we can easily add new hardware to the cluster as nodes.
- High Availability – The data stored in Hadoop is available to access even after the hardware failure. In case of hardware failure, the data can be accessed from another path.
4. How can you achieve security in Hadoop?
Answer: Kerberos is used to achieve security in Hadoop. At a high level, there are three steps to access a service while using Kerberos. Each step involves a message exchange with a server.
- Authentication – In the first step, the client authenticates itself to the authentication server, which then provides a time-stamped TGT (Ticket-Granting Ticket) to the client.
- Authorization – In this step, the client uses the received TGT to request a service ticket from the TGS (Ticket-Granting Server).
- Service Request – In the final step, the client uses the service ticket to authenticate itself to the server.
5. How to restart all the daemons in Hadoop?
Answer: To restart all the daemons, you must first stop them all. The Hadoop directory contains an sbin directory that stores the scripts to stop and start daemons in Hadoop.
Use the /sbin/stop-all.sh command to stop all the daemons, and then use the /sbin/start-all.sh command to start them all again.
6. Explain the term ‘Commodity Hardware’?
Answer: Commodity hardware refers to the inexpensive, widely available hardware resources and components needed to run the Apache Hadoop framework and related data management tools, as opposed to specialized high-end machines. Hadoop nodes are often provisioned with substantial RAM to execute tasks (figures of 64-512 GB are commonly quoted), and any standard hardware that meets the minimum requirements is known as ‘Commodity Hardware.’
7. Why do we need Hadoop for Big Data Analytics?
Answer: In most cases, exploring and analyzing large unstructured data sets becomes difficult with the lack of analysis tools. This is where Hadoop comes in as it offers storage, processing, and data collection capabilities. Hadoop stores data in its raw forms without the use of any schema and allows the addition of any number of nodes.
Since Hadoop is open-source and is run on commodity hardware, it is also economically feasible for businesses and organizations to use it for Big Data Analytics.
8. What are the Edge Nodes in Hadoop?
Answer: Edge nodes are gateway nodes in Hadoop that act as the interface between the Hadoop cluster and the external network. They run client applications and cluster administration tools and are used as staging areas for data transfers to the Hadoop cluster. Enterprise-class storage capabilities (like 900GB SAS drives with RAID HDD controllers) are required for edge nodes, and a single edge node usually suffices for multiple Hadoop clusters.
9. What are the steps of the data analysis process?
Answer: The analysis process has five steps:
- Step 1: Define Your Questions
- Step 2: Set Clear Measurement Priorities
- Step 3: Collect Data
- Step 4: Analyse Data
- Step 5: Interpret Results
10. What is MapReduce?
Answer: MapReduce is a core component of the Apache Hadoop software framework.
It is a programming model and an associated implementation for processing and generating large data sets with a parallel, distributed algorithm on a cluster, where each node of the cluster has its own storage.
Senior Data Architect Interview Questions
11. How is big data analysis helpful in increasing business revenue?
Answer: Big data analysis has become very important for businesses. It helps them differentiate themselves from competitors and increase revenue. Through predictive analytics, big data analytics provides businesses with customized recommendations and suggestions. It also enables businesses to launch new products based on customer needs and preferences. These factors help businesses earn more revenue, which is why companies are adopting big data analytics. Companies may see a revenue increase of 5-20% by implementing big data analytics. Some popular companies that use big data analytics to increase their revenue are Walmart, LinkedIn, Facebook, Twitter, Bank of America, etc.
12. What are the common input formats in Hadoop?
Answer: Below are the common input formats in Hadoop –
- Text Input Format – The default input format in Hadoop. It is used for plain text files, which are broken into lines.
- Sequence File Input Format – Used to read Hadoop sequence files, a binary format that stores key-value pairs.
- Key-Value Input Format – Used for plain text files in which each line is split into a key and a value by a separator (a tab by default).
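To make the difference between the two text-based formats more tangible, the following Python sketch imitates, outside Hadoop, what a mapper would receive in each case: a (byte offset, line) pair for the text format, and a (key, value) pair split on the first separator for the key-value format. The function names and the sample data are hypothetical; this is only a conceptual model, not Hadoop code.

```python
def text_input_records(data: bytes):
    """Yield (byte_offset, line) pairs, mimicking the text input format."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_records(data: bytes, separator: str = "\t"):
    """Yield (key, value) pairs by splitting each line on the first separator.

    A line without the separator becomes a key with an empty value.
    """
    for _, line in text_input_records(data):
        key, _, value = line.partition(separator)
        yield key, value


# Made-up sample file content.
sample = b"user1\tlogin\nuser2\tlogout\n"

print(list(text_input_records(sample)))
print(list(key_value_records(sample)))
```

The same bytes produce different keys in the two cases: a position in the file for the text format, and the text before the tab for the key-value format.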
13. How is NFS different from HDFS?
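Answer: There are several distributed file systems, each working in its own way. NFS (Network File System) is one of the oldest and most popular distributed file storage systems, whereas HDFS (Hadoop Distributed File System) is a more recent one, widely used to handle big data. The main difference is that NFS typically serves data stored on a single machine, with no built-in replication, whereas HDFS splits files into blocks that are distributed and replicated across the machines of a cluster.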
14. What is the use of jps command in Hadoop?
Answer: The jps command is used to check whether the Hadoop daemons are running properly. It shows all the daemons running on a machine, e.g., DataNode, NameNode, NodeManager, ResourceManager, etc.
15. What will happen with a NameNode that doesn’t have any data?
Answer: A NameNode without any data doesn’t exist in Hadoop. If there is a NameNode, it will contain some data in it or it won’t exist.
16. Define Big Data And Explain The Five Vs of Big Data?
Answer: One of the most introductory Big Data questions asked during interviews, the answer to this is fairly straightforward-
Big Data is defined as a collection of large and complex unstructured data sets from which insights are derived through data analysis using open-source tools like Hadoop.
The five Vs of Big Data are –
- Volume – Amount of data in Petabytes and Exabytes
- Variety – Includes formats like videos, audio sources, textual data, etc.
- Velocity – Everyday data growth which includes conversations in forums, blogs, social media posts, etc.
- Veracity – Degree of the accuracy of data available
- Value – Deriving insights from collected data to achieve business milestones and new heights.
17. Define and describe the term FSCK?
Answer: FSCK (File System Check) is a command used to run a Hadoop summary report that describes the state of the Hadoop file system. It is used to check the health of the distributed file system, for example when one or more file blocks become corrupt or unavailable. Unlike the traditional fsck utility for native file systems, the Hadoop FSCK only checks for errors and does not correct them. The command can be run on the whole file system or on a subset of files.
18. Name the different commands for starting up and shutting down Hadoop Daemons?
Answer: To start up all the Hadoop Daemons together-
./sbin/start-all.sh
To shut down all the Hadoop Daemons together-
./sbin/stop-all.sh
To start up all the daemons related to DFS, YARN, and MR Job History Server, respectively-
./sbin/start-dfs.sh
./sbin/start-yarn.sh
./sbin/mr-jobhistory-daemon.sh start historyserver
To stop the DFS, YARN, and MR Job History Server daemons, respectively-
./sbin/stop-dfs.sh
./sbin/stop-yarn.sh
./sbin/mr-jobhistory-daemon.sh stop historyserver
The final way is to start up the Hadoop Daemons individually (use stop in place of start to shut each one down) –
./sbin/hadoop-daemon.sh start namenode
./sbin/hadoop-daemon.sh start datanode
./sbin/yarn-daemon.sh start resourcemanager
./sbin/yarn-daemon.sh start nodemanager
./sbin/mr-jobhistory-daemon.sh start historyserver
19. Explain the different features of Hadoop?
Answer: Listed in many Big Data Interview Questions and Answers, the answer to this is-
- Open-Source – Open-source frameworks include source code that is available to and accessible by everyone over the World Wide Web. The code can be rewritten, edited, and modified according to user and analytics requirements.
- Scalability – Although Hadoop runs on commodity hardware, additional hardware resources can be added as new nodes.
- Data Recovery – Hadoop allows the recovery of data by storing three replicas of each block across the cluster. Users can recover data from another node in case of failure, and Hadoop recovers failed tasks and nodes automatically.
- User-Friendly – For users who are new to Data Analytics, Hadoop is a convenient framework to use: its user interface is simple, and clients do not need to handle distributed computing processes, as the framework takes care of them.
- Data Locality – Hadoop features Data Locality, which moves computation to the data instead of data to the computation. The processing logic is sent to the nodes where the data resides, rather than bringing the data to the location where MapReduce jobs are submitted.
20. How does HDFS index data blocks? Explain.
Answer: HDFS indexes data blocks based on their respective sizes. The end of a data block points to the address where the next chunk of data blocks is stored. The DataNodes store the blocks of data, while the NameNode manages these data blocks by using an in-memory image of all the files of said data blocks. Clients receive information related to data blocks from the NameNode.
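A quick back-of-the-envelope calculation can help when discussing block sizes. The sketch below assumes a 128 MB block size (the default in Hadoop 2.x and later; Hadoop 1.x used 64 MB) and a replication factor of three; the function name `block_layout` and the 500 MB file are hypothetical choices for this example.

```python
MB = 1024 * 1024


def block_layout(file_size, block_size=128 * MB, replication=3):
    """Return how a file of file_size bytes would be cut into blocks.

    Assumes fixed-size blocks where only the last block may be smaller.
    """
    if file_size == 0:
        return {"blocks": 0, "last_block_bytes": 0, "raw_bytes": 0}
    blocks = -(-file_size // block_size)  # integer ceiling division
    last_block = file_size - (blocks - 1) * block_size
    return {
        "blocks": blocks,
        "last_block_bytes": last_block,
        # Total bytes stored on DataNodes once every block is replicated.
        "raw_bytes": file_size * replication,
    }


layout = block_layout(500 * MB)
print(layout["blocks"], "blocks; last block is",
      layout["last_block_bytes"] // MB, "MB; raw storage is",
      layout["raw_bytes"] // MB, "MB")
```

For a 500 MB file this gives four blocks (three full blocks and one of 116 MB) and 1500 MB of raw storage across the cluster.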
Enterprise Data Architect Interview Questions
1. How can businesses benefit from Big Data?
Answer: Big data analysis helps businesses work with real-time data.
It can inform crucial decisions about the strategy and development of the company.
At a large scale, big data helps businesses differentiate themselves in a competitive environment.
2. How does A/B testing work?
Answer: A/B testing is a great method for finding the best online promotional and marketing strategies for your organization. It compares two versions of the same item, and it can be used to check everything from search ads and emails to website copy. The main goal of A/B testing is to identify which modifications to a webpage maximize the result of interest.
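If the interviewer asks how you would decide whether a variant really performs better, one common option is a two-proportion z-test on the conversion rates. The sketch below is one way to do it with the Python standard library only; the function name and the visitor and conversion counts are invented for illustration, and other tests may suit a given experiment better.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two_sided_p_value) for conversion counts of variants A and B."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled conversion rate under the null hypothesis of no difference.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up numbers: 1000 visitors per variant, 100 vs 150 conversions.
z, p = two_proportion_z_test(100, 1000, 150, 1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

A small p-value suggests that the difference between the two page versions is unlikely to be due to chance alone; the threshold you apply (0.05 is a common convention) is a design decision to agree on before the test starts.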
3. How much data is enough to get a valid outcome?
Answer: Collecting data is like tasting wine: the amount has to be right. Every business is different and is measured in different ways. Thus, you can never have enough data, and there is no single right answer. The amount of data required depends on the methods you use and on having a good chance of obtaining meaningful results.
4. What should be carried out with missing data?
Answer: Missing data occurs when no value is stored for a variable, often because data collection was inadequate. Experienced employees should analyze the affected data to decide whether it is still adequate for the analysis.
5. What is JPS used for?
Answer: It is a command used to check whether daemons such as the Node Manager, Name Node, Resource Manager, and Job Tracker are running on the machine.
6. What do you mean by Task Instance?
Answer: A TaskInstance refers to a specific Hadoop MapReduce work process that runs on any given slave node. Each task instance has its very own JVM process that is created by default for aiding its performance.
7. What do you mean by “speculative execution” in context to Hadoop?
Answer: In certain cases, where a specific node slows down the performance of a given task, the master node can redundantly execute another instance of the same task on a separate node. In such a scenario, the task instance that completes first is accepted, while the other is killed. This entire process is referred to as “speculative execution”.
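The idea can be imitated on a single machine with threads. The sketch below is only an analogy, not how Hadoop schedules tasks: two attempts of the same task run in parallel, the first to finish wins, and the other is asked to stop. Python threads cannot be killed, so the cancellation is cooperative; the attempt names and durations are made up.

```python
import concurrent.futures
import threading
import time


def run_attempt(name, duration, cancelled):
    """Simulate a task attempt that takes `duration` seconds unless cancelled."""
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        if cancelled.is_set():
            return None  # this attempt lost the race and gives up
        time.sleep(0.01)
    return name


def speculative_run(attempts):
    """Run all attempts in parallel and return the name of the first to finish."""
    cancelled = threading.Event()
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(attempts)) as pool:
        futures = [pool.submit(run_attempt, name, duration, cancelled)
                   for name, duration in attempts]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        cancelled.set()  # tell the slower attempt to stop
        return next(iter(done)).result()


# The original attempt sits on a slow node; the speculative copy is faster.
winner = speculative_run([("attempt-on-slow-node", 0.5),
                          ("speculative-attempt", 0.05)])
print("accepted:", winner)
```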
8. Which classes are used by the Hive to Read and Write HDFS Files?
Answer: Following classes are used by Hive to read and write HDFS files
•TextInputFormat/HiveIgnoreKeyTextOutputFormat: These 2 classes read/write data in plain text file format.
•SequenceFileInputFormat/SequenceFileOutputFormat: These 2 classes read/write data in Hadoop SequenceFile format.
9. If you run Hive as a server, what mechanisms are available for connecting to it from an application?
Answer: You can connect to the Hive server in the following ways:
Thrift Client: Using thrift you can call hive commands from various programming languages e.g. C++, Java, PHP, Python, and Ruby.
JDBC Driver: It supports the Type 4 (pure Java) JDBC Driver
ODBC Driver: It supports the ODBC protocol.
10. Which database does Hive use for its metadata store? Which metastore configurations does Hive support?
Answer: Hive uses Derby by default and supports three types of metastore configuration:
- Embedded metastore – Uses a Derby database, backed by files stored on disk, to store the metadata. It cannot support multiple sessions at the same time, and the metastore service runs in the same JVM as Hive.
- Local metastore – In this case, we need a stand-alone database such as MySQL, which the metastore service communicates with. The benefit of this approach is that it can support multiple Hive sessions at a time. The metastore service still runs in the same process as Hive.
- Remote metastore – The metastore service and the Hive service run in different processes, with a stand-alone database such as MySQL.
Big Data Architect Interview Questions # 1) How do you write your own custom SerDe?
Answer: In most cases, users want to write a Deserializer instead of a SerDe, because users just want to read their own data format instead of writing to it.
- For example, the RegexDeserializer will deserialize the data using the configuration parameter ‘regex’, and possibly a list of column names
- If your SerDe supports DDL (basically, SerDe with parameterized columns and column types), you probably want to implement a Protocol based on DynamicSerDe, instead of writing a SerDe from scratch. The reason is that the framework passes DDL to SerDe through “thrift DDL” format, and it’s non-trivial to write a “thrift DDL” parser.
Big Data Architect Interview Questions # 2) What are Hadoop and its components?
Answer: When “Big Data” emerged as a problem, Apache Hadoop evolved as a solution to it. Apache Hadoop is a framework which provides us various services or tools to store and process Big Data. It helps in analyzing Big Data and making business decisions out of it, which can’t be done efficiently and effectively using traditional systems.
Big Data Architect Interview Questions # 3) What does ‘jps’ command do?
Answer: The ‘jps’ command helps us to check if the Hadoop daemons are running or not. It shows all the Hadoop daemons i.e namenode, datanode, resourcemanager, nodemanager, etc. that are running on the machine.
Big Data Architect Interview Questions # 4) What is the purpose of “RecordReader” in Hadoop?
Answer: The “InputSplit” defines a slice of work, but does not describe how to access it. The “RecordReader” class loads the data from its source and converts it into (key, value) pairs suitable for reading by the “Mapper” task. The “RecordReader” instance is defined by the “Input Format”.
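The division of labor between a split and its reader can be sketched in a few lines of Python. Here a split is just a (start, length) range over a byte string, and the hypothetical `read_split` function plays the reader's role, turning that range into (offset, line) pairs. It follows the convention commonly used by line-oriented readers: a split that does not start at byte 0 skips its first (possibly partial) line, and every split reads past its end to finish the last line it started. This is a simplified model under those assumptions, not Hadoop's implementation.

```python
def read_split(data: bytes, start: int, length: int):
    """Yield (byte_offset, line) pairs for the lines owned by one split."""
    end = start + length
    pos = start
    if start != 0:
        # Skip the first line: the previous split is responsible for it.
        newline = data.find(b"\n", pos)
        pos = len(data) if newline == -1 else newline + 1
    # A line belongs to this split if it starts at or before the split's end.
    while pos <= end and pos < len(data):
        newline = data.find(b"\n", pos)
        line_end = len(data) if newline == -1 else newline
        yield pos, data[pos:line_end].decode("utf-8")
        pos = line_end + 1


# Made-up file content, cut into two splits in the middle of a line.
data = b"alpha\nbeta\ngamma\ndelta\n"
first = list(read_split(data, 0, 10))
second = list(read_split(data, 10, len(data) - 10))
print(first)
print(second)
```

Even though the split boundary falls inside the file at an arbitrary byte, every line is delivered exactly once, which is the property a mapper relies on.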
Big Data Architect Interview Questions # 5) What is a UDF?
Answer: If some functions are unavailable in built-in operators, we can programmatically create User Defined Functions (UDF) to bring those functionalities using other languages like Java, Python, Ruby, etc. and embed it in Script file.
Big Data Architect Interview Questions # 6) What are the components of Apache HBase?
Answer: HBase has three major components, i.e. HMaster Server, HBase RegionServer and Zookeeper.
- Region Server: A table can be divided into several regions. A group of regions is served to the clients by a Region Server.
- HMaster: It coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
- ZooKeeper: ZooKeeper acts as a coordinator inside the HBase distributed environment. It helps in maintaining server state inside the cluster by communicating through sessions.
Big Data Architect Interview Questions # 7) How would you check whether your NameNode is working or not?
Answer: There are several ways to check the status of the NameNode. Mostly, one uses the jps command to check the status of all daemons running in the HDFS.
Big Data Architect Interview Questions # 8) Explain about the different catalog tables in HBase?
Answer: The two important catalog tables in HBase, are ROOT and META. ROOT table tracks where the META table is and META table stores all the regions in the system.
Big Data Architect Interview Questions # 9) What are the different relational operations in “Pig Latin” you worked with?
Answer: Different relational operators are:
- foreach
- order by
Big Data Architect Interview Questions # 10) How do “reducers” communicate with each other?
Answer: This is a tricky question. The “MapReduce” programming model does not allow “reducers” to communicate with each other. “Reducers” run in isolation.
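A tiny in-memory word count shows why reducers never need to talk to each other: the shuffle step groups all values for a key together, so each reduce call sees everything it needs for that key and nothing else. The function names below are hypothetical and the whole thing runs in a single Python process; it is a conceptual sketch of the programming model, not Hadoop code.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in the line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Each call handles one key in isolation; no shared state is needed."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(mapped).items())


print(word_count(["big data", "big cluster"]))
# {'big': 2, 'data': 1, 'cluster': 1}
```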
1. Tell us how big data and Hadoop are related to each other?
Answer: Big data and Hadoop are almost synonymous terms. With the rise of big data, Hadoop, a framework that specializes in big data operations, also became popular. Professionals can use the framework to analyze big data and help businesses make decisions.
Note: This question is commonly asked in a big data interview. You can go further to answer this question and try to explain the main components of Hadoop.
2. What are the five V’s of Big Data?
Answer: The five V’s of Big Data are as follows:
- Volume – Volume represents the amount of data, which is growing at a high rate, i.e. data volumes in petabytes.
- Velocity – Velocity is the rate at which data grows. Social media plays a major role in the velocity of growing data.
- Variety – Variety refers to the different data types i.e. various data formats like text, audios, videos, etc.
- Veracity – Veracity refers to the uncertainty of available data. Veracity arises due to the high volume of data that brings incompleteness and inconsistency.
- Value – Value refers to turning data into value. By turning accessed big data into values, businesses may generate revenue.
Note: This is one of the basic and significant questions asked in the big data interview. You can choose to explain the five V’s in detail if you see the interviewer is interested to know more. However, the names can even be mentioned if you are asked about the term “Big Data”.
3. Explain the steps to be followed to deploy a Big Data solution?
Answer: The following are the three steps to deploy a Big Data solution –
1. Data Ingestion: The first step for deploying a big data solution is the data ingestion i.e. extraction of data from various sources. The data source may be a CRM like Salesforce, Enterprise Resource Planning System like SAP, RDBMS like MySQL or any other log files, documents, social media feeds, etc. The data can be ingested either through batch jobs or real-time streaming. The extracted data is then stored in HDFS.
2. Data Storage: After data ingestion, the next step is to store the extracted data. The data can be stored either in HDFS or in a NoSQL database (e.g., HBase). HDFS storage works well for sequential access, whereas HBase suits random read/write access.
3. Data Processing: The final step in deploying a big data solution is data processing. The data is processed through one of the processing frameworks like Spark, MapReduce, Pig, etc.
4. Do you have any Big Data experience? If so, please share it with us?
Answer: How to Approach: There is no specific answer to the question as it is a subjective question and the answer depends on your previous experience. Asking this question during a big data interview, the interviewer wants to understand your previous experience and is also trying to evaluate if you are fit for the project requirement.
So, how will you approach the question? If you have previous experience, start with your duties in your past position and gradually add details to the conversation. Tell them about your contributions that made the project successful. This is generally the 2nd or 3rd question asked in an interview. The later questions build on it, so answer it carefully. You should also take care not to go overboard with a single aspect of your previous job. Keep it simple and to the point.
5. Will you optimize algorithms or code to make them run faster?
Answer: How to Approach: The answer to this question should always be “Yes.” Real-world performance matters and it doesn’t depend on the data or model you are using in your project.
The interviewer might also be interested to know if you have had any previous experience in code or algorithm optimization. For a beginner, it obviously depends on which projects they have worked on in the past. Experienced candidates can share their experience accordingly as well. However, be honest about your work, and it is fine if you haven’t optimized code in the past. Just let the interviewer know your real experience and you will be able to crack the big data interview.
6. Explain the different modes in which Hadoop runs?
Answer: Apache Hadoop runs in the following three modes –
- Standalone (Local) Mode – By default, Hadoop runs in a local mode i.e. on a non-distributed, single node. This mode uses the local file system to perform input and output operation. This mode does not support the use of HDFS, so it is used for debugging. No custom configuration is needed for configuration files in this mode.
- Pseudo-Distributed Mode – In the pseudo-distributed mode, Hadoop runs on a single node just like the Standalone mode. In this mode, each daemon runs in a separate Java process. As all the daemons run on a single node, there is the same node for both the Master and Slave nodes.
- Fully-Distributed Mode – In the fully-distributed mode, all the daemons run on separate individual nodes and thus form a multi-node cluster. There are different nodes for the Master and Slave roles.
7. Do you prefer good data or good models? Why?
Answer: How to Approach: This is a tricky question that is commonly asked in big data interviews. It asks you to choose between good data and good models. As a candidate, you should try to answer it from your experience. Many companies follow a strict process for evaluating data, meaning they have already selected their data models. In this case, having good data can be game-changing. The other way around also works, as a model is chosen based on good data.
As we already mentioned, answer it from your experience. However, don’t say that having both good data and good models is important as it is hard to have both in real-life projects.
8. What are the different configuration files in Hadoop?
Answer: The different configuration files in Hadoop are –
- core-site.xml – This configuration file contains Hadoop core configuration settings, for example, I/O settings common to MapReduce and HDFS. It specifies the hostname and port.
- mapred-site.xml – This configuration file specifies a framework name for MapReduce by setting mapreduce.framework.name
- hdfs-site.xml – This configuration file contains HDFS daemons configuration settings. It also specifies default block replication and permission checking on HDFS.
- yarn-site.xml – This configuration file specifies configuration settings for ResourceManager and NodeManager.
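All of these files share the same simple layout: a `configuration` root element containing `property` entries, each with a `name` and a `value`. The Python sketch below reads such a file into a dictionary. The XML fragment is an illustrative hdfs-site.xml-style sample written for this example (the values are assumptions, not recommendations), and `read_properties` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment in the style of hdfs-site.xml; values are made up.
SAMPLE_HDFS_SITE = """
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.permissions.enabled</name>
    <value>true</value>
  </property>
</configuration>
"""


def read_properties(xml_text):
    """Return a {name: value} dict for every <property> in the document."""
    root = ET.fromstring(xml_text.strip())
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.iter("property")}


settings = read_properties(SAMPLE_HDFS_SITE)
print(settings["dfs.replication"])
```

Values come back as strings, so a caller would convert them (for example with `int`) before doing arithmetic on them.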
9. What is commodity hardware?
Answer: Commodity hardware is a low-cost system characterized by lower availability and lower quality than high-end hardware. Commodity hardware still includes adequate RAM, because Hadoop runs a number of services that require RAM for their execution. One doesn’t require a high-end hardware configuration or supercomputers to run Hadoop; it can be run on any commodity hardware.
10. Explain the process that overwrites the replication factors in HDFS?
Answer: There are two methods to overwrite the replication factors in HDFS –
Method 1: On File Basis
In this method, the replication factor is changed on the basis of the file using the Hadoop FS shell. The command used for this is:
$ hadoop fs -setrep -w 2 /my/test_file
Here, test_file is the file whose replication factor will be set to 2.
Method 2: On Directory Basis
In this method, the replication factor is changed on a directory basis i.e. the replication factor for all the files under a given directory is modified.
$ hadoop fs -setrep -w 5 /my/test_dir
Here, test_dir is the name of the directory; the replication factor for all the files in it will be set to 5.