1. What is Apache Hadoop?
Answer: Hadoop emerged as a solution to “Big Data” problems. It is an Apache project sponsored by the Apache Software Foundation (ASF): an open source software framework for distributed storage and distributed processing of large data sets. Open source means it is freely available and its source code can be changed to suit our requirements. Apache Hadoop makes it possible to run applications on a system with thousands of commodity hardware nodes. Its distributed file system provides rapid data transfer rates among nodes and allows the system to continue operating in case of node failure.
Apache Hadoop provides:
- Storage layer
- Batch processing engine
- Resource Management Layer.
2. How does big data analysis help businesses increase their revenue? Give an example.
Answer: Big data analysis helps businesses differentiate themselves. For example, Walmart, the world’s largest retailer in 2014 in terms of revenue, uses big data analytics to increase its sales through better predictive analytics, customized recommendations and new products launched on the basis of customer preferences and needs. Walmart observed a significant 10% to 15% increase in online sales, worth $1 billion in incremental revenue. Many more companies, such as Facebook, Twitter, LinkedIn, Pandora, JPMorgan Chase and Bank of America, use big data analytics to boost their revenue.
3. Explain the usage of Context Object?
Answer: The Context object helps the mapper interact with the rest of the Hadoop system. It can be used to update counters, report progress and provide application-level status updates. It also holds the configuration details for the job and the interfaces through which the mapper emits its output.
4. Explain the SMB Join in Hive.
Answer: In a Sort Merge Bucket (SMB) join in Hive, each mapper reads a bucket from the first table and the corresponding bucket from the second table, and then performs a merge sort join on them. SMB join is mainly used because there is no limit on file, partition or table size for the join, so it works best when the tables are large. The tables must be bucketed and sorted on the join columns, and all tables should have the same number of buckets.
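The merge step that each mapper performs on a pair of corresponding buckets can be sketched in plain Python. This is only an illustration of a sort-merge join over two buckets that are already sorted on the join key; the function name merge_join and the sample rows are hypothetical, and no Hive API is involved.

```python
def merge_join(left_bucket, right_bucket):
    """Join two lists of (key, value) rows that are already sorted by key."""
    joined = []
    i = j = 0
    while i < len(left_bucket) and j < len(right_bucket):
        left_key = left_bucket[i][0]
        right_key = right_bucket[j][0]
        if left_key < right_key:
            i += 1  # no match possible for this left row, advance left
        elif left_key > right_key:
            j += 1  # no match possible for this right row, advance right
        else:
            # Find the run of right rows that share this key.
            j_end = j
            while j_end < len(right_bucket) and right_bucket[j_end][0] == left_key:
                j_end += 1
            # Pair every left row with this key with every matching right row.
            while i < len(left_bucket) and left_bucket[i][0] == left_key:
                for k in range(j, j_end):
                    joined.append((left_key, left_bucket[i][1], right_bucket[k][1]))
                i += 1
            j = j_end
    return joined


# Hypothetical buckets, both sorted on the join key (first field).
orders = [(1, "order-a"), (3, "order-b"), (3, "order-c"), (7, "order-d")]
customers = [(1, "alice"), (2, "bob"), (3, "carol")]
result = merge_join(orders, customers)
print(result)
```

Because both inputs are sorted, each bucket is scanned once from start to end, which is why the approach suits large tables.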
5. Why do we need Hadoop?
Answer: Hadoop came into existence to deal with Big Data challenges. The challenges with Big Data are-
- Storage – The data is very large, so storing it is difficult.
- Security – Because the data is huge, keeping it secure is another challenge.
- Analytics – In Big Data we are often unaware of the kind of data we are dealing with, so analyzing it is even more difficult.
6. What are the main components of a Hadoop Application?
Answer: Hadoop applications have a wide range of technologies that provide great advantage in solving complex business problems.
Core components of a Hadoop application are-
1) Hadoop Common
2) HDFS
3) Hadoop MapReduce
4) YARN
- Data Access Components are – Pig and Hive
- Data Storage Component is – HBase
- Data Integration Components are – Apache Flume, Sqoop, Chukwa
- Data Management and Monitoring Components are – Ambari, Oozie, and Zookeeper.
- Data Serialization Components are – Thrift and Avro
- Data Intelligence Components are – Apache Mahout and Drill.
7. Can Apache Kafka be used without Zookeeper?
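Answer: With older Kafka releases it is not possible to use Apache Kafka without Zookeeper, because if Zookeeper is down Kafka cannot serve client requests. Newer Kafka releases can also run in KRaft mode, which replaces Zookeeper with a built-in consensus layer.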
8. Name some companies that use Hadoop.
Answer: Yahoo (one of the biggest users, and the contributor of more than 80% of the Hadoop code).
9. What is the best hardware configuration to run Hadoop?
Answer: The best configuration for executing Hadoop jobs is dual-core machines or dual processors with 4GB or 8GB of RAM that use ECC memory. Hadoop benefits greatly from ECC memory even though it is not a low-end component: it is recommended because many Hadoop users have experienced checksum errors when using non-ECC memory. However, the hardware configuration also depends on the workflow requirements and can change accordingly.
10. How will you test data quality?
Answer: All the data that has been collected could be important, but not all data is equal, so it is necessary to first define where the data came from and how it will be used and consumed. Data that will be consumed by vendors or customers within the business ecosystem should be checked for quality and cleaned. This can be done by applying stringent data quality rules and by inspecting properties such as conformity, perfection, repetition, reliability, validity and completeness of the data.
11. What is the difference between Hadoop and Traditional RDBMS?
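Answer: Hadoop and a traditional RDBMS differ in the data types they handle, their schema model, the applications they fit best and their read/write speed.
Hadoop:
- Data types – Processes semi-structured and unstructured data.
- Schema – Schema on read.
- Best fit for applications – Data discovery and massive storage/processing of unstructured data.
- Speed – Writes are fast.
RDBMS:
- Data types – Processes structured data.
- Schema – Schema on write.
- Best fit for applications – Best suited for OLTP and complex ACID transactions.
- Speed – Reads are fast.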
12. What are the steps involved in deploying a big data solution?
i) Data Ingestion – The foremost step in deploying a big data solution is to extract data from different sources, which could be an Enterprise Resource Planning system like SAP, a CRM like Salesforce or Siebel, an RDBMS like MySQL or Oracle, or log files, flat files, documents, images and social media feeds. This data needs to be stored in HDFS. Data can be ingested either through batch jobs that run every 15 minutes, once every night and so on, or through real-time streaming with latencies from 100 ms to 120 seconds.
ii) Data Storage – The subsequent step after ingesting data is to store it either in HDFS or in a NoSQL database like HBase. HBase storage works well for random read/write access, whereas HDFS is optimized for sequential access.
iii) Data Processing – The final step is to process the data using one of the processing frameworks such as MapReduce, Spark, Pig or Hive.
Differentiate between Structured and Unstructured data.
Answer: Data which can be stored in traditional database systems in the form of rows and columns, for example online purchase transactions, can be referred to as structured data. Data which can be stored only partially in traditional database systems, for example data in XML records, can be referred to as semi-structured data. Unorganized and raw data that cannot be categorized as semi-structured or structured is referred to as unstructured data. Facebook updates, tweets on Twitter, reviews, weblogs, etc. are all examples of unstructured data.
13. What is Big Data?
Answer: Big data is defined as the voluminous amount of structured, unstructured or semi-structured data that has huge potential for mining but is so large that it cannot be processed using traditional database systems. Big data is characterized by its high velocity, volume, and variety that requires cost-effective and innovative methods for information processing to draw meaningful business insights. More than the volume of the data – it is the nature of the data that defines whether it is considered as Big Data or not.
15. How can you overwrite the replication factors in HDFS?
Answer: The replication factor in HDFS can be modified or overwritten in 2 ways-
1) Using the Hadoop FS shell, the replication factor can be changed on a per-file basis with the command below-
$ hadoop fs -setrep -w 2 /my/test_file (test_file is the file whose replication factor will be set to 2)
2) Using the Hadoop FS shell, the replication factor of all files under a given directory can be changed with the command below-
$ hadoop fs -setrep -w 5 /my/test_dir (test_dir is the directory; all the files in it will have their replication factor set to 5)
16. How will you choose various file formats for storing and processing data using Apache Hadoop?
The decision to choose a particular file format is based on the following factors-
i) Schema evolution to add, alter and rename fields.
ii) Usage pattern like accessing 5 columns out of 50 columns vs accessing most of the columns.
iii)Applicability to be processed in parallel.
iv) Read/Write/Transfer performance vs block compression saving storage space
File formats that can be used with Hadoop include CSV, JSON, columnar formats, sequence files, Avro and Parquet.
- CSV Files: CSV files are an ideal fit for exchanging data between Hadoop and external systems. It is advisable not to use header and footer lines when using CSV files.
- JSON Files: Every record in a JSON file is self-describing. JSON stores both data and schema together in a record, which enables complete schema evolution and splittability. However, JSON files do not support block-level compression.
- Avro Files: This file format is best suited for long-term storage with a schema. Avro files store metadata with the data and also let you specify an independent schema for reading the files.
- Parquet Files: A columnar file format that supports block-level compression and is optimized for query performance, as it allows selection of 10 or fewer columns from records with 50+ columns.
- edits file: A log of changes that have been made to the namespace since the last checkpoint.
- Checkpoint Node: Keeps track of the latest checkpoint in a directory that has the same structure as the NameNode’s directory. It creates checkpoints for the namespace at regular intervals by downloading the edits and fsimage files from the NameNode and merging them locally. The new image is then uploaded back to the active NameNode.
- Backup Node: Provides the same checkpointing functionality as the Checkpoint Node, but also maintains an up-to-date in-memory copy of the file system namespace that is in sync with the active NameNode.
17. What are the most commonly defined input formats in Hadoop?
The most common Input Formats defined in Hadoop are:
- Text Input Format- This is the default input format defined in Hadoop. Each line of the file is read as one record.
- Key-Value Input Format- This input format is used for plain text files in which each line is split into a key and a value.
- Sequence File Input Format- This input format is used for reading sequence files.
18. Explain what happens if, during the PUT operation, HDFS block is assigned a replication factor 1 instead of the default value 3?
Answer: Replication factor is a property of HDFS that can be set for the entire cluster to control the number of times blocks are replicated, ensuring high data availability. For every block stored in HDFS with replication factor n, the cluster holds n-1 duplicate blocks. So, if the replication factor during the PUT operation is set to 1 instead of the default value 3, there will be only a single copy of the data. Under these circumstances, if the DataNode holding that copy crashes, the only copy of the data is lost.
19. What is a block and block scanner in HDFS?
- Block – The minimum amount of data that can be read or written is generally referred to as a “block” in HDFS. The default block size is 64MB in Hadoop 1.x and 128MB in Hadoop 2.x and later.
- Block Scanner – Block Scanner tracks the list of blocks present on a DataNode and verifies them to find any kind of checksum errors. Block Scanners use a throttling mechanism to reserve disk bandwidth on the datanode.
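The checksum verification a block scanner performs can be illustrated with a small Python sketch. It is a simplified model rather than HDFS code: zlib.crc32 stands in for the checksum HDFS uses, and the helper names write_block and scan_blocks as well as the block ids are hypothetical.

```python
import zlib


def write_block(data):
    """Store a block together with the checksum computed at write time."""
    return {"data": data, "checksum": zlib.crc32(data)}


def scan_blocks(blocks):
    """Return the ids of blocks whose stored checksum no longer matches their data."""
    return [
        block_id
        for block_id, block in blocks.items()
        if zlib.crc32(block["data"]) != block["checksum"]
    ]


blocks = {"blk_1": write_block(b"hello"), "blk_2": write_block(b"world")}
blocks["blk_2"]["data"] = b"w0rld"  # simulate on-disk corruption of one block
corrupt = scan_blocks(blocks)
print(corrupt)
```

A real scanner would also throttle how fast it reads blocks so that verification does not starve regular reads and writes; that part is left out here.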
20. What is commodity hardware?
Answer: Commodity hardware refers to inexpensive systems that do not offer high availability or high quality. Commodity hardware still needs adequate RAM, because specific Hadoop services execute in memory. Hadoop can run on any commodity hardware and does not require supercomputers or a high-end hardware configuration to execute jobs.
21. What is the process to change the files at arbitrary locations in HDFS?
Answer: HDFS does not support modifications at arbitrary offsets in a file, nor multiple writers. Files are written by a single writer in an append-only format, i.e. writes to a file in HDFS are always made at the end of the file.
22. Explain the process of inter-cluster data copying?
Answer: HDFS provides a distributed data copying facility, DistCp, to copy data from a source to a destination. When the data is copied between two Hadoop clusters, it is referred to as inter-cluster data copying. DistCp requires both source and destination to run the same or a compatible version of Hadoop.
23. Whenever a client submits a Hadoop job, who receives it?
Answer: In Hadoop 1.x the client submits the job to the JobTracker. The JobTracker asks the NameNode for the location of the data blocks the job needs, and then takes care of resource allocation for the job to ensure timely completion.
24. What are the core methods of a Reducer?
The 3 core methods of a reducer are –
- setup() – This method is used for configuring various parameters like the input data size, distributed cache, heap size, etc. It is called once at the start of the reduce task.
Function Definition – public void setup(context)
- reduce() – This is the heart of the reducer. It is called once per key, with the list of values associated with that key.
Function Definition – public void reduce(key, values, context)
- cleanup() – This method is called only once at the end of the reduce task, for clearing all the temporary files.
Function Definition – public void cleanup(context)
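The order in which these three methods are called can be sketched in Python. This is a toy model of the lifecycle, not the Hadoop Java API: the class WordCountReducer, the driver run_reduce_task and the dictionary used as a context are all hypothetical stand-ins.

```python
from itertools import groupby
from operator import itemgetter


class WordCountReducer:
    def setup(self, context):
        # Called once, before any key is processed.
        context["output"] = []

    def reduce(self, key, values, context):
        # Called once per key with all values for that key.
        context["output"].append((key, sum(values)))

    def cleanup(self, context):
        # Called once, after the last key has been processed.
        context["finished"] = True


def run_reduce_task(reducer, sorted_pairs):
    """Drive a reducer over (key, value) pairs that are already sorted by key."""
    context = {}
    reducer.setup(context)
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        reducer.reduce(key, (value for _, value in group), context)
    reducer.cleanup(context)
    return context


pairs = sorted([("hadoop", 1), ("hive", 1), ("hadoop", 1)])
context = run_reduce_task(WordCountReducer(), pairs)
print(context["output"])
```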
25. Explain about the indexing process in HDFS?
Answer: Indexing process in HDFS depends on the block size. HDFS stores the last part of the data that further points to the address where the next part of the data chunk is stored.
26. What happens to a NameNode that has no data?
Answer: There does not exist any NameNode without data. If it is a NameNode then it should have some sort of data in it.
27. What happens when a user submits a Hadoop job while the NameNode is down? Is the job put on hold, or does it fail?
Answer: The Hadoop job fails when the NameNode is down.
28. What is Row Key?
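Answer: Every row in an HBase table has a unique identifier known as the row key. Rows are stored sorted by row key, and the row key is what clients use to look up a row.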
29. Explain the difference between NAS and HDFS?
- NAS runs on a single machine and thus there is no probability of data redundancy whereas HDFS runs on a cluster of different machines thus there is data redundancy because of the replication protocol.
- NAS stores data on dedicated hardware whereas in HDFS all the data blocks are distributed across local drives of the machines.
- In NAS data is stored independent of the computation and hence Hadoop MapReduce cannot be used for processing whereas HDFS works with Hadoop MapReduce as the computations in HDFS are moved to data.
30. What do you understand by edge nodes in Hadoop?
Edge nodes are the interface between the Hadoop cluster and the external network. Edge nodes are used for running cluster administration tools and client applications. Edge nodes are also referred to as gateway nodes.
31. Explain about the partitioning, shuffle and sort phase?
- Shuffle Phase-Once the first map tasks are completed, the nodes continue to perform several other map tasks and also exchange the intermediate outputs with the reducers as required. This process of moving the intermediate outputs of map tasks to the reducer is referred to as Shuffling.
- Sort Phase- Hadoop MapReduce automatically sorts the set of intermediate keys on a single node before they are given as input to the reducer.
- Partitioning Phase-The process that determines which intermediate keys and value will be received by each reducer instance is referred to as partitioning. The destination partition is the same for any key irrespective of the mapper instance that generated it.
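The three phases above can be sketched together in Python. This is a single-process illustration under simplifying assumptions: the function names partition and shuffle_and_sort are hypothetical, and a CRC32 hash modulo the number of reducers stands in for whatever partitioner a real job is configured with.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Pick the reducer for a key. The result depends only on the key,
    so every mapper sends a given key to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle_and_sort(map_outputs, num_reducers):
    """Move each mapper's (key, value) pairs to its reducer, then sort per reducer."""
    reducer_inputs = defaultdict(list)
    for mapper_output in map_outputs:
        for key, value in mapper_output:
            reducer_inputs[partition(key, num_reducers)].append((key, value))
    return {reducer: sorted(pairs) for reducer, pairs in reducer_inputs.items()}


# Intermediate output of two hypothetical map tasks.
map_outputs = [
    [("b", 1), ("a", 1)],
    [("a", 1), ("c", 1)],
]
reducer_inputs = shuffle_and_sort(map_outputs, num_reducers=2)
for reducer, pairs in sorted(reducer_inputs.items()):
    print(reducer, pairs)
```

Both occurrences of the key "a" end up at the same reducer even though they were produced by different mappers, which is the property the partitioning phase guarantees.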
32. Is it possible to do an incremental import using Sqoop?
Yes, Sqoop supports two types of incremental imports-
- 1) Append
- 2) Last Modified
To insert only new rows, Append should be used in the import command; to insert new rows and also update existing ones, Last-Modified should be used in the import command.
33. What is a rack awareness and on what basis is data stored in a rack?
Answer: A rack in HDFS is a group of data nodes at the same physical location that together form a storage area. The rack information, i.e. the rack id of each data node, is acquired by the NameNode. The process of selecting closer data nodes based on the rack information is known as Rack Awareness.
The contents present in the file are divided into data block as soon as the client is ready to load the file into the Hadoop cluster. After consulting with the NameNode, the client allocates 3 data nodes for each data block. For each data block, there exist 2 copies in one rack and the third copy is present in another rack. This is generally referred to as the Replica Placement Policy.
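The replica placement described above can be sketched as follows. This is a deterministic toy version, assuming a replication factor of 3 and the commonly described layout of one replica on the writer's node and two on a single remote rack; the function place_replicas, the topology dictionary and the node names are hypothetical, and a real cluster chooses nodes with more criteria than shown here.

```python
def place_replicas(topology, writer_node):
    """Choose 3 DataNodes for a block.

    topology maps a rack id to the list of DataNode names in that rack.
    Replica 1 goes to the writer's node; replicas 2 and 3 go to two
    different nodes on one other rack.
    """
    local_rack = next(rack for rack, nodes in topology.items() if writer_node in nodes)
    remote_rack = next(
        rack
        for rack, nodes in topology.items()
        if rack != local_rack and len(nodes) >= 2
    )
    second, third = topology[remote_rack][:2]
    return [writer_node, second, third]


topology = {"rack1": ["dn1", "dn2"], "rack2": ["dn3", "dn4"]}
replicas = place_replicas(topology, "dn1")
print(replicas)
```

With this layout the block survives the loss of either rack, while only one of the three copies has to cross racks twice.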
34. When should you use HBase and what are the key components of HBase?
HBase should be used when the big data application has –
- A variable schema
- When data is stored in the form of collections
- If the application demands key-based access to data while retrieving.
Key components of HBase are:
- Region- This component contains a memory data store and Hfile.
- Region Server-This monitors the Region.
- HBase Master- It is responsible for monitoring the region server.
- Zookeeper- It takes care of the coordination between the HBase Master component and the client.
- Catalog Tables- The two important catalog tables are ROOT and META. The ROOT table tracks where the META table is, and the META table stores all the regions in the system.
35. What are the challenges that you faced when implementing Hadoop projects?
Answer: Interviewers want to know about the issues you have encountered in the past when working with Hadoop clusters and how you addressed them. The way you answer this question says a lot about your expertise in troubleshooting and debugging Hadoop clusters. The more issues you have encountered, the more likely it is that you have become an expert in that area of Hadoop. Ensure that you list all the issues that you have troubleshot.
36. Explain the difference between RDBMS data model and the HBase data model?
- RDBMS is a schema-based database whereas HBase is schema-less data model.
- RDBMS does not have support for in-built partitioning whereas in HBase there is automated partitioning.
- RDBMS stores normalized data whereas HBase stores de-normalized data.
37. What are column families? What happens if you alter the block size of a column family on an already populated database?
Answer: The logical division of data is represented through a key known as the column family. Column families are the basic unit of physical storage, on which compression features can be applied. In an already populated database, when the block size of a column family is altered, the old data remains within the old block size whereas new data that comes in takes the new block size. When compaction takes place, the old data takes the new block size so that the existing data is read correctly.
38. Explain the difference between HBase and Hive?
Answer: HBase and Hive are completely different Hadoop-based technologies. Hive is a data warehouse infrastructure on top of Hadoop, whereas HBase is a NoSQL key-value store that runs on top of Hadoop. Hive helps SQL-savvy people run MapReduce jobs, whereas HBase supports 4 primary operations: put, get, scan and delete. HBase is ideal for real-time querying of big data, whereas Hive is an ideal choice for analytical querying of data collected over a period of time.
39. What are the different operational commands in HBase at a record level and table level?
- Record Level Operational Commands in HBase are –put, get, increment, scan and delete.
- Table Level Operational Commands in HBase are-describe, list, drop, disable and scan.
40. What are the different types of tombstone markers in HBase for deletion?
There are 3 different types of tombstone markers in HBase for deletion-
- Family Delete Marker- This marker marks all columns for a column family.
- Version Delete Marker-This marker marks a single version of a column.
- Column Delete Marker- This marker marks all the versions of a column.
41. Explain about the different catalog tables in HBase?
Answer: The two important catalog tables in HBase, are ROOT and META. ROOT table tracks where the META table is and META table stores all the regions in the system.
42. Explain the process of row deletion in HBase?
Answer: On issuing a delete command in HBase through the HBase client, data is not actually deleted from the cells but rather the cells are made invisible by setting a tombstone marker. The deleted cells are removed at regular intervals during compaction.
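The delete-then-compact behaviour can be illustrated with a small Python model. It is a deliberately simplified sketch that ignores timestamps and versions; the class Table and its methods are hypothetical and only mimic the idea of tombstone markers, not the HBase client API.

```python
class Table:
    def __init__(self):
        self.cells = {}          # (row, column) -> value, as stored on disk
        self.tombstones = set()  # cells that have been marked as deleted

    def put(self, row, column, value):
        self.tombstones.discard((row, column))
        self.cells[(row, column)] = value

    def delete(self, row, column):
        # Nothing is removed here; the cell is only marked with a tombstone.
        self.tombstones.add((row, column))

    def get(self, row, column):
        # Cells hidden by a tombstone are invisible to readers.
        if (row, column) in self.tombstones:
            return None
        return self.cells.get((row, column))

    def compact(self):
        # Compaction is the point where deleted cells are physically removed.
        for cell in self.tombstones:
            self.cells.pop(cell, None)
        self.tombstones.clear()


table = Table()
table.put("row1", "cf:name", "hadoop")
table.delete("row1", "cf:name")
visible_after_delete = table.get("row1", "cf:name")
stored_after_delete = ("row1", "cf:name") in table.cells
table.compact()
stored_after_compaction = ("row1", "cf:name") in table.cells
print(visible_after_delete, stored_after_delete, stored_after_compaction)
```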
43. How can Sqoop be used in a Java program?
Answer: The Sqoop jar should be included in the classpath of the Java code. After this, the Sqoop.runTool() method must be invoked. The necessary parameters should be passed to Sqoop programmatically, just as they would be on the command line.
44. How to write a custom partitioner for a Hadoop MapReduce job?
Steps to write a Custom Partitioner for a Hadoop MapReduce Job-
- A new class must be created that extends the pre-defined Partitioner Class.
- The getPartition method of the Partitioner class must be overridden.
- The custom partitioner can be added to the job as a config file in the wrapper which runs Hadoop MapReduce, or by calling the job's setPartitionerClass method.
45. How can you check all the tables present in a single database using Sqoop?
The command to check the list of all tables present in a single database using Sqoop is as follows-
sqoop list-tables --connect jdbc:mysql://localhost/user
46. What is the process to perform an incremental data load in Sqoop?
The process to perform incremental data load in Sqoop is to synchronize the modified or updated data (often referred to as delta data) from RDBMS to Hadoop. The delta data can be facilitated through the incremental load command in Sqoop.
The incremental load can be performed by using the Sqoop import command or by loading the data into Hive without overwriting it. The different attributes that need to be specified during incremental load in Sqoop are-
- Mode (incremental) –The mode defines how Sqoop will determine what the new rows are. The mode can have value as Append or Last-Modified.
- Col (Check-column) –This attribute specifies the column that should be examined to find out the rows to be imported.
- Value (last-value) –This denotes the maximum value of the check column from the previous import operation.
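How the three attributes work together can be sketched in Python. This is a conceptual model of the selection logic only, not of Sqoop itself: the function incremental_import, the row dictionaries and the id and updated columns are hypothetical.

```python
def incremental_import(source_rows, target, mode, check_column, last_value):
    """Copy rows whose check column is greater than last_value into target.

    target is a dict keyed by row id. Returns the new last value, which
    would be remembered for the next incremental run.
    """
    if mode not in ("append", "lastmodified"):
        raise ValueError("mode must be 'append' or 'lastmodified'")
    imported = [row for row in source_rows if row[check_column] > last_value]
    for row in imported:
        if mode == "append" and row["id"] in target:
            continue  # append only adds new rows, it never rewrites existing ones
        target[row["id"]] = dict(row)
    return max((row[check_column] for row in imported), default=last_value)


source = [
    {"id": 1, "name": "ann", "updated": 10},
    {"id": 2, "name": "bob-renamed", "updated": 25},
    {"id": 3, "name": "cy", "updated": 30},
]
previous_import = {
    1: {"id": 1, "name": "ann", "updated": 10},
    2: {"id": 2, "name": "bob", "updated": 12},
}

# Append mode: check the id column, pick up only rows added since id 2.
appended = dict(previous_import)
last_id = incremental_import(source, appended, "append", "id", 2)

# Last-modified mode: check the timestamp column, pick up new and changed rows.
merged = dict(previous_import)
last_ts = incremental_import(source, merged, "lastmodified", "updated", 12)

print(last_id, appended[2]["name"], last_ts, merged[2]["name"])
```

In append mode the renamed row 2 is not picked up because its id is not greater than the last value, whereas last-modified mode sees its newer timestamp and updates it.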
47. Explain about some important Sqoop commands other than import and export?
Create Job (--create)
Here we are creating a job with the name myjob, which can import table data from an RDBMS table to HDFS. The following command creates a job that imports data from the employee table in the DB database to an HDFS file.
$ sqoop job --create myjob \
-- import \
--connect jdbc:mysql://localhost/DB \
--username root \
--table employee -m 1
Verify Job (--list)
The ‘--list’ argument is used to verify the saved jobs. The following command lists the saved Sqoop jobs.
$ sqoop job --list
Inspect Job (--show)
The ‘--show’ argument is used to inspect a particular job and its details. The following command inspects a job called myjob.
$ sqoop job --show myjob
Execute Job (--exec)
The ‘--exec’ option is used to execute a saved job. The following command executes a saved job called myjob.
$ sqoop job --exec myjob
48. How are large objects handled in Sqoop?
Answer: Sqoop provides the capability to store large-sized data into a single field based on the type of data. Sqoop supports the ability to store-
- CLOB – Character Large Objects
- BLOB – Binary Large Objects
Large objects in Sqoop are handled by importing them into a file referred to as a “LobFile”, i.e. Large Object File. The LobFile has the ability to store records of huge size, thus each record in the LobFile is a large object.
49. Explain about HLog and WAL in HBase?
Answer: All edits in the HStore are stored in the HLog. Every region server has one HLog, which contains entries for the edits of all regions served by that region server. WAL stands for Write Ahead Log, in which all the HLog edits are written immediately. In the case of deferred log flush, WAL edits remain in memory until the flush period.
50. Explain about the core components of Flume?
The core components of Flume are –
- Event- The single log entry or unit of data that is transported.
- Source- This is the component through which data enters Flume workflows.
- Sink- It is responsible for transporting data to the desired destination.
- Channel- it is the duct between the Sink and Source.
- Agent- Any JVM that runs Flume.
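How these components hand data to one another can be sketched in Python. This is a minimal in-memory model, not Flume code: the Channel class and the source and sink functions are hypothetical, and the event layout with headers and a body is an assumption made for the illustration.

```python
from collections import deque


class Channel:
    """Buffer that sits between a source and a sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take(self):
        return self._events.popleft() if self._events else None


def source(lines, channel):
    """Wrap each incoming log line in an event and hand it to the channel."""
    for line in lines:
        channel.put({"headers": {}, "body": line})


def sink(channel, destination):
    """Drain the channel and deliver each event body to the destination."""
    while True:
        event = channel.take()
        if event is None:
            break
        destination.append(event["body"])


# The "agent" here is simply this process wiring the three parts together.
log_lines = ["GET /index.html", "GET /about.html"]
channel = Channel()
destination = []
source(log_lines, channel)
sink(channel, destination)
print(destination)
```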