31 Big Data Interview Questions For Freshers

1. What is a checkpoint ?

In brief, “checkpointing” is a process that takes the FsImage and edit log and compacts them into a new FsImage. Thus, instead of replaying the edit log, the NameNode can load its final in-memory state directly from the FsImage. This is a far more efficient operation and reduces NameNode startup time. Checkpointing is performed by the Secondary NameNode.

2. Explain the indexing process in HDFS ?

The indexing process in HDFS depends on the block size. HDFS stores the last part of each data chunk as a pointer to the address where the next chunk of data is stored.

3. Whenever a client submits a Hadoop job, who receives it ?

The NameNode receives the Hadoop job; it then looks up the data requested by the client and provides the block information. The JobTracker takes care of resource allocation for the Hadoop job to ensure timely completion.

4. What is a “Combiner” ?

A “Combiner” is a mini “reducer” that performs the local “reduce” task. It receives input from the “mapper” on a particular “node” and sends its output to the “reducer”. “Combiners” enhance the efficiency of “MapReduce” by reducing the amount of data that must be sent to the “reducers”.
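To make the idea concrete, here is a minimal plain-Java sketch (not actual Hadoop API code — class and method names are illustrative) of how a combiner pre-aggregates one mapper’s word counts locally before anything is shuffled to the reducer:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerSketch {
    // Combiner step: locally sum the (word, 1) pairs emitted by one mapper,
    // so only one (word, partialCount) pair per word travels to the reducer.
    static Map<String, Integer> combine(List<String> mapperOutputWords) {
        Map<String, Integer> partialCounts = new HashMap<>();
        for (String word : mapperOutputWords) {
            partialCounts.merge(word, 1, Integer::sum);
        }
        return partialCounts;
    }

    public static void main(String[] args) {
        // One node's mapper emitted six (word, 1) pairs ...
        List<String> emitted = List.of("big", "data", "big", "big", "data", "hadoop");
        Map<String, Integer> combined = combine(emitted);
        // ... but after combining, only three pairs are shuffled to the reducer.
        System.out.println(combined.size()); // 3
        System.out.println(combined.get("big")); // 3
    }
}
```

In real Hadoop code the same effect is achieved by registering a reducer class as the combiner, so the local pre-aggregation runs on each mapper node.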

5. What are active and passive “NameNodes” ?

In HA (High Availability) architecture, we have two NameNodes – Active “NameNode” and Passive “NameNode”.

Active “NameNode” is the “NameNode” which works and runs in the cluster.
Passive “NameNode” is a standby “NameNode”, which has similar data as active “NameNode”.
When the active “NameNode” fails, the passive “NameNode” takes over in the cluster. Hence, the cluster always has a running “NameNode” and never fails on this account.

6. Is it possible for multiple users to share the same metastore in the case of embedded Hive ?

No, it is not possible to use the metastore in sharing mode. It is recommended to use a standalone “real” database such as MySQL or PostgreSQL.

7. What is Apache HCatalog ?

HCatalog is built on top of the Hive metastore and incorporates Hive’s DDL. Apache HCatalog is a table and data management layer for Hadoop; data registered in HCatalog can be processed using Apache Pig, Apache MapReduce, and Apache Hive. With HCatalog there is no need to worry about where the data is stored or in which format it was generated. HCatalog displays data from RCFile format, text files, or sequence files in a tabular view. It also provides REST APIs so that external systems can access these tables’ metadata.

8. How can big data support organizations ?

Big data has the potential to support organizations in many ways. Information extracted from big data can be used in,
• Improve coordination with customers and stakeholders, and resolve problems
• Improve reporting and analysis for product or service improvements
• Customize products and services for selected markets
• Ensure better information sharing
• Support management decisions
• Identify new opportunities, product ideas, and new markets
• Gather data from multiple sources and archive it for future reference
• Maintain databases and systems
• Determine performance metrics
• Understand interdependencies between business functions
• Evaluate organizational performance


9. What are the main challenges big data companies normally encounter ?

A majority of companies across different industries struggle to extract optimum value from the data in their possession. The important challenges are handling the four Vs (volume, variety, velocity, and veracity), deriving value from the data, and operationalizing the results.

10. Name some of the important tools useful for Big Data analytics ?


Rattle GUI
Hadoop
Hive
Pig
Mahout
Flume

11. Why are counters useful in Hadoop ?

Counters are an integral part of any Hadoop job.
They are very useful for gathering relevant statistics about a job.
For example, suppose a job runs on a 150-node cluster with 150 mappers and some of the input records are malformed.
Counters can be used to keep a final count of all such records across the mappers and present it as a single output.

12. Differentiate between Structured and Unstructured data ?

Structured Data:
• Fits a fixed schema and can be handled by established (older) algorithms
• Examples: spreadsheet data, data from machine sensors

Unstructured Data:
• Has no predefined format and requires specialized processing
• Examples: human language text; mixed files as seen in Windows Explorer or the Mac Finder

13. What are the three modes in which Hadoop can run ?

Standalone (local) mode
Pseudo-distributed mode (single-node cluster)
Fully distributed mode (multi-node cluster)

14. What are the characteristics of big data ?

Big data has three main characteristics: Volume, Variety, and Velocity.
The Volume characteristic refers to the size of the data. Estimates show that over 3 million GB of data is generated every day. Processing this volume of data is not possible on a normal personal computer, or on a client-server network in an office environment with limited compute bandwidth and storage capacity. However, cloud services provide solutions that can handle big data volumes and process them efficiently using distributed computing architectures.
The Variety characteristic refers to the format of big data: structured or unstructured. Traditional RDBMS data fits the structured format. Examples of unstructured data are video files, image files, plain text from web documents, and standard MS Word documents, each with its own format. Note that an RDBMS does not have the capacity to handle unstructured data formats. Further, all this unstructured data must be grouped and consolidated, which creates the need for specialized tools and systems. In addition, new data is added each day, or even each minute, and the data grows continuously. Hence big data is strongly associated with variety.
The Velocity characteristic refers to the speed at which data is created and the efficiency required to process all of it. For example, Facebook is accessed by over 1.6 billion users in a month. Likewise, there are other social network sites, YouTube, Google services, etc. Such data streams must be processed with queries in real time and must be stored without data loss. Thus, the velocity characteristic is important in big data processing.
In addition, other characteristics include veracity and value. Veracity determines the dependability and reliability of the data, and value is the benefit organizations derive from big data processing.

15. Define HDFS and YARN, and talk about their respective components ?

The Hadoop Distributed File System (HDFS) is the storage unit that’s responsible for storing different types of data blocks in a distributed environment.

The two main components of HDFS are-

NameNode – The master node; it maintains and processes the metadata for the data blocks stored in HDFS
DataNode – Slave nodes that store the actual data blocks, as managed by the NameNode
Yet Another Resource Negotiator (YARN) is the processing component of Apache Hadoop and is responsible for managing resources and providing an execution environment for processes.

The two main components of YARN are-

ResourceManager – Receives processing requests and allocates them to the respective NodeManagers based on processing needs.
NodeManager – Executes tasks on every single DataNode.

16. Explain how big data can be used to increase business value ?

Analyzing big data helps businesses identify their position in markets and differentiate themselves from their competitors. For example, from the results of big data analysis, organizations can recognize the need for customized products or can identify potential markets for increasing revenue and value. Analyzing big data involves grouping data from various sources to understand trends and information related to the business. When big data analysis is done in a planned manner, gathering data from the right sources, organizations can increase business value and revenue by roughly 5% to 20%. Some examples of such organizations are Amazon, LinkedIn, Walmart, and many others.

17. What is the functionality of the Query Processor in Apache Hive ?

This component implements the processing framework for converting SQL into a graph of MapReduce jobs, and the execution framework to run those jobs in the order of their dependencies.

18. Wherever (Different Directory) I run hive query, it creates new metastore_db, please explain the reason for it ?

Whenever you run Hive in embedded mode, it creates a local metastore. Before creating the metastore, it checks whether a metastore already exists. This behavior is defined in the configuration file hive-site.xml.

The property is “javax.jdo.option.ConnectionURL”, with default value “jdbc:derby:;databaseName=metastore_db;create=true”. Because the database name is relative, a new metastore_db is created in whichever directory you run Hive from. To change this behavior, set the location to an absolute path; the metastore will then always be used from that single location.
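As a minimal sketch of how this might look in hive-site.xml (the database path /opt/hive/metastore_db below is an assumed example, not a required location):

```xml
<!-- hive-site.xml: point the embedded Derby metastore at one fixed, absolute
     path, so every session shares the same metastore_db instead of creating
     a new one in the current working directory. The path is an example. -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:derby:;databaseName=/opt/hive/metastore_db;create=true</value>
</property>
```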

19. What is SerDe in Apache Hive ?

“SerDe” is short for Serializer/Deserializer.
Hive uses a SerDe (and FileFormat) to read and write table data. An important concept behind Hive is that it DOES NOT own the Hadoop File System (HDFS) format in which the data is stored. Users can write files to HDFS with whatever tools or mechanisms they like (for example, “CREATE EXTERNAL TABLE” or “LOAD DATA INPATH”) and use Hive to correctly “parse” that file format in a way that Hive can use.
A SerDe is thus a powerful (and customizable) mechanism that Hive uses to “parse” data stored in HDFS.
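As a hedged illustration, an external table over existing comma-delimited files might declare its SerDe explicitly; the table name, columns, and HDFS path below are hypothetical:

```sql
-- Hypothetical external table: Hive parses the existing HDFS files
-- through the declared SerDe instead of owning the storage format.
CREATE EXTERNAL TABLE web_logs (
  ip  STRING,
  ts  STRING,
  url STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION '/data/web_logs';
```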

20. Name some Big Data Products ?


21. How is Hadoop related to Big Data? Describe its components ?

Another fairly simple question. Apache Hadoop is an open-source framework used for storing, processing, and analyzing complex unstructured data sets for deriving insights and actionable intelligence for businesses.

The three main components of Hadoop are-

MapReduce – A programming model which processes large datasets in parallel
HDFS – A Java-based distributed file system used for data storage without prior organization
YARN – A framework that manages resources and handles requests from distributed applications

22. What is the work of Hive/Hcatalog ?

Hive/HCatalog also enables sharing of data structures with external systems, including traditional data management tools.

23. What is WebHCatServer ?

The WebHCat server provides a REST-like web API for HCatalog. Applications make HTTP requests to run Pig, Hive, and HCatalog DDL commands from within applications.

24. Where does Hive store table data by default ?

The default location for the storage of table data by Hive is /user/hive/warehouse on HDFS. This location is controlled by the hive.metastore.warehouse.dir property in hive-site.xml.


25. What is IBM’s simple explanation for Big Data’s four critical features ?

Big Data features:

Volume: Scale of Data
Velocity: Analysis of streaming Data
Variety: Different forms of Data
Veracity: Uncertainty of Data

26. Define the Port Numbers for Name Node, Task Tracker and Job Tracker ?

NameNode – Port 50070

Task Tracker – Port 50060

Job Tracker – Port 50030

(These are the classic default web UI ports from Hadoop 1.x; in Hadoop 3.x, for example, the NameNode web UI moved to port 9870.)

27. What are some of the data management tools used with Edge Nodes in Hadoop ?

Oozie, Ambari, Hue, Pig, and Flume are the most common data management tools that work with edge nodes in Hadoop. Other similar tools include HCatalog, BigTop, and Avro.

28. What is Hive ?


Hive is data warehouse software that facilitates querying and managing large datasets residing in distributed storage.

Hive’s query language, called HiveQL, looks very much like SQL. Hive also allows traditional MapReduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express the logic in HiveQL (user-defined functions, UDFs).
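As a brief illustration of how closely HiveQL resembles SQL, here is a simple aggregation query; the table and column names are hypothetical:

```sql
-- Hypothetical employees table; the syntax is plain SQL,
-- but Hive compiles this into MapReduce jobs under the hood.
SELECT department, COUNT(*) AS employee_count
FROM employees
GROUP BY department
ORDER BY employee_count DESC;
```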

29. State some key components of a Hadoop application ?

Hadoop Common
HDFS
Hadoop MapReduce
YARN

30. Name some companies that use Hadoop ?

Yahoo (one of the biggest users and the contributor of more than 80% of the Hadoop code). Other well-known users include Facebook, Netflix, Amazon, Adobe, eBay, and Twitter.

31. What are the steps involved in big data solutions ?

Big data solutions follow three standard steps in their implementation. They are:
Data ingestion: This step defines the approach used to extract and consolidate data from multiple sources. For example, data sources can be social network feeds, CRM systems, RDBMSs, etc. The data extracted from the different sources is stored in the Hadoop Distributed File System (HDFS).
Data storage: In this second step, the extracted data is stored, either in HDFS or in HBase (a NoSQL database).
Data processing: In this last step, the stored data must be processed. Processing is done using tools such as Spark, Pig, MapReduce, and others.
