Bigdata Hadoop Interview Questions And Answers Pdf

In this Hadoop interview questions post, we have included the most frequently asked questions along with high-quality answers to help you ace the interview. The market for Big Data and Hadoop professionals is growing continuously.

Earlier, companies were mainly concerned with operational data, which represented less than 20% of their entire data. Later, they understood that analyzing all of their data would provide genuine business insights and decision-making ability. That was when big players like Yahoo, Facebook and Google began using Hadoop and Big Data related technologies. Today, nearly every fifth organization is moving to Big Data analytics, so the demand for Big Data and Hadoop jobs is rising rapidly. Accordingly, if you want to boost your career, Hadoop and Spark are the technologies you need; they will give you a great start whether you are a fresher or experienced.

Top 50 Bigdata Hadoop Interview Questions And Answers Pdf

Enterprises big and small, local and global, are looking for quality Big Data and Hadoop specialists. This definitive list of top Hadoop interview questions guides you through questions and answers on various topics such as MapReduce, Pig, Hive, HDFS, HBase and Hadoop clusters.

Here are the top 50 objective-type sample Hadoop interview questions, with their answers given just below them. These sample questions were framed by experts from SVR Technologies, who train Hadoop online, to give you an idea of the type of questions that may be asked in an interview. We have taken complete care to provide accurate answers to all the questions.

1. What is fsck?
Answer: fsck is the File System Check. Hadoop HDFS uses the fsck (filesystem check) command to check for various inconsistencies and to report problems with files in HDFS, for example missing blocks or under-replicated blocks. Unlike the traditional fsck utility for the native file system, it does not correct the errors it detects.

Normally the NameNode automatically corrects most of the recoverable failures. The filesystem check also ignores open files, but it provides an option to include all files during reporting. The HDFS fsck command is not a Hadoop shell command; it runs as bin/hdfs fsck. The filesystem check can run on the whole file system or on a subset of files:

hdfs fsck <path>
  [-list-corruptfileblocks |
  [-move | -delete | -openforwrite]
  [-files [-blocks [-locations | -racks]]]]
  [-includeSnapshots]

Path- Start checking from this path.

  • -delete- Delete corrupted files.
  • -files- Print out the files being checked.
  • -files -blocks- Print out the block report.
  • -files -blocks -locations- Print out locations for every block.
  • -files -blocks -racks- Print out network topology for data-node locations.
  • -includeSnapshots- Include snapshot data if the given path indicates a snapshottable directory or there are snapshottable directories under it.
  • -list-corruptfileblocks- Print out the list of missing blocks and the files they belong to.
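As a quick illustration, a few common invocations might look like this (a hedged sketch: a running HDFS cluster is assumed, and /user/data is a placeholder path):

```shell
# Check one directory, printing the files, their blocks and block locations.
hdfs fsck /user/data -files -blocks -locations

# List only the corrupt blocks under the root.
hdfs fsck / -list-corruptfileblocks

# Move files with missing blocks to /lost+found (use with caution).
hdfs fsck / -move
```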

2. Can free form SQL queries be used with Sqoop import command? If yes, then how can they be used?
Sqoop allows us to use free-form SQL queries with the import command. A free-form query is supplied with the -e or --query option; the query must contain the $CONDITIONS placeholder in its WHERE clause, and when --query is used the --target-dir value must be specified.
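A hedged example (the connect string, credentials, table and paths below are placeholders):

```shell
# Free-form SQL import; $CONDITIONS is mandatory in the WHERE clause so
# Sqoop can partition the query across mappers, and --target-dir is required.
sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username dbuser -P \
  --query 'SELECT o.id, o.total FROM orders o WHERE $CONDITIONS' \
  --split-by o.id \
  --target-dir /user/data/orders
```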

3. Explain about ZooKeeper in Kafka?
Apache Kafka uses ZooKeeper to be a highly distributed and scalable system. ZooKeeper is used by Kafka to store various configurations and use them across the cluster in a distributed manner. To achieve distributed-ness, configurations are distributed and replicated throughout the leader and follower nodes in the ZooKeeper ensemble. We cannot connect to Kafka directly, bypassing ZooKeeper, because if ZooKeeper is down Kafka will not be able to serve client requests.

4. Differentiate between Sqoop and DistCP?
DistCP utility can be used to transfer data between clusters whereas Sqoop can be used to transfer data only between Hadoop and RDBMS.

5. Is it suggested to place the data transfer utility sqoop on an edge node?
It is not suggested to place sqoop on an edge node or gateway node because the high data transfer volumes could risk the ability of Hadoop services on the same node to communicate. Messages are the lifeblood of any Hadoop service and high latency could result in the whole node being cut off from the Hadoop cluster.

6. Does Flume provide 100% reliability to the data flow?
Yes, Apache Flume provides end to end reliability because of its transactional approach in the data flow.

7. How can Flume be used with HBase?

Apache Flume can be used with HBase using one of the two HBase sinks – HBaseSink (org.apache.flume.sink.hbase.HBaseSink) supports secure HBase clusters as well as the newer HBase IPC introduced in HBase 0.96.

AsyncHBaseSink (org.apache.flume.sink.hbase.AsyncHBaseSink) has better performance than HBaseSink as it can easily make non-blocking calls to HBase.

Working of the HBaseSink – In HBaseSink, a Flume event is converted into HBase increments or puts. The serializer implements the HBaseEventSerializer interface and is instantiated when the sink starts. For every event, the sink calls the initialize method on the serializer, which then translates the Flume event into the HBase increments and puts to be sent to the HBase cluster.

Working of the AsyncHBaseSink – AsyncHBaseSink implements the AsyncHBaseEventSerializer. The initialize method is called only once by the sink, when it starts. The sink invokes the setEvent method and then calls the getIncrements and getActions methods, similar to HBaseSink. When the sink stops, the cleanUp method of the serializer is called.

8. Explain the different channel types in Flume. Which channel type is faster?
The 3 different built-in channel types available in Flume are-

  • MEMORY Channel – Events are read from the source into memory and passed to the sink.
  • JDBC Channel – JDBC Channel stores the events in an embedded Derby database.
  • FILE Channel – File Channel writes the contents to a file on the file system after reading the event from a source. The file is deleted only after the contents are successfully delivered to the sink.

MEMORY Channel is the fastest channel among the three, but it carries the risk of data loss. The channel you choose completely depends on the nature of the big data application and the value of each event.
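As a sketch, a minimal agent wired through a MEMORY channel might be configured like this (component names are hypothetical; switching the channel type to file or jdbc changes the durability trade-off):

```
# Hypothetical Flume agent: netcat source -> memory channel -> logger sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

agent1.sources.src1.type = netcat
agent1.sources.src1.bind = localhost
agent1.sources.src1.port = 44444
agent1.sources.src1.channels = ch1

agent1.channels.ch1.type = memory      # fastest, but events are lost on crash
agent1.channels.ch1.capacity = 1000

agent1.sinks.sink1.type = logger
agent1.sinks.sink1.channel = ch1
```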

9. What are the limitations of importing RDBMS tables into Hcatalog directly?
There is an option to import RDBMS tables into HCatalog directly by using the --hcatalog-database option together with --hcatalog-table, but the limitation is that several arguments such as --as-avrodatafile, --direct, --as-sequencefile, --target-dir and --export-dir are not supported.

10. Which is the reliable channel in Flume to ensure that there is no data loss?
FILE Channel is the most reliable channel among the 3 channels JDBC, FILE and MEMORY.

11. Does Apache Flume provide support for third-party plug-ins?
Yes. Apache Flume has a plug-in-based architecture: it can load data from external sources and transfer it to external destinations, so third-party plug-ins are supported.

12. Name a few companies that use Zookeeper?
Yahoo, Solr, Helprace, Neo4j, Rackspace. 

13. Is it possible to leverage real-time analysis on the big data collected by Flume directly? If yes, then explain how?

Data from Flume can be extracted, transformed and loaded in real-time into Apache Solr servers using MorphlineSolrSink.

14. Explain how Zookeeper works?
 ZooKeeper is referred to as the King of Coordination and distributed applications use ZooKeeper to store and facilitate important configuration information updates. ZooKeeper works by coordinating the processes of distributed applications. ZooKeeper is a robust replicated synchronization service with eventual consistency. A set of nodes is known as an ensemble and persisted data is distributed between multiple nodes.

Three or more independent servers collectively form a ZooKeeper cluster and elect a master. A client connects to any one of the servers and migrates if that particular node fails. The ensemble of ZooKeeper nodes is alive as long as the majority of nodes are working. The master node in ZooKeeper is dynamically selected by consensus within the ensemble, so if the master node fails, the role of the master migrates to another dynamically selected node. Writes are linear and reads are concurrent in ZooKeeper.

15. Differentiate between FileSink and FileRollSink?
The major difference between HDFS FileSink and FileRollSink is that HDFS File Sink writes the events into the Hadoop Distributed File System (HDFS) whereas File Roll Sink stores the events into the local file system. 

16. What is the role of Zookeeper in HBase architecture?
 In HBase architecture, ZooKeeper is the monitoring server that provides different services like –tracking server failure and network partitions, maintaining the configuration information, establishing communication between the clients and region servers, the usability of ephemeral nodes to identify the available servers in the cluster. 

17. List some examples of Zookeeper use cases?
Found by Elastic uses ZooKeeper comprehensively for resource allocation, leader election, high-priority notifications, and discovery. The entire service of Found is built up of various systems that read from and write to ZooKeeper.

Apache Kafka, which depends on ZooKeeper, is used by LinkedIn.
Apache Storm, which relies on ZooKeeper, is used by popular companies like Groupon.

Replicating and multiplexing selectors in Flume: channel selectors are used to handle multiple channels. Based on the Flume header value, an event can be written to a single channel or to multiple channels. If no channel selector is specified for the source, it defaults to the replicating selector, which writes the same event to every channel in the source's channel list. The multiplexing channel selector is used when the application has to send different events to different channels.

18. What are the additional benefits YARN brings in to Hadoop?
Effective utilization of resources, as multiple applications can run in YARN, all sharing a common resource pool. In Hadoop MapReduce there were separate slots for Map and Reduce tasks, whereas in YARN there is no fixed slot: the same container can be used for Map and Reduce tasks, leading to better utilization.

YARN is backward compatible, so all existing MapReduce jobs run on it without modification.
Using YARN, one can even run applications that are not based on the MapReduce model.

19. What are the modules that constitute the Apache Hadoop 2.0 framework?
Hadoop 2.0 contains four important modules of which 3 are inherited from Hadoop 1.0 and a new module YARN is added to it.

  • Hadoop Common – This module consists of all the basic utilities and libraries required by other modules.
  • HDFS- Hadoop Distributed file system that stores huge volumes of data on commodity machines across the cluster.
  • MapReduce- Java based programming model for data processing.
  • YARN- This is a new module introduced in Hadoop 2.0 for cluster resource management and job scheduling.

20. What are the different types of Znodes?
There are 2 types of Znodes namely- Ephemeral and Sequential znodes.

  1. Znodes that get destroyed as soon as the client that created them disconnects are referred to as ephemeral znodes.
  2. A sequential znode is one for which a sequence number is chosen by the ZooKeeper ensemble and appended to the name the client assigns to the node.

21. Explain about cogroup in Pig?
The COGROUP operator in Pig is used to work with multiple tuples. The COGROUP operator is applied to statements that involve two or more relations, and can be applied to up to 127 relations at a time. When using the COGROUP operator on two tables at once, Pig first groups both tables and then joins the two tables on the grouped columns.

22. How to use Apache Zookeeper command-line interface?
ZooKeeper has command-line client support for interactive use. The command-line interface of ZooKeeper is similar to the UNIX file system and shell. Data in ZooKeeper is stored in a hierarchy of znodes, where each node can contain data just like a file. Each node can also have children, just like directories in the UNIX file system.

Zookeeper-client command is used to launch the command-line client. If the initial prompt is hidden by the log messages after entering the command, users can just hit ENTER to view the prompt.
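A short interactive session might look like this sketch (the znode names are hypothetical and a reachable ensemble is assumed; the -e and -s flags create the ephemeral and sequential znodes described earlier):

```shell
zookeeper-client                    # or: bin/zkCli.sh -server host:2181
# Inside the shell:
#   ls /                            list the children of the root znode
#   create /app "config-v1"         create a persistent znode holding data
#   create -e /app/worker1 ""       ephemeral: removed when this session ends
#   create -s /app/task- ""         sequential: a counter suffix is appended
#   get /app                        read the znode's data
```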

23. What are different modes of execution in Apache Pig?
Apache Pig runs in 2 modes- one is the “Pig (Local Mode) Command Mode” and the other is the “Hadoop MapReduce (Java) Command Mode”. Local Mode requires access to only a single machine where all files are installed and executed on a local host whereas MapReduce requires accessing the Hadoop cluster.

24. What are the watches?
Client disconnection might be a troublesome problem, especially when we need to keep track of the state of znodes at regular intervals. ZooKeeper has an event system referred to as a watch, which can be set on a znode to trigger an event whenever the znode is removed or altered, or any new children are created below it.

25. How can you connect an application, if you run Hive as a server?
When running Hive as a server, the application can be connected in one of the 3 ways-

  • ODBC Driver-This supports the ODBC protocol
  • JDBC Driver- This supports the JDBC protocol
  • Thrift Client- This client can be used to make calls to all Hive commands from different programming languages such as PHP, Python, Java, C++, and Ruby.
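As an illustration of the JDBC route: with HiveServer2 (the current Hive server implementation), the connection is commonly exercised through the beeline client. The host, port and user below are placeholders:

```shell
# Connect to Hive over JDBC using beeline; hiveserver:10000 is hypothetical.
beeline -u jdbc:hive2://hiveserver:10000/default -n hiveuser
```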

26. Explain the differences between Hadoop 1.x and Hadoop 2.x?
In Hadoop 1.x, MapReduce is responsible for both processing and cluster management whereas in Hadoop 2.x processing is taken care of by other processing models and YARN is responsible for cluster management.

  • Hadoop 2.x scales better when compared to Hadoop 1.x with close to 10000 nodes per cluster.
  • Hadoop 1.x has a single point of failure problem and whenever the NameNode fails it has to be recovered manually. However, in case of Hadoop 2.x StandBy NameNode overcomes the SPOF problem and whenever the NameNode fails it is configured for automatic recovery.
  • Hadoop 1.x works on the concept of slots whereas Hadoop 2.x works on the concept of containers and can also run generic tasks.

27. What are the core changes in Hadoop 2.0?
Hadoop 2.x provides an upgrade to Hadoop 1.x in terms of resource management, scheduling, and the manner in which execution occurs. In Hadoop 2.x the cluster resource management capabilities work in isolation from the MapReduce-specific programming logic. This helps Hadoop share resources dynamically between multiple parallel processing frameworks, like Impala and the core MapReduce component. Hadoop 2.x allows workable and fine-grained resource configuration, leading to better and more efficient cluster utilization so that applications can scale to process a larger number of jobs.

28. What problems can be addressed by using Zookeeper?
In the development of distributed systems, creating their own protocols for coordinating the Hadoop cluster results in failure and frustration for developers. The architecture of a distributed system can be prone to deadlocks, inconsistency and race conditions. This leads to various difficulties in making the Hadoop cluster fast, reliable and scalable. To address all such problems, Apache ZooKeeper can be used as a coordination service to write correct distributed applications without having to reinvent the wheel from the beginning.

29. What does the overwrite keyword denote in Hive load statement?
 The OVERWRITE keyword in a Hive LOAD statement deletes the contents of the target table and replaces them with the files referred to by the file path. Without the OVERWRITE keyword, the referred files are simply added to the table.
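In HiveQL this looks like the following (run here through the hive CLI; the path and table name are hypothetical):

```shell
# Replaces everything in table sales with the files under the given path.
hive -e "LOAD DATA INPATH '/user/etl/staging/sales' OVERWRITE INTO TABLE sales;"

# Without OVERWRITE, the same files would be appended instead:
hive -e "LOAD DATA INPATH '/user/etl/staging/sales' INTO TABLE sales;"
```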

30. For what kind of big data problems, did the organization choose to use Hadoop?
Asking this question to the interviewer shows the candidates keen interest in understanding the reason for Hadoop implementation from a business perspective. This question gives the impression to the interviewer that the candidate is not merely interested in the Hadoop developer job role but is also interested in the growth of the company. 

31. What is SerDe in Hive? How can you write your own custom SerDe?
SerDe stands for Serializer/Deserializer. Hive uses a SerDe to read and write data from tables. Generally, users prefer to write a deserializer instead of a full SerDe when they want to read their own data format rather than write to it. If the SerDe supports DDL, i.e. basically a SerDe with parameterized columns and different column types, users can implement a protocol-based DynamicSerDe rather than writing the SerDe from scratch.

32. Differentiate between NFS, Hadoop NameNode and JournalNode?
HDFS is a write-once file system, so a user cannot update files once they are written. However, under certain enterprise scenarios like file uploading, file downloading, file browsing or data streaming, it is not possible to achieve all this using standard HDFS. This is where the distributed file system protocol Network File System (NFS) is used. NFS allows access to files on remote machines in the same way the local file system is accessed by applications.

  • Namenode is the heart of the HDFS file system that maintains the metadata and tracks where the file data is kept across the Hadoop cluster.
  • StandBy Nodes and Active Nodes communicate with a group of lightweight nodes to keep their state synchronized. These are known as Journal Nodes.

33. How can native libraries be included in YARN jobs?
There are two ways to include native libraries in YARN jobs-

  1. By setting the -Djava.library.path on the command line but in this case, there are chances that the native libraries might not be loaded correctly and there is a possibility of errors.
  2. The better option to include native libraries is to set the LD_LIBRARY_PATH in the .bashrc file.
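Both options might look like this in practice (the jar name, class name and library paths are placeholders):

```shell
# Option 1: pass java.library.path on the command line (risk: the native
# libraries may not be loaded correctly).
hadoop jar myapp.jar com.example.MyJob -Djava.library.path=/opt/native/lib

# Option 2 (preferred): export LD_LIBRARY_PATH in ~/.bashrc
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/native/lib
```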

34. What are the various tools you used in the big data and Hadoop projects you have worked on?
Your answer to these interview questions will help the interviewer understand your expertise in Hadoop based on the size of the Hadoop cluster and number of nodes. Based on the highest volume of data you have handled in your previous projects, the interviewer can assess your overall experience in debugging and troubleshooting issues involving huge Hadoop clusters.

The number of tools you have worked with helps an interviewer judge whether you are aware of the overall Hadoop ecosystem and not just MapReduce. Ultimately, selection depends on how well you communicate the answers to all these questions.

35. How is the distance between two nodes defined in Hadoop?
Measuring bandwidth is difficult in Hadoop, so the network is represented as a tree. The distance between two nodes in the tree plays a vital role in forming a Hadoop cluster and is defined by the network topology and the Java interface DNSToSwitchMapping. The distance is equal to the sum of each node's distance to their closest common ancestor. The method getDistance(Node node1, Node node2) is used to calculate the distance between two nodes, with the assumption that the distance from a node to its parent node is always 1.
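The arithmetic can be sketched with a small shell function (an illustration of the rule only, not Hadoop's DNSToSwitchMapping API; it assumes topology paths like /datacenter/rack/node whose components are equally long at each depth):

```shell
# Count hops (1 per path component stripped) from each node up to the
# closest common ancestor; the distance is the sum from both sides.
distance() {
  a=$1; b=$2; d=0
  while [ "$a" != "$b" ]; do
    if [ ${#a} -ge ${#b} ]; then a=${a%/*}; d=$((d+1)); fi
    if [ ${#b} -gt ${#a} ]; then b=${b%/*}; d=$((d+1)); fi
  done
  echo "$d"
}

distance /d1/r1/n1 /d1/r1/n1   # 0 - same node
distance /d1/r1/n1 /d1/r1/n2   # 2 - same rack
distance /d1/r1/n1 /d1/r2/n3   # 4 - same data center, different rack
```

Each strip is one hop toward the common ancestor, so nodes in the same rack give 1 + 1 = 2 and nodes in different racks of one data center give 2 + 2 = 4.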

36. What is your favorite tool in the Hadoop ecosystem?
The answer to this question will help the interviewer know more about the big data tools that you are well-versed with and are interested in working with. If you show affinity towards a particular tool, the probability that you will be deployed to work on that particular tool is higher. If you say that you have a good knowledge of all the popular big data tools like Pig, Hive, HBase, Sqoop and Flume, it shows that you have knowledge of the Hadoop ecosystem as a whole.

37. What is the size of the biggest Hadoop cluster a company X operates?
Asking this question helps a Hadoop job seeker understand the Hadoop maturity curve at a company. Based on the answer of the interviewer, a candidate can judge how much an organization invests in Hadoop and their enthusiasm to buy big data products from various vendors. The candidate can also get an idea on the hiring needs of the company based on their Hadoop infrastructure.

Based on the answer to this question, the candidate can ask the interviewer why the Hadoop infrastructure is configured in that particular way, why the company chose the selected big data tools, and how workloads are constructed in the Hadoop environment.

Asking this question to the interviewer gives the impression that you are not just interested in maintaining the big data system and developing products around it but are also seriously thoughtful on how the infrastructure can be improved to help business growth and make cost savings.

38. What are the features of Pseudo mode?
Just like the Standalone mode, Hadoop can also run on a single node in this mode. The difference is that each Hadoop daemon runs in a separate Java process. In Pseudo-distributed mode we need to configure all four configuration files (core-site.xml, hdfs-site.xml, mapred-site.xml and yarn-site.xml). In this case, all daemons run on one node, and thus both the Master and Slave node are the same.

The pseudo mode is suitable for both development and testing environments. In Pseudo mode, all the daemons run on the same machine.

The bullet points below, which often accompany this question, summarize the challenges of Big Data and why Hadoop is the best solution for storing and processing it:

  • Data Quality – In the case of Big Data, data is very messy, inconsistent and incomplete.
  • Discovery – Finding patterns and insights, even with a powerful algorithm, is very difficult.
  • Hadoop is an open-source software framework that supports the storage and processing of large data sets, and it stores huge files as they are (raw) without requiring any schema.
  • High scalability – We can add any number of nodes, hence enhancing performance dramatically.
  • Reliable – It stores data reliably on the cluster despite machine failure.
  • High availability – In Hadoop, data is highly available despite hardware failure. If a machine or its hardware crashes, we can access the data from another path.
  • Economic – Hadoop runs on a cluster of commodity hardware, which is not very expensive.
  • HDFS – The storage layer of Hadoop. It stores data reliably even in case of hardware failure, and provides high-throughput access to applications by reading data in parallel.
  • MapReduce – MapReduce is the data processing layer of Hadoop. Applications written against it process large structured and unstructured data sets stored in HDFS. MapReduce processes a huge amount of data in parallel by dividing the submitted job into a set of independent sub-tasks. It breaks the processing into two phases: Map, the first phase, where we specify all the complex logic, and Reduce, the second phase, where we specify lightweight processing like aggregation or summation.
  • YARN – YARN is the processing framework in Hadoop. It provides resource management and allows multiple data processing engines, for example real-time streaming, data science, and batch processing.
  • Easy to use – The client does not need to deal with distributed computing; the framework takes care of all of it.

39. How were you involved in data modeling, data ingestion, data transformation, and data aggregation?
You are likely to be involved in one or more phases when working with big data in a Hadoop environment. The answer to this question helps the interviewer understand what kind of tools you are familiar with. If you answer that your focus was mainly on data ingestion then they can expect you to be well-versed with Sqoop and Flume, if you answer that you were involved in data analysis and data transformation then it gives the interviewer an impression that you have expertise in using Pig and Hive.

40. What are the features of Fully-Distributed mode?
In this mode, all daemons execute in separate nodes forming a multi-node cluster. Thus, we allow separate nodes for Master and Slave.

We use this mode in the production environment, where 'n' machines form a cluster. Hadoop daemons run on this cluster of machines: there is one host on which the NameNode runs, and other hosts on which DataNodes run. A NodeManager is installed on every DataNode and is responsible for executing tasks on that DataNode.

The ResourceManager manages all these NodeManager. ResourceManager receives the processing requests. After that, it passes the parts of the request to corresponding NodeManager accordingly.

41. In your previous project, did you maintain the Hadoop cluster in-house or used Hadoop in the cloud?
Most of the organizations still do not have the budget to maintain Hadoop cluster in-house and they make use of Hadoop in the cloud from various vendors like Amazon, Microsoft, Google, etc. The interviewer gets to know about your familiarity with using Hadoop in the cloud because if the company does not have an in-house implementation then hiring a candidate who has knowledge about using Hadoop in the cloud is worth it.

42. What are the modes in which Hadoop run?

Apache Hadoop runs in three modes:

  • Local (Standalone) Mode – By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process. The local mode uses the local file system for input and output operations. It is also used for debugging purposes, and it does not support the use of HDFS. Further, in this mode no custom configuration is required in the configuration files.
  • Pseudo-Distributed Mode – Just like the Standalone mode, Hadoop runs on a single node in Pseudo-distributed mode. The difference is that each daemon runs in a separate Java process, so we need to configure all four configuration files. In this case, all daemons run on one node, and thus both the Master and Slave node are the same.
  • Fully-Distributed Mode – In this mode, all daemons execute on separate nodes forming a multi-node cluster, allowing separate nodes for Master and Slave.

43. Compare Hadoop and RDBMS?
Apache Hadoop is the future of the database because it stores and processes a large amount of data, which is not possible with a traditional database. There are some differences between Hadoop and RDBMS, as follows:

  • Architecture – A traditional RDBMS has ACID properties, whereas Hadoop is a distributed computing framework with two main components: a distributed file system (HDFS) and MapReduce.
  • Data acceptance – An RDBMS accepts only structured data, while Hadoop can accept both structured and unstructured data. This is a great feature of Hadoop, as we can store everything in our database and there will be no data loss.
  • Scalability – An RDBMS is a traditional database that provides vertical scalability: if the data grows, we have to upgrade the particular system's configuration. Hadoop provides horizontal scalability: we just add one or more nodes to the cluster when more capacity is required.
  • OLTP (real-time data processing) and OLAP – A traditional RDBMS supports OLTP (real-time data processing), which is not supported in Apache Hadoop. Apache Hadoop instead supports large-scale batch processing workloads (OLAP).
  • Cost – An RDBMS is licensed software, so we have to pay for it, whereas Hadoop is an open-source framework, so we don't need to pay for the software.

If you have any doubts or queries regarding Hadoop Interview Questions at any point you can ask that Hadoop Interview question to us in the comment section and our support team will get back to you.

44. How is security achieved in Hadoop?
Apache Hadoop achieves security by using Kerberos.

At a high level, there are three steps that a client must take to access a service when using Kerberos, each of which involves a message exchange with a server.

  • Authentication – The client authenticates itself to the authentication server. Then, receives a timestamped Ticket-Granting Ticket (TGT).
  • Authorization – The client uses the TGT to request a service ticket from the Ticket Granting Server.
  • Service Request – The client uses the service ticket to authenticate itself to the server.
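The three steps can be seen from the client side as follows (a sketch: the principal and path are hypothetical, and a Kerberized cluster is assumed):

```shell
kinit analyst@EXAMPLE.COM     # Authentication: obtain the TGT from the KDC
klist                         # inspect the cached tickets
hdfs dfs -ls /user/analyst    # Authorization and Service Request happen
                              # transparently inside the Hadoop client
```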

45. What are the features of Standalone (local) mode?
By default, Hadoop runs in a single-node, non-distributed mode, as a single Java process. In Standalone (local) mode:

  • No daemons run; everything executes inside a single JVM.
  • The local file system is used for input and output instead of HDFS.
  • No custom configuration is required in the configuration files.
  • It is the fastest mode to set up, which makes it well suited to running and debugging MapReduce programs during development.

46. What are the limitations of Hadoop?
Various limitations of Hadoop are:

The issue with small files – Hadoop is not suited to small files, which are a major problem in HDFS. A small file is significantly smaller than the HDFS block size (default 128 MB). HDFS works best with a small number of large files rather than a large number of small files: storing a huge number of small files overloads the NameNode, since the NameNode stores the namespace of HDFS.

HAR files, sequence files, and HBase overcome small file issues.

Processing Speed – MapReduce processes large data sets with a parallel, distributed algorithm, performing the Map and Reduce tasks. These tasks require a lot of time, thereby increasing latency: because data is distributed and processed over the cluster, overall processing speed is reduced.

Supports only Batch Processing – Hadoop supports only batch processing. It does not process streamed data, so overall performance is slower. The MapReduce framework does not leverage the memory of the cluster to the maximum.

Iterative Processing – Hadoop is not efficient for iterative processing, as it does not support cyclic data flow, that is, a chain of stages in which the input to each stage is the output of the previous one.

Vulnerable by nature – Hadoop is written entirely in Java, one of the most widely used languages. Java has been heavily exploited by cyber-criminals, which has implicated Hadoop in numerous security breaches.

Security – Managing a complex application such as Hadoop can be challenging. Hadoop lacks encryption at the storage and network levels, which is a major point of concern. Hadoop supports Kerberos authentication, which is hard to manage.

The core Hadoop interview questions above are aimed at experienced candidates, but freshers and students can also read and refer to them for advanced understanding.

47. Explain Data Locality in Hadoop?
A major drawback of Hadoop was cross-switch network traffic due to the huge volume of data. To overcome this drawback, data locality came into the picture. It refers to the ability to move the computation close to where the actual data resides on the node, instead of moving large data to the computation. Data locality increases the overall throughput of the system.

In Hadoop, HDFS stores datasets. Datasets are divided into blocks and stored across the data nodes in the Hadoop cluster. When a user runs the MapReduce job then NameNode sends this MapReduce code to the datanodes on which data is available related to MapReduce job.

Data locality has three categories:

  • Data local – In this category, the data is on the same node as the mapper working on it. In such a case the data is as close to the computation as possible; this is the most preferred scenario.
  • Intra-rack – In this scenario, the mapper runs on a different node but on the same rack, since it is not always possible to execute the mapper on the same node as the data due to constraints.
  • Inter-rack – In this scenario, the mapper runs on a different rack; this is used when it is not possible to execute the mapper on any node of the same rack due to resource constraints.

48. What are the different commands used to startup and shutdown Hadoop daemons?

  • To start all the Hadoop daemons use: ./sbin/start-all.sh
  • Then, to stop all the Hadoop daemons use: ./sbin/stop-all.sh
  • You can also start all the HDFS daemons together using ./sbin/start-dfs.sh, the YARN daemons together using ./sbin/start-yarn.sh, and the MR Job History server using ./sbin/mr-jobhistory-daemon.sh start historyserver. Then, to stop these daemons, use ./sbin/stop-dfs.sh, ./sbin/stop-yarn.sh and ./sbin/mr-jobhistory-daemon.sh stop historyserver.

Finally, the last way is to start all the daemons individually, and then stop them individually:
./sbin/hadoop-daemon.sh start namenode
./sbin/hadoop-daemon.sh start datanode
./sbin/yarn-daemon.sh start resourcemanager
./sbin/yarn-daemon.sh start nodemanager
./sbin/mr-jobhistory-daemon.sh start historyserver

49. What does jps command do in Hadoop?
The jps command helps us check whether the Hadoop daemons are running or not. It shows all the Hadoop daemons running on the machine, such as the NameNode, DataNode, ResourceManager and NodeManager.
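For instance (the output shown is illustrative only; the PIDs and the set of daemons depend on what is actually running on the node):

```shell
jps
# 2731 NameNode
# 2866 DataNode
# 3110 SecondaryNameNode
# 3212 ResourceManager
# 3365 NodeManager
# 3587 Jps
```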

(A related configuration note: fs.checkpoint.dir is the directory on the file system where the Secondary NameNode stores the temporary images of the edit logs; the edit logs and the FsImage are merged there for backup.)

50. How to debug Hadoop code?
First, check the list of MapReduce jobs currently running. Then check whether any orphaned jobs are running; if yes, you need to determine the location of the ResourceManager logs.

  • First of all, run "ps -ef | grep -i ResourceManager" and look for the log directory in the displayed result. Find the job-id in the displayed list and check whether there is an error message associated with that job.
  • Now, on the basis of the RM logs, identify the worker node involved in the execution of the task.
  • Log in to that node and run "ps -ef | grep -i NodeManager".
  • Examine the NodeManager log.
  • The majority of errors come from the user-level logs for each MapReduce job.

Note: Browse latest  Hadoop interview questions and Hadoop tutorial. Here you can check  Hadoop Training details and  Hadoop Videos for self learning. Contact +91 988 502 2027 for more information.
