1. What is Big Data?
Answer: Big Data describes large volumes of data, both structured and unstructured.
The term usually refers to the use of predictive analytics, user behavior analytics, and other advanced data analytics methods that extract value from data, and seldom to a particular size of data set.
The challenges include capture, storage, search, sharing, transfer, analysis, and curation.
2. What do you know about the term “Big Data”?
Answer: Big Data is a term associated with complex and large datasets. A relational database cannot handle big data, and that’s why special tools and methods are used to perform operations on a vast collection of data. Big data enables companies to understand their business better and helps them derive meaningful information from the unstructured and raw data collected on a regular basis. Big data also allows the companies to make better business decisions backed by data.
3. Explain the NameNode recovery process?
Answer: The NameNode recovery process involves the following steps to get the Hadoop cluster running:
In the first step of the recovery process, the file system metadata replica (FsImage) is used to start a new NameNode.
The next step is to configure the DataNodes and clients so that they acknowledge the new NameNode.
During the final step, the new NameNode starts serving clients once it has finished loading the last checkpoint FsImage and has received block reports from the DataNodes.
Note: Remember to mention that this NameNode recovery process takes a long time on large Hadoop clusters, which makes routine maintenance difficult. For this reason, the HDFS high availability architecture is recommended.
4. What is the purpose of the JPS command in Hadoop?
Answer: The jps command is used to test whether all Hadoop daemons are running correctly. It specifically checks daemons such as the NameNode, DataNode, ResourceManager, NodeManager, and others.
5. Explain the core methods of a Reducer?
Answer: There are three core methods of a reducer. They are:
setup() – Configures different parameters like distributed cache, heap size, and input data.
reduce() – The method that is called once per key within the reduce task.
cleanup() – Clears all temporary files; it is called only once, at the end of a reducer task.
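The call order of these three methods can be sketched in plain Python. This is only an illustration of the lifecycle, not the Hadoop Java API: the WordCountReducer class and the run_reduce_task driver are hypothetical names invented for this example.

```python
from collections import defaultdict


class WordCountReducer:
    """Hypothetical stand-in for a Hadoop Reducer; not the real Java API."""

    def setup(self):
        # Runs once before any key is processed: prepare per-task state.
        self.calls = []
        self.output = {}

    def reduce(self, key, values):
        # Runs once per key, with every value grouped under that key.
        self.calls.append("reduce:" + key)
        self.output[key] = sum(values)

    def cleanup(self):
        # Runs once after the last key: release temporary resources.
        self.calls.append("cleanup")


def run_reduce_task(reducer, pairs):
    """Drive the reducer as a framework would: setup, reduce per key, cleanup."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    reducer.setup()
    for key in sorted(grouped):  # keys are handed to the reducer in sorted order
        reducer.reduce(key, grouped[key])
    reducer.cleanup()
    return reducer.output


reducer = WordCountReducer()
counts = run_reduce_task(reducer, [("pig", 1), ("hive", 1), ("pig", 1)])
print(counts)         # {'hive': 1, 'pig': 2}
print(reducer.calls)  # ['reduce:hive', 'reduce:pig', 'cleanup']
```

Note that reduce() appears once per distinct key, while setup() and cleanup() each run a single time per task.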
6. Where does Big Data come from?
Answer: There are three sources of Big Data:
Social Data: It comes from social media channels and provides insights into consumer behavior.
Machine Data: It consists of real-time data generated by sensors and web logs, and it tracks user behavior online.
Transaction Data: It is generated by large retailers and B2B companies on a frequent basis.
7. How are file systems checked in HDFS?
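Answer: A file system controls how data is stored and retrieved.
Each file system has a different structure and logic, with its own properties of speed, security, flexibility, and size.
Examples of file systems include NTFS, UFS, XFS, and HDFS. In HDFS, the health of the file system is checked with the fsck command (hdfs fsck), which reports problems such as missing, corrupt, or under-replicated blocks.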
8. What are the four features of Big Data?
Answer: The four V’s of Big Data are Volume, Velocity, Variety, and Veracity. Together they determine the perceived value of data: data is only as valuable as the business results it brings, such as improvements in operational efficiency.
9. What are some of the interesting facts about Big Data?
Answer: According to industry experts, digital information will grow to 40 zettabytes by 2020.
Surprisingly, more than 500 new sites come into existence every single minute of the day.
With the growing number of smartphones, companies are investing in apps to bring mobility to their business.
It is said that Walmart collects 2.5 petabytes of data every hour from its consumer transactions.
10. How will you define checkpoint?
Answer: Checkpointing is the main part of maintaining file system metadata in HDFS. It creates a checkpoint of the file system metadata by merging the fsimage with the edit log. The new version of the fsimage is called a checkpoint.
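The merge step can be illustrated with a small Python sketch. It is a simplified model under stated assumptions: the real fsimage and edit log are binary files with a much richer format, and the dictionary layout, the operation names (create, delete, rename), and the checkpoint function are all hypothetical.

```python
def apply_edit(namespace, edit):
    """Apply one logged operation to a namespace dict (path -> size)."""
    op, path, *args = edit
    if op == "create":
        namespace[path] = args[0]
    elif op == "delete":
        namespace.pop(path, None)
    elif op == "rename":
        namespace[args[0]] = namespace.pop(path)
    else:
        raise ValueError(f"unknown operation: {op}")


def checkpoint(fsimage, edit_log):
    """Replay the edit log on top of the old image and return the new image."""
    new_image = dict(fsimage)  # leave the previous image untouched
    for edit in edit_log:
        apply_edit(new_image, edit)
    return new_image, []  # new fsimage plus an emptied edit log


fsimage = {"/a.txt": 10, "/b.txt": 20}
edit_log = [
    ("create", "/c.txt", 5),
    ("delete", "/a.txt"),
    ("rename", "/b.txt", "/d.txt"),
]
new_fsimage, new_edit_log = checkpoint(fsimage, edit_log)
print(new_fsimage)  # {'/c.txt': 5, '/d.txt': 20}
```

After a checkpoint, a restart only needs to load the new image instead of replaying a long edit log.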
11. What types of biases can happen through sampling?
Answer: The following types of bias can occur through sampling:
- Survivorship bias
- Selection bias
- Undercoverage bias
12. Pig Latin contains different relational operations; name them?
Answer: The important relational operations in Pig Latin include:
- foreach
- order by
13. What is the meaning of big data and how is it different?
Answer: Big data is the term used to represent all kinds of data generated on the internet. Hundreds of GB of data are generated on the internet by online activity alone. Here, online activity implies web activity, blogs, text, video/audio files, images, email, social network activity, and so on. Big data can be referred to as the data created from all these activities. Data generated online is mostly in unstructured form. In addition to online activities, big data also includes transaction data in databases, system log files, and data generated by smart devices such as sensors, IoT devices, and RFID tags.
Big data needs specialized systems and software tools to process all this unstructured data. In fact, according to some industry estimates, almost 85% of the data generated on the internet is unstructured. Relational databases usually have a structured format and are centralized, so RDBMS processing can be done quickly using a query language such as SQL. Big data, on the other hand, is very large and distributed across the internet, so processing it requires distributed systems and tools to extract information. Big data needs specialized tools such as Hadoop, Hive, or others, along with high-performance hardware and networks, to process it.
14. Why is big data important for organizations?
Answer: Big data is important because by processing it, organizations can obtain insights that support:
- Cost reduction
- Improvements in products or services
- Understanding of customer behavior and markets
- Effective decision making
- Greater competitiveness
15. What is big data solution implementation?
Answer: Big data solutions are implemented at a small scale first, based on a concept appropriate for the business. The result is a prototype solution, from which the business solution is scaled further. Some of the best practices followed in the industry include:
- Have clear project objectives and collaborate wherever necessary
- Gather data from the right sources
- Ensure the results are not skewed, because this can lead to wrong conclusions
- Be prepared to innovate by considering hybrid approaches to processing that include both structured and unstructured data, from both internal and external data sources
- Understand the impact of big data on existing information flows in the organization
16. Which hardware configuration is most beneficial for Hadoop jobs?
Answer: Dual-processor or dual-core machines with 4 or 8 GB of RAM and ECC memory are best for running Hadoop operations. Although ECC memory cannot be considered low-end, it is helpful for Hadoop users because it helps avoid checksum errors. The hardware configuration for different Hadoop jobs also depends on the process and workflow needs of specific projects and may have to be customized accordingly.
17. What is Hive Metastore?
Answer: The Hive metastore is a database that stores metadata about your Hive tables (e.g. table name, column names and types, table location, storage handler being used, number of buckets in the table, sorting columns if any, partition columns if any, etc.).
When you create a table, the metastore is updated with the information related to the new table, and this information is queried when you issue queries on that table.
The metastore is the central repository of Hive metadata. It has two parts: services and data. By default, it uses a Derby database on the local disk, which is referred to as the embedded metastore configuration. This configuration has the limitation that only one session can be served at any given point in time.
18. What kind of data warehouse application is Hive suitable for?
Answer: Hive is not a full database. The design constraints and limitations of Hadoop and HDFS impose limits on what Hive can do.
Hive is most suited for data warehouse applications, where
1) Relatively static data is analyzed,
2) Fast response times are not required, and
3) The data is not changing rapidly.
Hive doesn’t provide crucial features required for OLTP, Online Transaction Processing. It’s closer to being an OLAP tool, Online Analytic Processing. So, Hive is best suited for data warehouse applications, where a large data set is maintained and mined for insights, reports, etc.
19. What binary storage formats does Hive support?
Answer: Hive natively supports the text file format, but it also supports several binary formats: Sequence files, Avro data files, and RCFiles.
Sequence files: A general binary format that is splittable, compressible, and row-oriented. A typical use case: if we have lots of small files, we can use a sequence file as a container, where the file name is the key and the file content is stored as the value. Compression support can bring a large gain in performance.
Avro data files: Like sequence files, these are splittable, compressible, and row-oriented, but they additionally support schema evolution and bindings for multiple languages.
RCFiles: Record Columnar File, a column-oriented storage format. It breaks the table into row splits, and within each split it stores the values of the first column first, followed by the values of the second column, and so on.
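The difference between a row-oriented layout and the row-split, column-by-column layout described for RCFiles can be sketched in Python. This only illustrates the ordering of values; it does not reproduce the actual on-disk format, and the function names and sample rows are made up for the example.

```python
def row_oriented(rows):
    """Flatten rows one after another, as a row-oriented file would."""
    return [value for row in rows for value in row]


def columnar_by_split(rows, split_size):
    """Cut the rows into splits, then store each split column by column."""
    layout = []
    for start in range(0, len(rows), split_size):
        split = rows[start:start + split_size]
        for column in zip(*split):  # transpose the split: rows -> columns
            layout.extend(column)
    return layout


rows = [(1, "a", 1.5), (2, "b", 2.5), (3, "c", 3.5), (4, "d", 4.5)]

print(row_oriented(rows))
# [1, 'a', 1.5, 2, 'b', 2.5, 3, 'c', 3.5, 4, 'd', 4.5]

print(columnar_by_split(rows, split_size=2))
# [1, 2, 'a', 'b', 1.5, 2.5, 3, 4, 'c', 'd', 3.5, 4.5]
```

In the second layout, a query that needs only one column can skip the other columns' values inside each split.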
20. What are the main configuration parameters in a “MapReduce” program?
Answer: The main configuration parameters which users need to specify in the “MapReduce” framework are:
- Job’s input locations in the distributed file system
- Job’s output location in the distributed file system
- The input format of data
- The output format of data
- Class containing the map function
- Class containing the reduce function
- JAR file containing the mapper, reducer and driver classes
21. Differentiate between Sqoop and DistCp?
Answer: The DistCp utility is used to transfer data between clusters, whereas Sqoop is used only to transfer data between Hadoop and an RDBMS.
22. Talk about the different tombstone markers used for deletion purposes in HBase?
Answer: There are three main tombstone markers used for deletion in HBase. They are:
Family Delete Marker – Marks all the columns of a column family
Version Delete Marker – Marks a single version of a single column
Column Delete Marker – Marks all the versions of a single column
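How the three markers differ in scope can be sketched with a simplified Python model. The cell layout (family, qualifier, timestamp), the marker tuples, and the rule that family and column markers hide versions up to the marker's timestamp are assumptions of this sketch, not HBase's actual storage format or client API.

```python
def is_masked(cell, marker):
    """Return True if the tombstone marker hides the given cell."""
    family, qualifier, ts = cell
    kind = marker[0]
    if kind == "family":
        _, m_family, m_ts = marker
        return family == m_family and ts <= m_ts
    if kind == "column":
        _, m_family, m_qualifier, m_ts = marker
        return (family, qualifier) == (m_family, m_qualifier) and ts <= m_ts
    if kind == "version":
        _, m_family, m_qualifier, m_ts = marker
        return (family, qualifier, ts) == (m_family, m_qualifier, m_ts)
    raise ValueError(f"unknown marker kind: {kind}")


def visible_cells(cells, markers):
    """Keep only the cells that no tombstone marker hides."""
    return {
        cell: value
        for cell, value in cells.items()
        if not any(is_masked(cell, marker) for marker in markers)
    }


cells = {
    ("info", "name", 1): "Ann",
    ("info", "name", 2): "Anne",
    ("info", "city", 1): "Oslo",
    ("stats", "visits", 1): 7,
}

version_only = visible_cells(cells, [("version", "info", "name", 2)])
column_only = visible_cells(cells, [("column", "info", "name", 2)])
family_only = visible_cells(cells, [("family", "info", 2)])

print(sorted(version_only))  # hides only ('info', 'name', 2)
print(sorted(column_only))   # hides both versions of info:name
print(sorted(family_only))   # hides every cell in the info family
```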
Hadoop trends constantly change with the evolution of Big Data which is why re-skilling and updating your knowledge and portfolio pieces are important.
Be prepared to answer questions related to Hadoop management tools, data processing techniques, and similar Big Data Hadoop interview questions which test your understanding and knowledge of Data Analytics.
At the end of the day, your interviewer will evaluate whether or not you’re the right fit for their company, which is why you should tailor your portfolio to the prospective business or enterprise requirements.
23. What are the key steps in Big Data Solutions?
Answer: The key steps in Big Data solutions are ingesting data, storing data (data modelling), and processing data (data wrangling, data transformations, and querying data).
- Ingesting Data, from sources such as:
  - RDBMS: Relational Database Management Systems like Oracle, MySQL, etc.
  - ERP: Enterprise Resource Planning systems like SAP.
  - CRM: Customer Relationship Management systems like Siebel, Salesforce, etc.
  - Social media feeds and log files.
  - Flat files, docs, and images.
- Storing Data
  - Data storage formats
  - Data modelling
  - Metadata management
24. What is Big Data Analysis?
Answer: Big Data analysis is defined as the process of mining large structured and unstructured data sets.
It helps uncover underlying patterns, unfamiliar relationships, and other useful information within the data, leading to business benefits.
25. Where is the mapper’s intermediate data stored?
Answer: The mapper output is stored in the local file system of each individual mapper node.
The temporary directory location can be set in the configuration by the Hadoop administrator.
The intermediate data is cleaned up after the Hadoop job completes.
26. What is speculative execution?
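Answer: Speculative execution is an optimization technique in which a computer system performs some task that may not actually be needed.
This approach is employed in a variety of areas, including branch prediction in pipelined processors and optimistic concurrency control in database systems.
In Hadoop, it means launching a duplicate copy of a slow-running task on another node and using the result of whichever copy finishes first.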
27. What do you mean by logistic regression?
Answer: Also known as the logit model, logistic regression is a technique for predicting a binary outcome from a linear combination of predictor variables.
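As a minimal sketch of this definition, the prediction step can be written in a few lines of Python: the predictors are combined linearly and passed through the sigmoid function to get a probability, which is then thresholded into a binary result. The weights, bias, and threshold below are made-up values for illustration, not fitted from any data.

```python
import math


def predict_probability(weights, bias, features):
    """Return P(y = 1) as the sigmoid of a linear combination of the features."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def predict_label(weights, bias, features, threshold=0.5):
    """Turn the probability into a binary outcome (0 or 1)."""
    return int(predict_probability(weights, bias, features) >= threshold)


# Hypothetical coefficients; in practice they are learned from training data.
weights = [0.8, -0.4]
bias = -0.2

p = predict_probability(weights, bias, [2.0, 1.0])  # z is about 1.0
print(round(p, 3))                                  # 0.731
print(predict_label(weights, bias, [2.0, 1.0]))     # 1
print(predict_label(weights, bias, [0.0, 0.0]))     # 0 (z = -0.2, p < 0.5)
```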
28. How can Big Data help increase the revenue of businesses?
Answer: Big data is about using data to anticipate future events in a way that improves the bottom line. There are many ways to increase profit. Every channel, from email and websites to phone calls and personal interactions, yields information about customer behavior. A deeper understanding of consumers can improve business and customer loyalty. Big data brings an array of advantages to the table; what matters is using it efficiently to compete in an increasingly competitive environment.
29. What are the responsibilities of a data analyst?
Answer: The responsibilities of a data analyst include:
- Helping marketing executives know which products are the most profitable by season, customer type, region, and other features
- Tracking external trends relative to geographies, demographics, and specific products
- Ensuring customers and employees relate well
- Explaining the optimal staffing plans to cater to the needs of executives looking for decision support
30. What do you know about collaborative filtering?
Answer: Collaborative filtering is a set of technologies that forecast which items a particular consumer will like based on the preferences of many other individuals. It is essentially the technical term for asking people for suggestions.
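One common way to do this is user-based collaborative filtering, sketched below in Python. This is a minimal illustration, assuming cosine similarity over co-rated items and a similarity-weighted average of other users' ratings; the users, items, and ratings are invented, and real recommender systems use many other variants.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two users over the items both have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def recommend(ratings, user, top_n=1):
    """Rank items the user has not rated by similarity-weighted ratings of others."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
                weights[item] = weights.get(item, 0.0) + sim
    predicted = {item: scores[item] / weights[item] for item in scores}
    return sorted(predicted, key=lambda item: (-predicted[item], item))[:top_n]


# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"hadoop_book": 5, "spark_book": 4},
    "bob": {"hadoop_book": 5, "spark_book": 4, "hive_book": 5, "cooking_book": 1},
    "carol": {"hadoop_book": 1, "spark_book": 2, "hive_book": 2, "cooking_book": 5},
}

print(recommend(ratings, "alice"))  # ['hive_book']
```

Here bob's tastes match alice's most closely, so his high rating for hive_book carries the most weight.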
31. What is a block in Hadoop Distributed File System (HDFS)?
Answer: When a file is stored in HDFS, it is broken down into a set of blocks, and HDFS is unaware of what is stored in the file. The default block size in Hadoop is 128 MB, and this value can be configured for individual files.
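The arithmetic behind block splitting can be sketched as follows. The function name is hypothetical, and sizes are kept in whole megabytes to keep the example simple; the point is that the last block only holds whatever data is left over.

```python
def split_into_blocks(file_size_mb, block_size_mb=128):
    """Return the sizes (in MB) of the blocks a file of the given size occupies."""
    if file_size_mb < 0 or block_size_mb <= 0:
        raise ValueError("sizes must be non-negative and block size positive")
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        sizes.append(remainder)  # the last block only holds the leftover data
    return sizes


print(split_into_blocks(300))      # [128, 128, 44]
print(split_into_blocks(256))      # [128, 128]
print(split_into_blocks(300, 64))  # [64, 64, 64, 64, 44]
```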
32. Define Active and Passive Namenodes?
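Answer: The active NameNode is the NameNode that runs and serves requests in the cluster, whereas the passive (standby) NameNode holds the same data as the active NameNode so that it can take over if the active one fails.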
33. Which are the essential Hadoop tools for the effective working of Big Data?
Answer: Ambari, Hive, HBase, HDFS (Hadoop Distributed File System), Sqoop, Pig, ZooKeeper, NoSQL, Lucene/Solr, Mahout, Avro, Oozie, Flume, GIS Tools, Clouds, and SQL on Hadoop are some of the many Hadoop tools that enhance the performance of Big Data.
34. It’s true that HDFS is to be used for applications that have large data sets. Why is it not the correct tool to use when there are many small files?
Answer: In most cases, HDFS is not a good fit for data spread across many small files. The reason is that the NameNode keeps the metadata for every file and block in memory, and that memory is a costly, limited resource. Many small files generate far more metadata than the same amount of data stored in a single large file, so the NameNode’s space is used up by bookkeeping rather than by useful data. When a large quantity of data belongs to a single file, the NameNode needs less space and performs better. With this in view, HDFS should be used for a small number of large files rather than a large number of small files.
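A rough count makes the difference concrete. The sketch below assumes a simplified model in which the NameNode tracks one entry per file and one entry per block, with a 128 MB block size; it ignores directories and replicas, and the function name is made up for the example.

```python
import math


def namespace_objects(file_sizes_mb, block_size_mb=128):
    """Count the file entries plus block entries the NameNode has to track."""
    blocks = sum(math.ceil(size / block_size_mb) for size in file_sizes_mb)
    return len(file_sizes_mb) + blocks


many_small_files = [1] * 10_000  # 10,000 files of 1 MB each
one_large_file = [10_000]        # the same 10,000 MB in a single file

print(namespace_objects(many_small_files))  # 20000
print(namespace_objects(one_large_file))    # 80
```

The same amount of data costs the NameNode 250 times more entries when it is stored as small files.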
35. What are the main distinctions between NAS and HDFS?
Answer: HDFS needs a cluster of machines for its operations, while NAS runs on just a single machine. Because of this, data redundancy becomes a common feature in HDFS. As the replication protocol is different in the case of NAS, the probability of the occurrence of redundant data is much less.
Data is stored on dedicated hardware in NAS. On the other hand, the local drives of the machines in the cluster are used for saving data blocks in HDFS.
Unlike HDFS, NAS is not suited to Hadoop MapReduce processing. This is because computation is not moved to the data in NAS: the data files are stored separately from the machines that do the computation.
36. What is ObjectInspector functionality?
Answer: Hive uses ObjectInspector to analyze the internal structure of the row object and also the structure of the individual columns.
ObjectInspector provides a uniform way to access complex objects that can be stored in multiple formats in the memory, including:
- An instance of a Java class (Thrift or native Java)
- A standard Java object (we use java.util.List to represent Struct and Array, and use java.util.Map to represent Map)
- A lazily-initialized object (for example, a Struct of string fields stored in a single Java string object with starting offset for each field)
A complex object can be represented by a pair of ObjectInspector and Java Object. The ObjectInspector not only tells us the structure of the Object but also gives us ways to access the internal fields inside the Object.
37. Is it possible to create multiple tables in the hive for the same data?
Answer: Yes. Hive creates a schema and applies it on top of an existing data file. One can have multiple schemas for one data file; each schema is saved in Hive’s metastore, and the data is not parsed or serialized to disk in a given schema. The schema is applied only when the data is retrieved. For example, if a file has 5 columns (Id, Name, Class, Section, Course), we can define multiple schemas over it by choosing any number of the columns.
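The idea of applying a schema only at read time can be sketched in Python. This is an illustration of the concept rather than of Hive itself: the sample data, the column names, and the read_with_schema helper are hypothetical, and the sketch assumes a schema names the leading columns of each row.

```python
import csv
import io

# One raw data "file" with 5 columns: Id, Name, Class, Section, Course.
RAW_DATA = "1,Asha,10,A,Math\n2,Ravi,11,B,Physics\n"


def read_with_schema(raw, schema):
    """Parse rows only at query time, keeping the leading columns the schema names."""
    reader = csv.reader(io.StringIO(raw))
    return [dict(zip(schema, row)) for row in reader]


# Two different schemas over the same, unchanged data.
full_table = read_with_schema(RAW_DATA, ["id", "name", "class", "section", "course"])
short_table = read_with_schema(RAW_DATA, ["id", "name"])

print(full_table[0]["course"])  # Math
print(short_table)              # [{'id': '1', 'name': 'Asha'}, {'id': '2', 'name': 'Ravi'}]
```

The raw data is never rewritten; each "table" is just a different way of interpreting it when it is read.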
38. Give examples of the SerDe classes which hive uses to Serialize and Deserialize data?
Answer: Hive currently uses these SerDe classes to serialize and deserialize data:
- MetadataTypedColumnsetSerDe: This SerDe is used to read/write delimited records such as CSV, tab-separated, or Control-A separated records (quoting is not supported yet).
- ThriftSerDe: This SerDe is used to read/write Thrift serialized objects. The class file for the Thrift object must be loaded first.
- DynamicSerDe: This SerDe also reads/writes Thrift serialized objects, but it understands Thrift DDL, so the schema of the object can be provided at runtime. It also supports many different protocols, including TBinaryProtocol, TJSONProtocol, and TCTLSeparatedProtocol (which writes data in delimited records).
39. Explain “Big Data” and what are five V’s of Big Data?
Answer: “Big data” is the term for a collection of large and complex data sets that are difficult to process using relational database management tools or traditional data processing applications. It is difficult to capture, curate, store, search, share, transfer, analyze, and visualize Big Data. Big Data has emerged as an opportunity for companies: they can now successfully derive value from their data and gain a distinct advantage over their competitors through enhanced business decision-making capabilities.