HBase Interview Questions and Answers
1. What are the key components of HBase?
Ans: The key components of HBase are Zookeeper, RegionServer and HBase Master.
Key components of HBase:
Region Server: A table is divided into several regions. Each Region Server serves a group of regions to clients.
HMaster: Coordinates and manages the Region Servers (similar to how the NameNode manages DataNodes in HDFS).
ZooKeeper: Acts as a coordinator within the HBase distributed environment. It helps maintain server state inside the cluster by communicating through sessions.
2. What do you mean by WAL?
Ans: WAL stands for Write-Ahead Log. It records every change to the data, whatever the type of operation, before the change is applied in memory. In early HBase versions it was stored as a standard Hadoop SequenceFile. Its main value is recovery: after a server crash or failure, edits that had not yet been flushed to disk can be replayed from the log, so they are not lost.
3. Explain the difference between HBase and Hive?
Ans: HBase and Hive are two quite different Hadoop-based technologies for data processing. Hive is a SQL-compatible data warehouse layer that acts as an abstraction on top of Hadoop, while HBase is a NoSQL key-value store. HBase's data access pattern is limited, with two primary read operations: get and scan. HBase is suited to real-time data access, whereas Hive is suited to batch processing.
4. What are decorating filters?
Ans: Decorating filters wrap another filter to modify or extend its behavior and gain additional control over the returned data. Examples include SkipFilter and WhileMatchFilter.
5. Compare HBase with Cassandra?
Ans: Both Cassandra and HBase are NoSQL databases, a term for which you can find numerous definitions. Generally, it means you cannot manipulate the database with SQL. However, Cassandra has implemented CQL (Cassandra Query Language), the syntax of which is obviously modeled after SQL.
Both are designed to manage extremely large data sets. HBase documentation proclaims that an HBase database should have hundreds of millions or — even better — billions of rows. Anything less, and you’re advised to stick with an RDBMS.
Both are distributed databases, not only in how data is stored but also in how the data can be accessed. Clients can connect to any node in the cluster and access any data.
In both Cassandra and HBase, the primary index is the row key, but data is stored on disk such that column family members are kept in close proximity to one another. It is, therefore, important to carefully plan the organization of column families. To keep query performance high, columns with similar access patterns should be placed in the same column family. Cassandra lets you create additional, secondary indexes on column values. This can improve data access in columns whose values have a high level of repetition — such as a column that stores the state field of a customer’s mailing address.
HBase lacks built-in support for secondary indexes but offers a number of mechanisms that provide secondary index functionality. These are described in HBase's online reference guide and by the HBase community.
6. What do you mean by the region server?
Ans: An HBase table usually holds far more data than a single server can or should handle, so the table is divided into regions that are spread across servers. A central controller, the HMaster, decides which server each region is placed on. A server that hosts regions and serves reads and writes for them is known as a Region Server. The term also refers to the regionservers configuration file, which lists the hosts that run Region Servers.
7. How do we back up a HBase cluster?
Ans: There are two broad strategies for performing HBase backups: backing up with a full cluster shutdown, and backing up on a live cluster. Each approach has benefits and limitations.
Full Shutdown Backup
Some environments can tolerate a periodic full shutdown of their HBase cluster, for example, if it is being used as a back-end process and not serving front-end webpages.
Stop HBase: Stop the HBase services first.
Distcp: Distcp can be used to copy the contents of the HBase directory in HDFS either to another directory on the same cluster or to a different cluster.
Restore: The backup of the HBase directory is copied from HDFS onto the 'real' HBase directory via distcp. Copying these files creates new HDFS metadata, so there is no need to restore the NameNode edits from the time of the HBase backup: this is a restore (via distcp) of a specific HDFS directory (the HBase part), not of the entire HDFS file system.
Live Cluster Backup
CopyTable: The CopyTable utility can be used to copy data from one table to another on the same cluster, or to a table on another cluster.
Export: The Export approach dumps the content of a table to HDFS on the same cluster.
8. When would you use HBase?
Ans: HBase is used when we need random, real-time read and write access and a high number of operations per second on large data sets.
HBase gives strong data consistency.
It can handle very large tables with billions of rows and millions of columns on top of a commodity hardware cluster.
9. How does HBase support bulk data loading?
Ans: There are two main steps in an HBase bulk load.
Generate HBase data files (StoreFiles) from the data source using a custom MapReduce job. The StoreFiles are created in HBase's internal format, which can be loaded efficiently.
Import the prepared files into a running cluster using a tool such as completebulkload. Each file is loaded into one specific region.
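The point that each prepared file belongs to exactly one region can be illustrated with a small sketch. This is not the HBase bulk load tooling; the function name, the record format, and the split points are assumptions made for illustration. It assumes a region covers the keys from its start key (inclusive) up to the next split point.

```python
import bisect
from collections import defaultdict

def partition_by_region(records, split_points):
    """Group (row_key, value) records into one sorted 'file' per region.

    split_points is a sorted list of region start keys (hypothetical).
    Region 0 holds keys below the first split point.
    """
    files = defaultdict(list)
    for row_key, value in sorted(records):
        # bisect_right places a key equal to a split point in the region it starts.
        region = bisect.bisect_right(split_points, row_key)
        files[region].append((row_key, value))
    return dict(files)
```

For example, with a single split point "m", keys "a", "m", and "z" would end up in two files: one for the region below "m" and one for the region starting at "m".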
10. What are the best reasons to choose HBase as a DBMS for Hadoop?
Ans: Several properties of Apache HBase make it a good choice as a DBMS for Hadoop. Here are some of them:
- It handles large tables with billions of rows and columns.
- Users can read and write data in real time.
- It is compatible with Hadoop, and both are Java based.
- It offers broad support for CRUD operations.
11. How does HBase version data?
Ans: When a piece of data is inserted, updated, or deleted, HBase creates a new version for that column. Actual deletion happens only during compaction. If a cell exceeds the allowed number of versions, the extra versions are dropped during compaction.
12. Consider a few scenarios when you would choose HBase?
Ans: HBase is generally chosen when an entire database needs to be migrated to a scalable store, or when the data operations are too large to handle elsewhere. It is not a good fit for workloads that rely on frequent inner joins or multi-row transactions: HBase has no native join support and guarantees atomicity only at the row level, so such workloads are better served by a relational database.
13. What is BlockCache?
Ans: The BlockCache is the read cache in HBase. It keeps the most frequently used data in memory (by default on the JVM heap), so that data from HFiles can be served without reading from disk. There is a single BlockCache instance per RegionServer, shared by all the regions and column families it hosts. Each entry in the BlockCache is a block, the unit of data, and an HFile is a sequence of blocks with an index over those blocks.
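The idea of serving repeated block reads from memory can be sketched with a toy least-recently-used cache. This is only an illustration under simple assumptions, not HBase's actual cache implementation; the class name, the (file name, offset) key, and the loader callback are all hypothetical.

```python
from collections import OrderedDict

class ToyBlockCache:
    """Hypothetical LRU cache keyed by (hfile_name, block_offset)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.blocks = OrderedDict()
        self.disk_reads = 0

    def read_block(self, hfile_name, offset, load_from_disk):
        key = (hfile_name, offset)
        if key in self.blocks:
            self.blocks.move_to_end(key)      # mark as most recently used
            return self.blocks[key]
        self.disk_reads += 1                  # cache miss: go to "disk"
        block = load_from_disk(hfile_name, offset)
        self.blocks[key] = block
        if len(self.blocks) > self.capacity:
            self.blocks.popitem(last=False)   # evict least recently used
        return block
```

Reading the same block twice costs one disk read; once the cache is over capacity, the block that has gone unused the longest is evicted and must be reloaded on its next access.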
14. Compare HBase vs HDFS?
Ans: HBase: It is built on top of HDFS.
HDFS: It is suited to storing large files.
- HBase: It offers fast lookups of individual records in large tables.
- HDFS: It does not offer fast lookups of individual records.
- HBase: It provides low-latency access.
- HDFS: It is geared to high-latency batch processing.
15. What are the Pros of HBase?
Ans: HBase has various advantages, such as:
Large data sets:
It can easily store and handle large data sets on top of HDFS file storage.
HBase shines at the scale where relational databases break down.
Reading and processing data in HBase takes little time.
Failover support and load sharing:
HDFS is internally distributed and recovers automatically, and because HBase runs on top of HDFS, its data is recovered automatically as well. RegionServer replication adds to this failover capability.
It supports both linear and modular scalability.
16. What is a cell in HBase?
Ans: A cell is the smallest unit of an HBase table. It holds a piece of data and is identified by the tuple (row, column, version).
17. What is HMaster?
Ans: The HMaster is the master server responsible for monitoring all RegionServer instances in the cluster, and it is the interface for all metadata changes. In a distributed cluster, it typically runs on the NameNode.
18. What is an HBase Store?
Ans: An HBase Store hosts a MemStore and zero or more StoreFiles (HFiles). A Store corresponds to a column family for a table for a given region.
19. What are the approaches to avoid hotspotting?
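Ans: Hotspotting occurs when a large share of the traffic hits one region, typically because row keys are sequential (for example, timestamps). Common approaches to avoid it are salting the row key with a prefix, hashing the key, or reversing it, so that writes spread across regions. A related row key design concern is size: in HBase, values are always freighted with their coordinates, so as a cell value passes through the system it is accompanied by its row, column name, and timestamp. If the row and column names are large, especially compared to the size of the cell value, the indices kept on HBase StoreFiles (HFiles) to facilitate random access may end up occupying a larger share of the RAM allotted to HBase than the data itself.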
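One common way to spread sequential row keys across regions is to prefix each key with a bucket number. The sketch below derives the bucket from a hash of the key, so a reader can recompute the prefix for a point lookup. The function name, the two-digit prefix layout, and the bucket count are assumptions for illustration, not an HBase API.

```python
import hashlib

def salted_row_key(row_key: bytes, buckets: int) -> bytes:
    """Prefix the key with a bucket derived from its hash (hypothetical layout)."""
    digest = hashlib.md5(row_key).digest()
    bucket = digest[0] % buckets
    # Two-digit prefix, e.g. b"05-", so keys sort by bucket first.
    return b"%02d-" % bucket + row_key
```

The trade-off is that a range scan over the original key order now has to be issued once per bucket, so the bucket count is usually kept small.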
20. What are tombstone markers in HBase, and how many are there?
Ans: When a user deletes a cell in an HBase table, the cell becomes invisible but remains on the server, hidden by a marker known as a tombstone marker. The tombstone markers, together with the data they hide, are removed during major compaction.
There are three tombstone markers:
- Version delete
- Family delete
- Column delete
21. What is the column family? What is its purpose in HBase?
Ans: A column family is a grouping of columns in an HBase table that represents a logical division of the data. It affects how the data is physically stored in HDFS, since the columns of a family are stored together. Every HBase table must have at least one column family.
22. How is data written into HBase?
Ans: Data is written into HBase in several steps. First, when the user updates data in an HBase table, an entry is made in a commit log, known in HBase as the write-ahead log (WAL). Next, the data is stored in the in-memory MemStore. When the data in memory exceeds the configured maximum, it is flushed to disk as an HFile. Once the data has been flushed, the corresponding commit log entries are no longer needed and can be discarded.
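The write steps described above can be sketched as a toy in-memory model. This is a simplification under stated assumptions, not HBase code: the class name is hypothetical, the flush threshold counts rows rather than bytes, and the "HFile" is just a sorted list.

```python
class ToyStore:
    """Hypothetical single-store write path: WAL -> MemStore -> HFile."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold
        self.wal = []        # append-only log of edits not yet flushed
        self.memstore = {}   # in-memory edits
        self.hfiles = []     # immutable sorted files "on disk"

    def put(self, row, value):
        self.wal.append((row, value))   # 1. log the edit first
        self.memstore[row] = value      # 2. then apply it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # 3. write the MemStore contents out as one sorted file
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()                # flushed edits no longer need the log
```

If the process crashed before a flush, the entries still in `wal` are what a recovery step would replay to rebuild the MemStore.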
23. When would you not want to use HBase?
Ans: When the data access patterns are sequential over immutable data.
- When data is not large.
- When we can use an alternative such as Hive.
- When we really require a relational query engine or a normalized schema.
24. What is HBase Bloom Filter?
Ans: An HBase Bloom filter is an efficient mechanism for testing whether a store file contains a specific row or row-column cell. (When something is written to HBase, it is first written to an in-memory store, the MemStore; once the MemStore reaches a certain size, it is flushed to disk as a store file.) Normally, the only way to decide whether a row key is present in a store file is to check the file's block index, which holds the start row key of each block in the store file. A Bloom filter is an in-memory data structure that reduces disk reads to only the files likely to contain that row, rather than all store files. It acts like an in-memory index that indicates the probability of finding a row in a particular store file.
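A minimal Bloom filter sketch makes the "likely to contain" behavior concrete. It is a generic textbook structure, not HBase's implementation; the class name, bit count, and use of SHA-256 to derive positions are assumptions for illustration.

```python
import hashlib

class ToyBloomFilter:
    """Generic Bloom filter: no false negatives, occasional false positives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)   # one byte per bit, for simplicity

    def _positions(self, key: bytes):
        # Derive num_hashes positions by hashing the key with a varying prefix.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key: bytes):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))
```

A `False` answer means the row key is definitely not in the file, so the file can be skipped without a disk read. A `True` answer only means the file is worth checking.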
25. What are HLog and HFile?
Ans: HLog is the write-ahead log (WAL) file, and HFile is the actual data storage file. Data is first written to the write-ahead log and then to the MemStore. Once the MemStore is full, its contents are flushed to disk into HFiles.
26. Can you directly delete a cell from HBase?
Ans: No, in most cases this is not possible. When a user deletes a cell, it becomes invisible but remains on the server, masked by a tombstone marker. The data is physically removed later, during compaction.
27. What would be the best reasons to prefer HBase as the DBMS, according to you?
Ans: One of the best things about HBase is that it scales in both linear and modular fashion. It can handle a very large number of tables, and it has broad support for all the CRUD operations. It can store large volumes of data and manage them simply. Its storage is column oriented, and tables can hold a very large number of rows and columns.
28. How can I troubleshoot my HBase cluster?
Ans: Always start with the master log. Normally it just prints the same lines over and over again. If not, then there is an issue. Google or search-hadoop.com should return some hits for the exceptions you are seeing.
An error rarely comes alone in Apache HBase. Usually, when something goes wrong, what follows may be hundreds of exceptions and stack traces coming from all over the place. The best way to approach this type of problem is to walk the log up to where it all began. For example, one trick with RegionServers is that they print some metrics when aborting, so grepping for Dump should get you to around the start of the problem.
RegionServer suicides are 'normal', as this is what they do when something goes wrong. For example, if ulimit and max transfer threads (the two most important initial settings, see ulimit and dfs.datanode.max.transfer.threads) are not changed, at some point it becomes impossible for DataNodes to create new threads, which from the HBase point of view looks as if HDFS were gone. Think about what would happen if your MySQL database were suddenly unable to access files on your local file system; it is the same with HBase and HDFS.
Another very common reason for RegionServers to shut themselves down is prolonged garbage collection pauses that last longer than the default ZooKeeper session timeout.
29. What is the Hierarchy of Tables in Apache HBase?
Ans: The hierarchy for tables in HBase is as follows:
When a table is created, one or more column families are defined as high-level categories for storing data corresponding to an entry in the table. As is suggested by HBase being "column-oriented", column family data for all table entries, or rows, are stored together.
For a given (row, column family) combination, multiple columns can be written at the time the data is written. Therefore, two rows in an HBase table need not share the same columns, only the same column families.
For each (row, column family, column) combination, HBase can store multiple cells, with each cell associated with a version, or timestamp, corresponding to when the data was written. HBase clients can choose to read only the most recent version of a given cell, or to read all versions.
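The hierarchy described above (row, column family, column, version) can be pictured as nested maps. The sketch below is only a mental model in plain Python, not how HBase stores data; the function names and the sample coordinates are hypothetical.

```python
from collections import defaultdict

# row -> column family -> column qualifier -> {timestamp: value}
table = defaultdict(lambda: defaultdict(lambda: defaultdict(dict)))

def put(row, family, qualifier, timestamp, value):
    table[row][family][qualifier][timestamp] = value

def get_latest(row, family, qualifier):
    """Return the value with the highest timestamp, or None if absent."""
    versions = table[row][family][qualifier]
    return versions[max(versions)] if versions else None

def get_all_versions(row, family, qualifier):
    """Return (timestamp, value) pairs, newest first."""
    return sorted(table[row][family][qualifier].items(), reverse=True)
```

Two rows can hold entirely different qualifiers under the same family, which is the sense in which rows share column families but not necessarily columns.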
30. What is S3?
Ans: S3 stands for Simple Storage Service, and it is one of the file systems that HBase can use.
31. Define MapReduce?
Ans: MapReduce as a process was designed to solve the problem of processing in excess of terabytes of data in a scalable way.
32. What are the features of Apache HBase?
Ans: Linear and modular scalability.
- Strictly consistent reads and writes.
- Automatic and configurable sharding of tables
- Automatic failover support between RegionServers.
- Convenient base classes for backing Hadoop MapReduce jobs with Apache HBase tables.
- Easy to use Java API for client access.
- Block cache and Bloom Filters for real-time queries.
- Query predicate push down via server side Filters
- Thrift gateway and a RESTful Web service that supports XML, Protobuf, and binary data encoding options.
33. How many tombstone markers are there in HBase? Name them.
Ans: There are three tombstone markers in total: Version delete, Family delete, and Column delete.
34. How can you say that HBase is capable of offering high availability?
Ans: HBase has a feature known as region replication, in which each region of a table has several replicas. The HBase load balancer makes sure that replicas of the same region are not hosted on the same RegionServer. This keeps a region readable when one server fails, which is what gives HBase high availability.
35. Is HBase OS independent?
Ans: Yes, it is independent of the operating system, and users are free to run it on Windows, Linux, Unix, and so on. The only basic requirement is that Java is installed.
36. What is standalone mode in HBase?
Ans: Standalone mode is used when HBase does not need HDFS. It is the default mode, and users are free to use it at any time. In this mode HBase uses the local file system instead of HDFS, and all HBase daemons run in a single JVM. Because there is no cluster to set up, it saves time when trying things out, which makes it suitable for development and testing rather than production.
37. Define TTL in HBase?
Ans: TTL (time to live) is a data retention technique. Users can keep the versions of a cell for a defined time period; once that period has elapsed, they are deleted automatically.
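The retention rule can be sketched as a simple age check. This is an illustration only: the function name and the (timestamp, value) cell format are assumptions, and timestamps are plain seconds here for readability.

```python
def live_cells(cells, ttl_seconds, now):
    """Keep cells whose age is within the TTL; cells are (timestamp, value)."""
    return [(ts, value) for ts, value in cells if now - ts <= ttl_seconds]
```

With a TTL of 60 and the current time at 200, a cell written at 100 has expired, while one written at 190 is still returned.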
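38. What are the different compaction types in HBase?
Ans: There are two types of compaction: minor and major. In a minor compaction, adjacent small HFiles are merged into a single HFile without removing deleted cells or expired versions. The files to be merged are selected by the compaction policy rather than at random. In a major compaction, all the HFiles of a store are merged into one, and deleted and expired cells are dropped.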
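The merging step of a minor compaction can be sketched as a merge of already sorted files. This is a simplified model, not HBase code; the function name and the (row, timestamp, value) cell format are assumptions. Every cell is kept, which mirrors the point that a minor compaction merges files without removing deleted data.

```python
import heapq

def minor_compact(hfiles):
    """Merge sorted lists of (row, timestamp, value) cells into one sorted list.

    Each input must already be ordered by row ascending and, within a row,
    by newest timestamp first.
    """
    return list(heapq.merge(*hfiles, key=lambda cell: (cell[0], -cell[1])))
```

Because the inputs are already sorted, the merge streams through them once instead of re-sorting everything.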
39. What is a rowkey in HBase?
Ans: Each row in HBase is identified by a unique byte array called the row key.
40. What is HRegionServer in HBase?
Ans: HRegionServer is the RegionServer implementation. It is responsible for serving and managing regions. In a distributed cluster, a RegionServer typically runs on a DataNode.
41. When do we do manual region splitting?
Ans: Manual region splitting is done when we have an unexpected hotspot in a table because many clients are querying the same region of the table.
42. Why do we pre-create empty regions?
Ans: Tables in HBase are initially created with one region by default. During a bulk import, all clients therefore write to the same region until it is large enough to split and become distributed across the cluster. Pre-creating empty regions lets the writes spread across the cluster from the start, which makes the process faster.
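Pre-creating regions requires choosing split keys in advance. The sketch below computes evenly spaced split points, assuming the row keys are uniformly distributed fixed-width hex strings (for example, hash prefixes). The helper name and the key width are hypothetical; the resulting keys would then be supplied when the table is created.

```python
def hex_split_points(num_regions, key_width=8):
    """Evenly spaced split keys over a hex keyspace (hypothetical helper)."""
    keyspace = 16 ** key_width
    step = keyspace // num_regions
    # num_regions regions need num_regions - 1 boundaries between them.
    return [format(i * step, "0%dx" % key_width) for i in range(1, num_regions)]
```

For four regions over two-character hex keys, this yields three boundaries that cut the keyspace into quarters. If the real keys are not uniformly distributed, evenly spaced boundaries will not balance the load and the split points should be derived from a sample of the data instead.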
43. What is Row Key?
Ans: The row key is a unique identifier in an HBase table that logically groups the table's cells, ensuring that all cells with the same row key are placed on the same server. Internally, the row key is a byte array.
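Because row keys are byte arrays and HBase keeps rows sorted in byte order, the way a value is encoded affects where it sorts. The short sketch below contrasts text-encoded numbers with a fixed-width big-endian encoding; the helper name and the width of 8 bytes are assumptions for illustration.

```python
def encode_id(n: int, width: int = 8) -> bytes:
    """Fixed-width big-endian encoding keeps numeric order and byte order aligned."""
    return n.to_bytes(width, "big")

# As text, b"10" sorts before b"2" because bytes are compared left to right.
text_keys = sorted([b"10", b"2", b"1"])

# As fixed-width binary, the byte order matches the numeric order.
binary_keys = sorted(encode_id(n) for n in (10, 2, 1))
```

This matters for scans: a range scan over numeric IDs only returns them in numeric order if the key encoding preserves that order.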
44. Mention a few scenarios when you should consider HBase as a database?
Ans: Here are a few scenarios:
- When we need to shift an entire database
- To handle large data operations
- When we do not need frequent inner joins, since HBase has no native join support
- When frequent multi-row transactions are not a need, since HBase guarantees atomicity only at the row level
- When we need to handle a variable schema
- When an application demands key-based data retrieval
- When the data is in the form of collections
45. What is HFile?
Ans: The HFile is HBase's underlying storage format. Each HFile belongs to one column family, whereas a column family may have multiple HFiles. A single HFile never contains data for multiple column families.
46. What are the data manipulation commands of HBase?
Ans: Data Manipulation commands of HBase are:
put: Puts a cell value at a specified column in a specified row in a particular table.
get: Fetches the contents of a row or a cell.
delete: Deletes a cell value in a table.
deleteall: Deletes all the cells in a given row.
scan: Scans and returns the table data.
count: Counts and returns the number of rows in a table.
truncate: Disables, drops, and recreates a specified table.
47. Explain the data model of HBase?
Ans: HBase comprises:
Set of tables.
Each table consists of column families and rows.
Row key acts as a Primary key in HBase.
Any access to HBase tables uses this Primary Key.
Each column qualifier present in HBase denotes attributes corresponding to the object which resides in the cell.
48. What happens when you issue a delete command in HBase?
Ans: Once you issue a delete command in HBase for a cell, column, or column family, it is not deleted instantly. A tombstone marker is inserted. A tombstone is a special piece of data stored alongside the standard data, and it hides all the deleted data.
The actual data is deleted at the time of major compaction. In major compaction, HBase merges and rewrites the smaller HFiles of a region into a new HFile. In this process, data from the same column family is placed together in the new HFile, and deleted and expired cells are dropped. Until then, the results of scan and get operations filter out the deleted cells.
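The delete behavior can be sketched with a toy model in which each store file is a dictionary and a delete writes a marker instead of removing data. This is a simplification, not HBase code: the names are hypothetical, files hold one value per row, and versions and timestamps are left out.

```python
TOMBSTONE = object()  # sentinel standing in for a delete marker

def delete(store_files, row):
    """Deleting appends a marker in a new file; older data is left untouched."""
    store_files.append({row: TOMBSTONE})

def read(store_files, row):
    """The newest file wins, so a tombstone hides older values."""
    for f in reversed(store_files):
        if row in f:
            return None if f[row] is TOMBSTONE else f[row]
    return None

def major_compact(store_files):
    """Merge all files into one and drop tombstoned rows for good."""
    merged = {}
    for f in store_files:          # oldest to newest, so newer entries win
        merged.update(f)
    return [{r: v for r, v in merged.items() if v is not TOMBSTONE}]
```

After a delete, the old value is still present in the first file but no longer visible to reads; only the compaction step removes it physically.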
49. What is Nagios?
Ans: Nagios is a very commonly used support tool for gaining qualitative data regarding cluster status. It polls current metrics on a regular basis and compares them with given thresholds.
50. Which filter accepts the page size as a parameter in HBase?
Ans: PageFilter accepts the page size as its parameter. It is an implementation of the Filter interface that limits results to a specific page size. It terminates scanning once the number of rows that pass the filter exceeds the given page size.
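The paging idea can be sketched as a scan that stops early once enough rows have been collected. This is a plain Python illustration of the concept, not the PageFilter API; the function name and the sorted list of (key, value) rows are assumptions.

```python
def scan_page(rows, start_row, page_size):
    """Return up to page_size rows with key >= start_row from sorted rows."""
    page = []
    for key, value in rows:
        if key < start_row:
            continue
        if len(page) >= page_size:
            break              # stop scanning once the page is full
        page.append((key, value))
    return page
```

To fetch the next page, a caller would typically start the next scan just after the last row key it received.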
51. Define the difference between Hive and HBase?
Ans: Apache Hive is a data warehousing infrastructure built on top of Hadoop. It helps in querying data stored in HDFS for analysis using Hive Query Language (HQL), a SQL-like language that is translated into MapReduce jobs. Hive performs batch processing on Hadoop. Apache HBase is a NoSQL key/value store that runs on top of HDFS. Unlike Hive, HBase operations run in real time on its database rather than as MapReduce jobs. HBase partitions the tables, and the tables are further split into column families.
Hive and HBase are two different Hadoop based technologies – Hive is an SQL-like engine that runs MapReduce jobs, and HBase is a NoSQL key/value database of Hadoop. We can use them together. Hive can be used for analytical queries while HBase for real-time querying. Data can even be read and written from HBase to Hive and vice-versa.
52. What do you know about an HFile, and what is it related to in HBase?
Ans: It is the defined storage format for HBase, and it is related to a column family. There is no strict upper limit on the number of HFiles in a column family. However, a single HFile stores data for only one column family, so data belonging to different families goes into different HFiles.
53. Define Thrift?
Ans: Apache Thrift is written in C++, but it offers schema compilers for many programming languages, including Java, C++, Perl, PHP, Python, Ruby, and more.
54. Can you explain data versioning?
Ans: In addition to being a schema-less database, HBase is also versioned.
Every time you perform an operation on a cell, HBase implicitly stores a new version. Creating, modifying, and deleting a cell are all treated identically: each produces a new version. When a cell exceeds the maximum number of versions, the extra records are dropped during major compaction.
Instead of deleting an entire cell, you can operate on a specific version within that cell. Values within a cell are versioned, and each version is identified by its timestamp. If no version is specified, a write is stamped with the current time and a read returns the most recent version. The default maximum number of versions was three in older releases and has been one since HBase 0.96.
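The effect of a maximum version count can be sketched with a small helper that keeps only the newest versions of a cell. This is an illustration, not HBase code: the function name is hypothetical, the pruning happens immediately rather than at compaction time, and `max_versions` is assumed to be at least 1.

```python
def put_version(versions, timestamp, value, max_versions):
    """Add a version, then keep only the newest max_versions entries.

    versions maps timestamp -> value for a single cell.
    """
    versions[timestamp] = value
    # Everything except the newest max_versions timestamps is dropped.
    for ts in sorted(versions)[:-max_versions]:
        del versions[ts]
    return versions
```

Writing four versions with a limit of three leaves the three newest timestamps in place, and the latest one is what a default read would return.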
55. What is HBaseFsck class?
Ans: HBase provides a tool named hbck, which is implemented by the HBaseFsck class. It offers several command-line switches that influence its behavior.
56. What is a Bloom filter and how does it help in searching rows?
Ans: HBase supports Bloom filters to improve the overall throughput of the cluster. An HBase Bloom filter is a space-efficient mechanism for testing whether an HFile contains a specific row or row-column cell. Without a Bloom filter, the only way to decide whether a row key is present in an HFile is to check the HFile's block index, which stores the start row key of each block in the HFile. Many rows fall between two start keys, so HBase has to load the block and scan its keys to find out whether that row key actually exists.
57. Which method is used to access HFile directly without using HBase?
Ans: To access an HFile directly without using HBase, we use the HFile.main() method.