Basic Understanding of HBase

In this post, I have collected some bullet points covering the frequently used terms in HBase. Together they should give you an overall idea of HBase.

  • HBase is a NoSQL, column-oriented database. Columns in an HBase table are grouped into column families, and a table contains one or more column families. Each row in a table is identified by a row key. Each column is identified by a qualified name, i.e. Column Family Name + Column Name.
  • Row keys are unique in a Table and are always treated as a byte[].
  • There’s no design-time way to specify row keys; the client supplies the row key with each write.
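  • HBase is an open-source implementation of Google’s Bigtable architecture.
  • HBase is schema-less: column families are defined when the table is created, but the columns inside a family are not declared in advance and can differ from row to row.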
  • HBase tables are broken up into horizontal partitions called regions. Each region is a subset of the table’s data, holding a contiguous range of rows, and regions are distributed across the HDFS DataNodes.
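
To make the idea of regions concrete, here is a minimal sketch in plain Python (it does not use HBase itself). It assumes each region owns a contiguous, sorted range of row keys; the split points and region names are made up for illustration.

```python
import bisect

# Hypothetical split points: each region covers [start_key, next_start_key).
# The first region starts at the empty key, the lowest possible row key.
REGION_START_KEYS = [b"", b"g", b"n", b"t"]
REGION_NAMES = ["region-1", "region-2", "region-3", "region-4"]


def find_region(row_key: bytes) -> str:
    """Return the (made-up) region whose key range contains row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position owns the key.
    index = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[index]


for key in [b"apple", b"grape", b"melon", b"zebra"]:
    print(key, "->", find_region(key))
```

Because row keys are compared as raw bytes, keys that are close together in byte order end up in the same region.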

  • Regions are managed by Region Servers. Region Servers are collocated with the HDFS DataNodes, which enables data locality (putting the data close to where it is needed) for the data served by the Region Servers. One Region Server manages one or more regions.
  • Each column family is stored in its own HFiles (data files) in HDFS. HBase uses multiple HFiles per column family.
  • HBase maintains two in-memory caches: the “memory store” and the “block cache”.
  • The memory store, implemented as the MemStore, accumulates data changes/edits as they’re received, buffering them in memory before they are written to the store files (HFiles). There is one MemStore per column family in each region. When a MemStore accumulates enough data (reaches its max size), it is flushed to disk as a new store file (HFile), so a column family builds up multiple small HFiles over time.
  • The block cache, implemented as the BlockCache, keeps data blocks resident in memory after they’re read from HDFS, until they are evicted to make room for newer blocks. The size of the cache is configurable.
  • The memory store also sorts records in memory, so data is already in row-key order when it is written to disk.
  • An HBase read gathers the up-to-date record from three common sources: existing records in HFiles (data files) that are not present in memory, the MemStore (memory store), which holds recent changes, and the block cache, which holds recently read data.
  • The HLog, aka Write-Ahead Log (WAL), keeps an edit/change log of data for all regions. If a RegionServer crashes, the HLog/WAL is used to replay the recent edits for recovery. There is a limit on HLog/WAL size which, when reached, causes the MemStore to flush to a disk file (HFile). Edits that have already been written to HFiles no longer need to be kept in the HLog/WAL. The HLog/WAL is in HDFS in /hbase/.logs/ with subdirectories per region.
  • Any change (Put/Delete) request from a client is first written to the HLog/WAL file by the RegionServer and then to the in-memory MemStore, which is later flushed to an HFile (HDFS data file).
  • Rows in HBase are sorted lexicographically (in byte order, similar to alphabetical order) by row key. However, poorly designed row keys are a common source of hotspotting. Hotspotting occurs when a large amount of client traffic is directed at one node. There are two hotspotting scenarios: hotspotting at read time and hotspotting at write time. Salting the row key (adding a prefix that spreads consecutive keys across regions) helps with both write and read hotspotting. The trade-off is that a read over a key range has to query every salt prefix, but those reads can run in parallel across multiple Region Servers.
  • There is a special HBase table called the META table, which holds the location of the regions in the cluster. The META table is itself an HBase table and resides on one of the RegionServers. ZooKeeper stores the location of the META table.
  • A Cell stores data and is essentially identified by {row key + Column Family + Column Identifier}. The data stored in a Cell is called its value.
  • HBase keeps multiple versions of stored data, and each version is identified by its timestamp. The number of versions retained in a column family is configurable. The default was 3 in older releases and is 1 since HBase 0.96.
  • There is an open-source project, OpenTSDB, which uses the notion of row keys bucketed by time. It is generally used for infrastructure monitoring.
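
The write and read path described in the bullets above (WAL first, then MemStore, then a flush to an HFile) can be sketched as a toy model in plain Python. This is not HBase's real code: the class name, the entry-count flush threshold, and the in-memory "files" are all assumptions made for illustration, and deletes, versions, and the block cache are left out.

```python
class ToyStore:
    """Toy model of one column family in one region (not HBase's real code)."""

    def __init__(self, flush_threshold=3):
        self.wal = []        # stand-in for the HLog/WAL: append-only edit log
        self.memstore = {}   # stand-in for the MemStore: recent edits in memory
        self.hfiles = []     # stand-in for HFiles: sorted, immutable, oldest first
        self.flush_threshold = flush_threshold  # hypothetical max size, in entries

    def put(self, row_key: bytes, value: bytes) -> None:
        self.wal.append((row_key, value))  # 1. log the edit first
        self.memstore[row_key] = value     # 2. then buffer it in memory
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self) -> None:
        # 3. write the buffered edits out in row-key order as a new store file
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore.clear()
        self.wal.clear()  # flushed edits no longer need to be replayed

    def get(self, row_key: bytes):
        # Newest data wins: check the MemStore, then store files newest to oldest.
        if row_key in self.memstore:
            return self.memstore[row_key]
        for hfile in reversed(self.hfiles):
            for key, value in hfile:
                if key == row_key:
                    return value
        return None


store = ToyStore(flush_threshold=2)
store.put(b"row2", b"first")
store.put(b"row1", b"first")   # reaches the threshold and triggers a flush
store.put(b"row1", b"second")  # newer edit stays in the MemStore for now
print(store.get(b"row1"), store.get(b"row2"), len(store.hfiles))
```

In this sketch a read of `row1` is answered from the MemStore, while `row2` has to be found in the flushed file, which mirrors why HBase consults several sources on a read.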

Important Note: Rows in HBase are sorted lexicographically on disk by row key. HBase is optimized for reads when data is queried by row key, so using the row key in your queries is crucial for good query performance. Otherwise, the query ends up as a full table scan, i.e. a scan of millions of records. Note also that HBase doesn’t have indexes on column qualifiers (aka column names), so a scan that does not rely on the row key is not efficient. Finally, poorly designed row keys are a common source of hotspotting, as discussed above. The basics of how an HBase scan/query works are listed below:

  • By row-key or range of row-keys
    – Only touches the Region Servers and Regions that hold those row keys.
    – If no further criteria are provided, it will load all the cells for all of the column families, which means loading multiple data files (HFiles) from HDFS.
  • By Column Family
    – Eliminates the need to load other data files from HDFS. Only the data files for the specified column family are loaded.
  • By Timestamp/Version
    – Can skip scanning an entire data file (HFile), because the timestamp range contained in each file is taken into account.
  • By Column Name/Qualifier
    – Does not reduce the data files that are read, but it does eliminate the need to transfer unwanted data from the Region Server to the client.
  • By Column Value
    – Can skip cells by using filters (some filters are available, and you can also create your own custom filters).
    – The slowest selection criteria.
    – Will examine each cell.
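
The selection criteria above can be compared with a small toy scan in plain Python. This is only a sketch under simplifying assumptions, not the HBase client API: the table contents, the `scan` function, and its parameters are made up for illustration, timestamps are omitted, and each column family is modelled as one sorted in-memory list. The function returns the matching cells together with the number of cells it had to examine.

```python
import bisect

# Toy table: one sorted list of (row_key, qualifier, value) per column family,
# standing in for the separate data files HBase keeps for each family.
STORES = {
    b"info": sorted([
        (b"user1", b"name", b"Ann"),
        (b"user2", b"name", b"Bob"),
        (b"user3", b"name", b"Ann"),
    ]),
    b"stats": sorted([
        (b"user1", b"visits", b"7"),
        (b"user2", b"visits", b"3"),
    ]),
}


def scan(start_row=None, stop_row=None, family=None, qualifier=None, value=None):
    """Return (matching cells, number of cells examined) for a toy scan."""
    # Restricting the family means the other families' data is never touched.
    families = [family] if family is not None else sorted(STORES)
    matches, examined = [], 0
    for fam in families:
        cells = STORES[fam]
        # Row-key range: binary search jumps straight to the start row.
        # The stop row is exclusive.
        lo = 0 if start_row is None else bisect.bisect_left(cells, (start_row,))
        hi = len(cells) if stop_row is None else bisect.bisect_left(cells, (stop_row,))
        for row, qual, val in cells[lo:hi]:
            examined += 1
            # Qualifier and value checks happen per cell, after the cell is read.
            if qualifier is not None and qual != qualifier:
                continue
            if value is not None and val != value:
                continue
            matches.append((row, fam, qual, val))
    return matches, examined


print(scan(start_row=b"user2", stop_row=b"user3"))  # row-key range only
print(scan(family=b"info", value=b"Ann"))           # one family, value filter
print(scan(value=b"Ann"))                           # value filter over everything
```

The row-key range scan examines only the cells in that range, while the value-only scan has to look at every cell in the toy table, which is why selecting by column value is listed as the slowest criterion.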
