
HBase

  • An open-source version of Google BigTable

  • HBase is a distributed column-oriented data store built on top of HDFS

  • HBase is an Apache open source project whose goal is to provide storage with read-write capabilities for Hadoop

  • Data is logically organized into tables, rows and columns

HDFS and HBase

Both are distributed systems that scale to hundreds or thousands of nodes

However

HDFS is good for batch processing, scans over big files

  • Not good for record lookup
  • Not good for incremental addition of small batches
  • Not good for updates

HBase is designed to efficiently address the above points

  • Fast record lookup
  • Support for record-level insertion
  • Support for updates (not in place)

HBase updates are done by creating new versions of values
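A minimal sketch of what "updates create new versions" means, using a toy in-memory cell rather than the real HBase client API. The class name `VersionedCell` and the sample values are hypothetical and only serve the illustration.

```python
class VersionedCell:
    """Toy model of one cell: every write adds a version, nothing is overwritten."""

    def __init__(self):
        self._versions = {}  # timestamp (int) -> value (bytes)

    def put(self, value: bytes, timestamp: int) -> None:
        # An "update" is just one more (timestamp, value) entry.
        self._versions[timestamp] = value

    def get(self, max_versions: int = 1):
        # Return the newest versions first.
        newest_first = sorted(self._versions.items(), reverse=True)
        return newest_first[:max_versions]


cell = VersionedCell()
cell.put(b"alice@old.example", timestamp=1)
cell.put(b"alice@new.example", timestamp=2)  # the "update"

latest = cell.get()                 # only the newest version
history = cell.get(max_versions=2)  # the older value is still there
```

Reading with `max_versions=1` behaves like an in-place update from the caller's point of view, while the older value remains available until it is cleaned up.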

HBase Data Model

  • Key-Value Pairs

  • Data is all byte[] in HBase

  • Implicit Primary key in RDBMS terms

  • Different rows may have different sets of columns

  • A single cell may have different values at different timestamps

  • Each row has a key

  • Each record is divided into column families

  • Each column family consists of one or more columns

Key

  • Byte array
  • Serves as the primary key for the table
  • Indexed for fast lookup

Column Family

  • Has a name (string)
  • Contains one or more related columns

Column

  • Belongs to one column family
  • Included inside the row as familyName:columnName

Version Number

  • Unique within each key
  • By default, the system's timestamp
  • Data type is Long

Value (Cell)

  • Byte array
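One way to picture the logical model is as a nested map: row key → familyName:columnName → version → value. The sketch below uses plain Python dictionaries and is not the HBase API. The helpers `put` and `get` and the sample rows are hypothetical.

```python
def put(table, row, family, qualifier, value, ts):
    """Store value at the coordinates (row, family:qualifier, ts)."""
    column = family + b":" + qualifier
    table.setdefault(row, {}).setdefault(column, {})[ts] = value


def get(table, row, family, qualifier):
    """Return the value with the highest version number, or None if absent."""
    versions = table.get(row, {}).get(family + b":" + qualifier, {})
    return versions[max(versions)] if versions else None


table = {}  # everything (keys, column names, values) is bytes
put(table, b"user1", b"info", b"name", b"Ada", ts=1)
put(table, b"user1", b"info", b"name", b"Ada L.", ts=2)
put(table, b"user2", b"info", b"email", b"bob@example.org", ts=1)

name = get(table, b"user1", b"info", b"name")    # newest version wins
missing = get(table, b"user2", b"info", b"name")  # user2 has no such column
```

Note that `user1` and `user2` carry different columns inside the same column family, and that a missing cell simply has no entry.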

HBase schema consists of several Tables

Each table consists of a set of Column Families

  • Columns are not part of the schema

HBase has Dynamic Columns

  • Because column names are encoded inside the cells
  • Different cells can have different columns

HBase physical model

  • Each column family is stored in its own set of files (called HFiles)

  • Key & Version numbers are replicated with each column family

  • Empty cells are not stored
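The three points above can be sketched as a transformation from logical rows to one sorted list of entries per column family. This is a simplified illustration, not the actual on-disk format. The function `to_family_stores` and the sample data are hypothetical.

```python
def to_family_stores(rows):
    """rows: {row_key: {(family, qualifier): {timestamp: value}}}

    Returns {family: [(row_key, qualifier, timestamp, value), ...]}.
    """
    stores = {}
    for row_key, columns in rows.items():
        for (family, qualifier), versions in columns.items():
            for ts, value in versions.items():
                # The row key and version are repeated in every stored entry.
                stores.setdefault(family, []).append((row_key, qualifier, ts, value))
    for entries in stores.values():
        # Sort by row key, then column, then newest version first.
        entries.sort(key=lambda e: (e[0], e[1], -e[2]))
    return stores


rows = {
    b"row1": {
        (b"info", b"name"): {1: b"Ada"},
        (b"stats", b"visits"): {1: b"3", 2: b"4"},
    },
    # row2 has no "stats" cells, so nothing is stored for them.
    b"row2": {(b"info", b"name"): {1: b"Bob"}},
}

stores = to_family_stores(rows)
```

Because absent cells produce no entries at all, sparse rows cost nothing beyond the cells they actually contain.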

HBase Regions

Each table is partitioned horizontally, by row-key range, into regions; within a region, each column family is stored separately

Regions are the counterpart of HDFS blocks

HBase components

Region

  • A subset of a table’s rows, like horizontal range partitioning
  • Automatically done

RegionServer (many slaves)

  • Manages data regions
  • Serves data for reads and writes (using a log)

Master

  • Responsible for coordinating the slaves
  • Assigns regions, detects failures
  • Admin functions
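To illustrate range partitioning into regions, here is a small sketch that maps a row key to the region whose key range contains it, and then to the server that the master assigned it to. The boundaries, server names, and the function `locate_region` are all hypothetical. A real cluster keeps this mapping in its own metadata.

```python
import bisect

# Each region covers [start_key, next_start_key); the first starts at b"".
region_start_keys = [b"", b"g", b"n", b"t"]
# Assignment of regions to RegionServers, as decided by the master.
region_servers = ["rs1", "rs2", "rs1", "rs3"]


def locate_region(row_key: bytes) -> int:
    """Index of the region whose range contains row_key."""
    return bisect.bisect_right(region_start_keys, row_key) - 1


def locate_server(row_key: bytes) -> str:
    return region_servers[locate_region(row_key)]


first = locate_region(b"alice")   # falls in [b"", b"g")
boundary = locate_region(b"g")    # a start key belongs to its own region
last_server = locate_server(b"zoe")
```

Because rows are sorted by key, a lookup is a binary search over the region boundaries, and a scan over a key range touches only the regions that overlap it.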

MongoDB

Started in 2007

Targeting semi-structured data in JSON

Designed to be easy to “scale out”

Good support for indexing, partitioning, replication

ACID transactions

Nice integration in Web development stacks

Not-so-great support for joins (or complex queries)

Advantages of the document model

  • Documents are flexible. You can modify your schema at any time, allowing you to continuously integrate new application functionality, without wrestling with complex schema migrations.

  • Documents are polymorphic: Documents in a collection can have different structures compared to other documents in the same collection.

  • Documents are extensible: You can model data in any way your application demands it – from rich, hierarchical documents through to flat, table-like structures, simple key-value pairs, text, geospatial data, and the nodes and edges used in graph processing.
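A small sketch of these properties using plain Python dictionaries to stand in for JSON documents. It does not use a MongoDB driver. The collection contents and the helper `find` are hypothetical.

```python
# One "collection" holding documents with different shapes (polymorphic).
collection = [
    {"_id": 1, "name": "Ada", "email": "ada@example.org"},
    {"_id": 2, "name": "Bob", "address": {"city": "Paris", "zip": "75001"}},
    {"_id": 3, "name": "Eve", "tags": ["admin", "ops"]},
]

# Flexible: a new field is added to one document without any schema migration.
collection[0]["phone"] = "555-0100"


def find(docs, **criteria):
    """Return the documents whose top-level fields match all criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


bob = find(collection, name="Bob")[0]
city = bob["address"]["city"]          # nested (hierarchical) data
with_phone = [d["_id"] for d in collection if "phone" in d]
```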

Scalability

  • Through sharding, you can automatically scale your database out across multiple nodes to handle write-intensive workloads and growing data sizes

  • Hashed Sharding. Documents are distributed according to an MD5 hash of the shard key value. This gives a near-uniform distribution of writes across shards, provided the shard key takes many distinct values.

  • Ranged Sharding.

    • Documents are partitioned across shards according to the shard key value.
    • Well suited for applications that need to optimize range based queries, such as co-locating data for all customers in a specific region on a specific shard.
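The two strategies can be contrasted with a short sketch. It is a simplification under stated assumptions: the hash is reduced with a modulo, whereas a real deployment assigns ranges of hash values to shards, and the shard count, range boundaries, and function names are hypothetical.

```python
import bisect
import hashlib

NUM_SHARDS = 4


def hashed_shard(shard_key: str) -> int:
    """Pick a shard from an MD5 hash of the shard key (simplified)."""
    digest = hashlib.md5(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Ranged sharding: shard 0 holds keys < "h", shard 1 holds ["h", "p"),
# shard 2 holds keys >= "p".
RANGE_BOUNDARIES = ["h", "p"]


def ranged_shard(shard_key: str) -> int:
    """Pick the shard whose key range contains the shard key."""
    return bisect.bisect_right(RANGE_BOUNDARIES, shard_key)


# Hashing spreads consecutive keys over all shards.
used = {hashed_shard(f"user{i}") for i in range(1000)}

# Ranges keep neighbouring keys together.
same_shard = ranged_shard("france") == ranged_shard("germany")
```

With ranged sharding, a query over a contiguous key range can be served by few shards, while hashed sharding trades that locality for an even spread of writes.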

MongoDB vs HBase

[Image: MongoDB vs HBase comparison (screenshot, 2020-12-15)]