HBase¶
- An open-source version of Google BigTable
- HBase is a distributed, column-oriented data store built on top of HDFS
- HBase is an Apache open-source project whose goal is to provide storage with read-write capabilities for Hadoop
- Data is logically organized into tables, rows, and columns
HDFS and HBase¶
Both are distributed systems that scale to hundreds or thousands of nodes
However:
HDFS is good for batch processing and scans over big files
- Not good for record lookup
- Not good for incremental addition of small batches
- Not good for updates
HBase is designed to efficiently address the above points
- Fast record lookup
- Support for record-level insertion
- Support for updates (not in place)
HBase updates are done by creating new versions of values
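As a rough illustration, here is a minimal sketch of this behavior with the HBase Java client: the table name `users`, column family `info`, and qualifier `email` are hypothetical, and the column family is assumed to be configured to keep more than one version (VERSIONS > 1).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedUpdate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            byte[] row = Bytes.toBytes("user42");
            byte[] cf  = Bytes.toBytes("info");
            byte[] col = Bytes.toBytes("email");

            // An "update" is simply another Put on the same cell; HBase stores it
            // as a new version keyed by timestamp rather than overwriting in place.
            table.put(new Put(row).addColumn(cf, col, Bytes.toBytes("old@example.com")));
            table.put(new Put(row).addColumn(cf, col, Bytes.toBytes("new@example.com")));

            // Read back several versions of the same cell (assumes the column
            // family keeps more than one version).
            Get get = new Get(row);
            get.setMaxVersions(3);
            Result result = table.get(get);
            for (Cell cell : result.getColumnCells(cf, col)) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```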
HBase Data Model¶
- Key-value pairs
- Data is all byte[] in HBase
- Implicit primary key (in RDBMS terms)
- Different rows may have different sets of columns
- A single cell may have different values at different timestamps
- Each row has a key
- Each record is divided into column families
- Each column family consists of one or more columns
Key
- Byte array
- Serves as the primary key for the table
- Indexed for fast lookup
Column Family
- Has a name (string)
- Contains one or more related columns
Column
- Belongs to one column family
- Addressed inside a row as familyName:columnName
Version Number
- Unique within each key
- By default, the system's timestamp
- Data type is Long
Value (Cell)
- Byte array
An HBase schema consists of several Tables
Each table consists of a set of Column Families
- Columns are not part of the schema
HBase has Dynamic Columns
- Column names are encoded inside the cells
- Different cells can have different columns
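To make the "schema = tables + column families" point concrete, here is a minimal sketch using the HBase 2.x Admin and Table APIs; the table `products` and the per-row qualifiers `color` and `weight_kg` are hypothetical, and note that the qualifiers are never declared anywhere.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumns {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // The schema only declares the table and its column families;
            // no individual columns are defined here.
            TableName name = TableName.valueOf("products"); // hypothetical table
            TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("details"))
                    .build();
            if (!admin.tableExists(name)) {
                admin.createTable(desc);
            }

            // Different rows can use entirely different column qualifiers,
            // because qualifiers are encoded inside the cells themselves.
            try (Table table = conn.getTable(name)) {
                table.put(new Put(Bytes.toBytes("p1"))
                        .addColumn(Bytes.toBytes("details"), Bytes.toBytes("color"),
                                   Bytes.toBytes("red")));
                table.put(new Put(Bytes.toBytes("p2"))
                        .addColumn(Bytes.toBytes("details"), Bytes.toBytes("weight_kg"),
                                   Bytes.toBytes("1.5")));
            }
        }
    }
}
```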
HBase physical model¶
- Each column family is stored in a separate file (called an HTable)
- Key and version numbers are replicated with each column family
- Empty cells are not stored
HBase Regions¶
Each HTable (column family) is partitioned horizontally into regions
Regions are the counterpart of HDFS blocks
HBase components¶
Region
- A subset of a table’s rows, like horizontal range partitioning
- Done automatically
RegionServer (many slaves)
- Manages data regions
- Serves data for reads and writes (using a log)
Master
- Responsible for coordinating the slaves
- Assigns regions, detects failures
- Admin functions
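Region splitting happens automatically as data grows, but the idea of a region as a row-key range can also be seen when pre-splitting a table at creation time. The sketch below uses the HBase 2.x Admin API; the table name and split keys are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            TableName name = TableName.valueOf("events"); // hypothetical table
            TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build();

            // Split points define the initial row-key ranges; each range becomes a
            // region, and the Master assigns regions to RegionServers. Further
            // splits happen automatically as regions grow.
            byte[][] splitKeys = {
                    Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```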
MongoDB¶
- Started in 2007
- Targeting semi-structured data in JSON
- Designed to be easy to “scale out”
- Good support for indexing, partitioning, replication
- ACID transactions (multi-document transactions since version 4.0)
- Nice integration in web development stacks
- Not-so-great support for joins (or complex queries)
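A minimal sketch of basic usage with the synchronous MongoDB Java driver: inserting a JSON-like document, creating a secondary index, and querying it. The connection string, database `shop`, and collection `orders` are hypothetical.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

import java.util.Arrays;

public class MongoBasics {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");        // hypothetical database
            MongoCollection<Document> orders = db.getCollection("orders");

            // Documents are JSON-like (BSON) and map naturally to application objects.
            orders.insertOne(new Document("customer", "alice")
                    .append("total", 99.50)
                    .append("items", Arrays.asList("book", "pen")));

            // Secondary index to support lookups by customer.
            orders.createIndex(Indexes.ascending("customer"));

            Document first = orders.find(Filters.eq("customer", "alice")).first();
            System.out.println(first.toJson());
        }
    }
}
```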
Advantages of the document model¶
- Documents are flexible: you can modify your schema at any time, allowing you to continuously integrate new application functionality without wrestling with complex schema migrations.
- Documents are polymorphic: documents in a collection can have different structures from other documents in the same collection.
- Documents are extensible: you can model data in any way your application demands, from rich, hierarchical documents to flat, table-like structures, simple key-value pairs, text, geospatial data, and the nodes and edges used in graph processing.
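As a sketch of polymorphism and extensibility, the two inserts below put documents with completely different structures (one flat, one nested) into the same collection; the database, collection, and field names are hypothetical, and the API is the synchronous Java driver.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

public class PolymorphicDocs {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> products =
                    client.getDatabase("shop").getCollection("products"); // hypothetical names

            // Two documents in the same collection with different structures:
            // one is flat and table-like...
            products.insertOne(new Document("sku", "B-100")
                    .append("type", "book")
                    .append("title", "NoSQL Distilled"));

            // ...the other is rich and hierarchical, with nested and array fields.
            products.insertOne(new Document("sku", "L-200")
                    .append("type", "laptop")
                    .append("specs", new Document("cpu", "8-core").append("ram_gb", 16))
                    .append("warehouse", new Document("geo",
                            Arrays.asList(-73.97, 40.77))));
        }
    }
}
```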
Scalability¶
- Through sharding, you can automatically scale your database out across multiple nodes to handle write-intensive workloads and growing data sizes
- Hashed Sharding
  - Documents are distributed according to an MD5 hash of the shard key value
  - Guarantees a uniform distribution of writes across shards
- Ranged Sharding
  - Documents are partitioned across shards according to the shard key value
  - Well suited for applications that need to optimize range-based queries, such as co-locating data for all customers in a specific region on a specific shard
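A sketch of enabling both strategies via the `enableSharding` and `shardCollection` admin commands, assuming a connection to a mongos router of an already deployed sharded cluster; the host, database, collections, and shard key fields are hypothetical.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class ShardingSetup {
    public static void main(String[] args) {
        // Connect to a mongos router of a sharded cluster (hypothetical address).
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            // Enable sharding for the database before sharding its collections.
            client.getDatabase("admin").runCommand(
                    new Document("enableSharding", "shop"));

            // Hashed sharding: documents are distributed by a hash of the shard key,
            // spreading writes uniformly across shards.
            client.getDatabase("admin").runCommand(
                    new Document("shardCollection", "shop.orders")
                            .append("key", new Document("customerId", "hashed")));

            // Ranged sharding: documents are partitioned by shard-key ranges, which
            // keeps nearby key values (e.g. the same region) on the same shard.
            client.getDatabase("admin").runCommand(
                    new Document("shardCollection", "shop.customers")
                            .append("key", new Document("region", 1)));
        }
    }
}
```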
MongoDB vs HBase¶