HBase¶
- An open-source version of Google BigTable
- HBase is a distributed, column-oriented data store built on top of HDFS
- HBase is an Apache open-source project whose goal is to provide storage with read-write capabilities for Hadoop
- Data is logically organized into tables, rows, and columns
HDFS and HBase¶
Both are distributed systems that scale to hundreds or thousands of nodes
However:
HDFS is good for batch processing and scans over big files
- Not good for record lookup
- Not good for incremental addition of small batches
- Not good for updates
HBase is designed to efficiently address the above points
- Fast record lookup
- Support for record-level insertion
- Support for updates (not in place)
HBase updates are done by creating new versions of values
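As a rough illustration, here is a minimal sketch of this behavior with the HBase Java client: the table name `users`, column family `info`, and qualifier `email` are hypothetical, and the column family is assumed to be configured to keep more than one version (VERSIONS > 1).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellUtil;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class VersionedUpdate {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) { // hypothetical table

            byte[] row = Bytes.toBytes("user42");
            byte[] cf  = Bytes.toBytes("info");
            byte[] col = Bytes.toBytes("email");

            // An "update" is simply another Put on the same cell; HBase stores it
            // as a new version keyed by timestamp rather than overwriting in place.
            table.put(new Put(row).addColumn(cf, col, Bytes.toBytes("old@example.com")));
            table.put(new Put(row).addColumn(cf, col, Bytes.toBytes("new@example.com")));

            // Read back several versions of the same cell (assumes the column
            // family keeps more than one version).
            Get get = new Get(row);
            get.setMaxVersions(3);
            Result result = table.get(get);
            for (Cell cell : result.getColumnCells(cf, col)) {
                System.out.println(cell.getTimestamp() + " -> "
                        + Bytes.toString(CellUtil.cloneValue(cell)));
            }
        }
    }
}
```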
HBase Data Model¶
- Key-value pairs
- Data is all byte[] in HBase
- Implicit primary key (in RDBMS terms)
- Different rows may have different sets of columns
- A single cell may have different values at different timestamps
- Each row has a key
- Each record is divided into column families
- Each column family consists of one or more columns
Key
- Byte array
- Serves as the primary key for the table
- Indexed for fast lookup
Column Family
- Has a name (string)
- Contains one or more related columns
Column
- Belongs to one column family
- Addressed inside a row as familyName:columnName
Version Number
- Unique within each key
- By default, the system's timestamp
- Data type is Long
Value (Cell)
- Byte array
An HBase schema consists of several Tables
Each table consists of a set of Column Families
- Columns are not part of the schema
HBase has Dynamic Columns
- Column names are encoded inside the cells
- Different cells can have different columns
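To make the "schema = tables + column families" point concrete, here is a minimal sketch using the HBase 2.x Admin and Table APIs; the table `products` and the per-row qualifiers `color` and `weight_kg` are hypothetical, and note that the qualifiers are never declared anywhere.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class DynamicColumns {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            // The schema only declares the table and its column families;
            // no individual columns are defined here.
            TableName name = TableName.valueOf("products"); // hypothetical table
            TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("details"))
                    .build();
            if (!admin.tableExists(name)) {
                admin.createTable(desc);
            }

            // Different rows can use entirely different column qualifiers,
            // because qualifiers are encoded inside the cells themselves.
            try (Table table = conn.getTable(name)) {
                table.put(new Put(Bytes.toBytes("p1"))
                        .addColumn(Bytes.toBytes("details"), Bytes.toBytes("color"),
                                   Bytes.toBytes("red")));
                table.put(new Put(Bytes.toBytes("p2"))
                        .addColumn(Bytes.toBytes("details"), Bytes.toBytes("weight_kg"),
                                   Bytes.toBytes("1.5")));
            }
        }
    }
}
```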
HBase physical model¶
- Each column family is stored in a separate file (called an HTable)
- Key and version numbers are replicated with each column family
- Empty cells are not stored
HBase Regions¶
Each HTable (column family) is partitioned horizontally into regions
Regions are the counterpart of HDFS blocks
HBase components¶
Region
- A subset of a table’s rows, like horizontal range partitioning
- Done automatically
RegionServer (many slaves)
- Manages data regions
- Serves data for reads and writes (using a log)
Master
- Responsible for coordinating the slaves
- Assigns regions, detects failures
- Admin functions
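Region splitting happens automatically as data grows, but the idea of a region as a row-key range can also be seen when pre-splitting a table at creation time. The sketch below uses the HBase 2.x Admin API; the table name and split keys are hypothetical.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.*;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Admin admin = conn.getAdmin()) {

            TableName name = TableName.valueOf("events"); // hypothetical table
            TableDescriptor desc = TableDescriptorBuilder.newBuilder(name)
                    .setColumnFamily(ColumnFamilyDescriptorBuilder.of("d"))
                    .build();

            // Split points define the initial row-key ranges; each range becomes a
            // region, and the Master assigns regions to RegionServers. Further
            // splits happen automatically as regions grow.
            byte[][] splitKeys = {
                    Bytes.toBytes("g"), Bytes.toBytes("n"), Bytes.toBytes("t")
            };
            admin.createTable(desc, splitKeys);
        }
    }
}
```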
MongoDB¶
- Started in 2007
- Targeting semi-structured data in JSON
- Designed to be easy to “scale out”
- Good support for indexing, partitioning, replication
- ACID transactions (multi-document transactions since version 4.0)
- Nice integration in web development stacks
- Not-so-great support for joins (or complex queries)
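A minimal sketch of basic usage with the synchronous MongoDB Java driver: inserting a JSON-like document, creating a secondary index, and querying it. The connection string, database `shop`, and collection `orders` are hypothetical.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.Indexes;
import org.bson.Document;

import java.util.Arrays;

public class MongoBasics {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoDatabase db = client.getDatabase("shop");        // hypothetical database
            MongoCollection<Document> orders = db.getCollection("orders");

            // Documents are JSON-like (BSON) and map naturally to application objects.
            orders.insertOne(new Document("customer", "alice")
                    .append("total", 99.50)
                    .append("items", Arrays.asList("book", "pen")));

            // Secondary index to support lookups by customer.
            orders.createIndex(Indexes.ascending("customer"));

            Document first = orders.find(Filters.eq("customer", "alice")).first();
            System.out.println(first.toJson());
        }
    }
}
```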
Advantages of the document model¶
- Documents are flexible: you can modify your schema at any time, allowing you to continuously integrate new application functionality without wrestling with complex schema migrations.
- Documents are polymorphic: documents in a collection can have different structures from other documents in the same collection.
- Documents are extensible: you can model data in any way your application demands, from rich, hierarchical documents to flat, table-like structures, simple key-value pairs, text, geospatial data, and the nodes and edges used in graph processing.
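As a sketch of polymorphism and extensibility, the two inserts below put documents with completely different structures (one flat, one nested) into the same collection; the database, collection, and field names are hypothetical, and the API is the synchronous Java driver.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import java.util.Arrays;

public class PolymorphicDocs {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> products =
                    client.getDatabase("shop").getCollection("products"); // hypothetical names

            // Two documents in the same collection with different structures:
            // one is flat and table-like...
            products.insertOne(new Document("sku", "B-100")
                    .append("type", "book")
                    .append("title", "NoSQL Distilled"));

            // ...the other is rich and hierarchical, with nested and array fields.
            products.insertOne(new Document("sku", "L-200")
                    .append("type", "laptop")
                    .append("specs", new Document("cpu", "8-core").append("ram_gb", 16))
                    .append("warehouse", new Document("geo",
                            Arrays.asList(-73.97, 40.77))));
        }
    }
}
```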
Scalability¶
- Through sharding, you can automatically scale your database out across multiple nodes to handle write-intensive workloads and growing data sizes
- Hashed Sharding
  - Documents are distributed according to an MD5 hash of the shard key value
  - Guarantees a uniform distribution of writes across shards
- Ranged Sharding
  - Documents are partitioned across shards according to the shard key value
  - Well suited for applications that need to optimize range-based queries, such as co-locating data for all customers in a specific region on a specific shard
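A sketch of enabling both strategies via the `enableSharding` and `shardCollection` admin commands, assuming a connection to a mongos router of an already deployed sharded cluster; the host, database, collections, and shard key fields are hypothetical.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import org.bson.Document;

public class ShardingSetup {
    public static void main(String[] args) {
        // Connect to a mongos router of a sharded cluster (hypothetical address).
        try (MongoClient client = MongoClients.create("mongodb://mongos-host:27017")) {
            // Enable sharding for the database before sharding its collections.
            client.getDatabase("admin").runCommand(
                    new Document("enableSharding", "shop"));

            // Hashed sharding: documents are distributed by a hash of the shard key,
            // spreading writes uniformly across shards.
            client.getDatabase("admin").runCommand(
                    new Document("shardCollection", "shop.orders")
                            .append("key", new Document("customerId", "hashed")));

            // Ranged sharding: documents are partitioned by shard-key ranges, which
            // keeps nearby key values (e.g. the same region) on the same shard.
            client.getDatabase("admin").runCommand(
                    new Document("shardCollection", "shop.customers")
                            .append("key", new Document("region", 1)));
        }
    }
}
```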
MongoDB vs HBase¶