Big Data Definition¶
Big Data is data whose scale, complexity, and speed require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.
- Old model: a few companies generate data; all others consume it
- New model: all of us generate and consume data at the same time
  - Social media
  - Scientific instruments
  - Mobile devices
  - Sensors
3 Vs¶
Since improvements in CPU performance have slowed in recent years, the only way to scale up is distributed computation.
Volume¶
Data volume is increasing exponentially
Velocity (streaming data)¶
- Data is generated fast and needs to be processed fast
- Static data vs. streaming data
- Online data analytics
- Late decisions 🡺 missing opportunities
Variety¶
- Various formats, types, and structures
- Numerical, text, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
- A single application can be generating/collecting many types of data
The frustration of parallel programming¶
- Race conditions
- Deadlock
- Hard to debug: race conditions and deadlocks are nondeterministic (see the sketch after this list)
- Most programming languages are low-level
  - The programmer needs to manage shared memory and/or communication
  - OpenMP is a good step forward, but still difficult for most programmers
- Programs written for multi-cores do not easily carry over to clusters
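To make the nondeterminism concrete, here is a minimal Python sketch (an illustration, not from the original notes): two threads update a shared counter without a lock, and updates are silently lost.

```python
import threading

counter = 0  # shared state with no synchronization

def increment(n):
    global counter
    for _ in range(n):
        tmp = counter   # read the shared value
        tmp += 1        # modify a local copy
        counter = tmp   # write back; a concurrent write in between is lost

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # expected 400000, but the result varies from run to run
```

Wrapping the read-modify-write in a `threading.Lock` fixes this race, but once several locks interact, acquiring them in inconsistent orders is exactly how deadlocks arise.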
The structure spectrum¶
Structured (schema-first): relational databases -- Semi-structured (schema-later): documents, XML -- Unstructured data (e.g., MongoDB)
Structured data¶
Modifying rows is easy, but modifying columns is costly (see the sketch after this list).
- The relational data model is the most widely used data model
- Every relation has a schema defining each column's type
- The programmer must statically specify the schema
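A small illustration of schema-first using Python's built-in sqlite3 module (the table and column names are invented for the example): column types are declared before any row exists, and changing the schema afterwards requires an explicit ALTER TABLE, which in general touches every existing row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema-first: column names and types are declared before any data exists.
conn.execute("CREATE TABLE Authors (AuthorID INTEGER PRIMARY KEY, AuthorName TEXT)")

# Adding rows that fit the schema is cheap.
conn.executemany(
    "INSERT INTO Authors (AuthorID, AuthorName) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob")],
)

# Changing the schema is a separate, heavier operation.
conn.execute("ALTER TABLE Authors ADD COLUMN Country TEXT")
print(conn.execute("SELECT * FROM Authors").fetchall())
# [(1, 'Alice', None), (2, 'Bob', None)]
```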
Semi-structured Data¶
A JSON object consists of a collection of name/value pairs, separated by commas (see the sketch after this list). Each value can be:
- A string
- A number
- A boolean
- null
- An array
- Another JSON object
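A small example using Python's standard json module (the field names and values are invented), with one object exercising every value type listed above; note that no schema is declared anywhere:

```python
import json

doc = '''{
  "title": "Big Data notes",
  "pages": 42,
  "published": true,
  "editor": null,
  "tags": ["hdfs", "mapreduce"],
  "author": {"name": "Ada", "id": 7}
}'''

record = json.loads(doc)             # parses with no schema declared up front
print(type(record["pages"]))         # <class 'int'>
print(record["author"]["name"])      # nested object: Ada
```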
How to handle big data¶
- Race conditions (use locks)
The frustration of parallel programming¶
- Hard to debug
- Hard to migrate from one architecture to another
Cloud computing¶
- Dynamic provisioning
- Scalability
- Elasticity
MapReduce¶
- Map: takes raw input and produces (key, value) pairs
- Reduce: takes all values with the same key and produces outputs
- Shuffling and sorting
  - Hidden phase between mappers and reducers
  - Groups all (key, value) pairs by key (see the sketch after this list)
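All three phases fit in a few lines of single-machine Python (a toy model of the dataflow, not the Hadoop API; word count is the usual first example):

```python
from collections import defaultdict

def map_phase(docs):
    # Map: raw input -> (key, value) pairs
    for doc_id, text in docs:
        for word in text.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle & sort: group all values by key, then sort by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # Reduce: all values with the same key -> one output
    for key, values in groups:
        yield (key, sum(values))

docs = [(1, "the cat sat"), (2, "the cat ran")]
print(list(reduce_phase(shuffle(map_phase(docs)))))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]
```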
Example: Inverted index¶
- Map: for each (url, doc) pair, emit (keyword, url) for each keyword in doc
- Reduce: for each keyword, output (keyword, list of urls)
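In the same toy single-machine style as the word-count sketch above (illustrative; the URLs and documents are invented):

```python
from collections import defaultdict

def map_inverted(pairs):
    # Map: for each (url, doc), emit (keyword, url) for every keyword in doc
    for url, doc in pairs:
        for keyword in doc.split():
            yield (keyword, url)

def reduce_inverted(pairs):
    # Shuffle + Reduce: collect, per keyword, the list of urls containing it
    index = defaultdict(list)
    for keyword, url in pairs:
        index[keyword].append(url)
    return dict(index)

crawl = [("a.com", "big data"), ("b.com", "big ideas")]
print(reduce_inverted(map_inverted(crawl)))
# {'big': ['a.com', 'b.com'], 'data': ['a.com'], 'ideas': ['b.com']}
```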
Example: translate this SQL query into MapReduce¶
SELECT AuthorName FROM Authors, Books WHERE Authors.AuthorID=Books.AuthorID AND Books.Date>1980
Answer (a sketch follows this list):
- For each record in the Authors table:
  - Map: emit (AuthorID, AuthorName)
- For each record in the Books table:
  - Map: emit (AuthorID, Date)
- Reduce: for each AuthorID, if any Date > 1980, output AuthorName
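One detail the answer leaves implicit: the reducer must know which table each value came from, so map output is usually tagged with its source. A minimal single-machine sketch of this reduce-side join (the tag names "A"/"B" and the sample rows are invented):

```python
from collections import defaultdict

authors = [(1, "Alice"), (2, "Bob")]        # (AuthorID, AuthorName)
books = [(1, 1975), (1, 1990), (2, 1960)]   # (AuthorID, Date)

# Map: tag each value with its source table so the reducer can join.
pairs = [(aid, ("A", name)) for aid, name in authors]
pairs += [(aid, ("B", date)) for aid, date in books]

# Shuffle: group all values by AuthorID.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: output the author's name if any of their books is dated after 1980.
for author_id, values in groups.items():
    name = next(v for tag, v in values if tag == "A")
    if any(tag == "B" and v > 1980 for tag, v in values):
        print(name)  # prints: Alice
```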
Where do we store data¶
Distributed storage, because:
- I/O is a bottleneck
- Building a high-end supercomputer is very costly
- Storing all data in one place adds the risk of hardware failures
Target environment¶
- Thousands of computers
- Distributed
- Failures are the norm
- Files are huge, but not many
  - Greater than 100 MB, usually multi-gigabyte
- Read/write characteristics (write-once, read-many)
  - Files are mutated by appending
  - Once written, files are typically only read
  - Large streaming reads and small random reads are typical
  - I/O bandwidth is more important than latency
GFS design decisions¶
- Files are stored as chunks
- Reliability through replication
- A single master coordinates access and keeps metadata
HDFS¶
Hadoop is an open-source implementation of Google's MapReduce and GFS; HDFS is its distributed file system. It offers:
- A clean and simple programming abstraction
- Automatic parallelization and distribution
- Fault tolerance and automatic recovery
Hadoop architecture¶
Single master node, many slave nodes:
- Distributed file system (HDFS)
- Execution engine (MapReduce)
HDFS details¶
NameNode: maintains metadata about files
- Maps a filename to a set of blocks
- Maps a block to the DataNodes where it resides
- Runs the replication engine for blocks

DataNode: stores data
- Files are divided into blocks
- Communicates with the NameNode through periodic heartbeats

Replication engine:
- Upon detecting a DataNode failure, chooses new DataNodes for replicas
- Balances disk usage
- Balances communication traffic to DataNodes
With m data cells and n parity cells:
Storage efficiency = \(\frac{m}{m+n}\)
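For example, under a Reed-Solomon RS(6, 3) layout (a common policy for HDFS erasure coding), \(m = 6\) and \(n = 3\) give a storage efficiency of \(\frac{6}{6+3} = \frac{2}{3} \approx 67\%\), compared with \(\frac{1}{3} \approx 33\%\) for 3-way replication.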
NameNode failure¶
1) What is the single point of failure in Hadoop v1?
- The single point of failure in Hadoop v1 is the NameNode. If the NameNode fails, the whole Hadoop cluster stops working. There is no data loss, but the cluster becomes unavailable, because the NameNode is the only point of contact for all DataNodes; if it fails, all communication stops.
2) What are the available solutions to handle the single point of failure in Hadoop 1?
- We can run a setup that backs up the NameNode metadata. If the primary NameNode fails, the setup switches to the secondary (backup) NameNode, and the Hadoop cluster never has to shut down.
3) How has the single point of failure been addressed in Hadoop 2?
- Hadoop 2 introduces HDFS High Availability for the NameNode. Two separate machines are configured as NameNodes: one is always active, the other is on standby. The active NameNode handles all client requests in the cluster, while the standby acts as a slave, maintaining enough state to provide fast failover if the active NameNode fails.
On worker failure¶
- Detect failure via periodic heartbeats
  - Workers send heartbeat messages (pings) periodically to the master node
- Re-execute completed and in-progress map tasks (map output lives on the failed worker's local disk)
- Re-execute in-progress reduce tasks
- Task completion is committed through the master (see the sketch after this list)
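A minimal sketch of timeout-based failure detection on the master (illustrative only; real Hadoop uses RPC heartbeats with configurable intervals, and the timeout value here is an assumption):

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a ping before a worker is presumed dead

last_heartbeat = {}  # worker id -> time of last ping

def on_heartbeat(worker_id):
    # Called whenever a worker pings the master.
    last_heartbeat[worker_id] = time.monotonic()

def dead_workers():
    # Workers whose last ping is older than the timeout are presumed failed;
    # their map tasks must be rescheduled, including completed ones, because
    # map output is stored on the failed worker's local disk.
    now = time.monotonic()
    return [w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

on_heartbeat("worker-1")
on_heartbeat("worker-2")
print(dead_workers())  # [] -- both pinged recently
```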
MapReduce: A major step backwards¶
- MapReduce may be a good idea for writing certain types of computations
- A giant step backward in the programming paradigm for large-scale data-intensive applications
- A sub-optimal implementation, in that it uses brute force instead of indexing
- Missing most of the features that are routinely included in current DBMSs
- Not novel at all