Big Data Definition¶
Big Data is data whose scale, complexity, and speed require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.
- Old model: a few companies generate data; all others consume it
- New model: all of us generate and consume data at the same time
  - Social media
  - Scientific instruments
  - Mobile devices
  - Sensors
3 Vs¶
Since improvements in CPU performance have slowed in recent years, the only way to scale up is distributed computation.
Volume¶
Data volume is increasing exponentially
Velocity (streaming data)¶
- Data is generated fast and needs to be processed fast
- Static data vs. streaming data
- Online data analytics
- Late decisions 🡺 missing opportunities
Variety¶
- Various formats, types, and structures
- Numerical, text, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
- A single application can be generating/collecting many types of data
The frustration of parallel programming¶
- Race conditions
- Deadlock
- Hard to debug: race conditions and deadlocks are nondeterministic (see the sketch after this list)
- Most programming languages are low-level
  - The programmer needs to manage shared memory and/or communication
  - OpenMP is a good step forward, but still difficult for most programmers
- Programs written for multi-cores do not easily carry over to clusters
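To make the nondeterminism concrete, here is a minimal Python sketch (an illustration, not from the original notes): two threads update a shared counter without a lock, and updates are silently lost.

```python
import threading

counter = 0  # shared state with no synchronization

def increment(n):
    global counter
    for _ in range(n):
        tmp = counter   # read the shared value
        tmp += 1        # modify a local copy
        counter = tmp   # write back; a concurrent write in between is lost

threads = [threading.Thread(target=increment, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)  # expected 400000, but the result varies from run to run
```

Wrapping the read-modify-write in a `threading.Lock` fixes this race, but once several locks interact, acquiring them in inconsistent orders is exactly how deadlocks arise.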
The structure spectrum¶
Structured (schema-first): relational databases -- Semi-structured (schema-later): documents, XML -- Unstructured data (e.g., MongoDB)
Structured data¶
Modifying rows is easy, but modifying columns is costly (see the sketch after this list).
- The relational data model is the most widely used data model
- Every relation has a schema defining each column's type
- The programmer must statically specify the schema
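A small illustration of schema-first using Python's built-in sqlite3 module (the table and column names are invented for the example): column types are declared before any row exists, and changing the schema afterwards requires an explicit ALTER TABLE, which in general touches every existing row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema-first: column names and types are declared before any data exists.
conn.execute("CREATE TABLE Authors (AuthorID INTEGER PRIMARY KEY, AuthorName TEXT)")

# Adding rows that fit the schema is cheap.
conn.executemany(
    "INSERT INTO Authors (AuthorID, AuthorName) VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob")],
)

# Changing the schema is a separate, heavier operation.
conn.execute("ALTER TABLE Authors ADD COLUMN Country TEXT")
print(conn.execute("SELECT * FROM Authors").fetchall())
# [(1, 'Alice', None), (2, 'Bob', None)]
```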
Semi-structured Data¶
A JSON object consists of a collection of name/value pairs, separated by commas (see the sketch after this list). Each value can be:
- A string
- A number
- A boolean
- null
- An array
- Another JSON object
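A small example using Python's standard json module (the field names and values are invented), with one object exercising every value type listed above; note that no schema is declared anywhere:

```python
import json

doc = '''{
  "title": "Big Data notes",
  "pages": 42,
  "published": true,
  "editor": null,
  "tags": ["hdfs", "mapreduce"],
  "author": {"name": "Ada", "id": 7}
}'''

record = json.loads(doc)             # parses with no schema declared up front
print(type(record["pages"]))         # <class 'int'>
print(record["author"]["name"])      # nested object: Ada
```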
How to handle big data¶
- Race conditions (use locks)
The frustration of parallel programming¶
- Hard to debug
- Hard to migrate from one architecture to another
Cloud computing¶
- Dynamic provisioning
- Scalability
- Elasticity
MapReduce¶
- Map: takes raw input and produces (key, value) pairs
- Reduce: takes all values with the same key and produces outputs
- Shuffling and sorting
  - Hidden phase between mappers and reducers
  - Groups all (key, value) pairs by key (see the sketch after this list)
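All three phases fit in a few lines of single-machine Python (a toy model of the dataflow, not the Hadoop API; word count is the usual first example):

```python
from collections import defaultdict

def map_phase(docs):
    # Map: raw input -> (key, value) pairs
    for doc_id, text in docs:
        for word in text.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle & sort: group all values by key, then sort by key
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(groups):
    # Reduce: all values with the same key -> one output
    for key, values in groups:
        yield (key, sum(values))

docs = [(1, "the cat sat"), (2, "the cat ran")]
print(list(reduce_phase(shuffle(map_phase(docs)))))
# [('cat', 2), ('ran', 1), ('sat', 1), ('the', 2)]
```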
Example: Inverted index¶
- Map: for each (url, doc) pair, emit (keyword, url) for each keyword in doc
- Reduce: for each keyword, output (keyword, list of urls)
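In the same toy single-machine style as the word-count sketch above (illustrative; the URLs and documents are invented):

```python
from collections import defaultdict

def map_inverted(pairs):
    # Map: for each (url, doc), emit (keyword, url) for every keyword in doc
    for url, doc in pairs:
        for keyword in doc.split():
            yield (keyword, url)

def reduce_inverted(pairs):
    # Shuffle + Reduce: collect, per keyword, the list of urls containing it
    index = defaultdict(list)
    for keyword, url in pairs:
        index[keyword].append(url)
    return dict(index)

crawl = [("a.com", "big data"), ("b.com", "big ideas")]
print(reduce_inverted(map_inverted(crawl)))
# {'big': ['a.com', 'b.com'], 'data': ['a.com'], 'ideas': ['b.com']}
```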
Example: translate this SQL query into MapReduce¶
SELECT AuthorName FROM Authors, Books WHERE Authors.AuthorID=Books.AuthorID AND Books.Date>1980
Answer (a sketch follows this list):
- For each record in the Authors table:
  - Map: emit (AuthorID, AuthorName)
- For each record in the Books table:
  - Map: emit (AuthorID, Date)
- Reduce: for each AuthorID, if any Date > 1980, output AuthorName
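One detail the answer leaves implicit: the reducer must know which table each value came from, so map output is usually tagged with its source. A minimal single-machine sketch of this reduce-side join (the tag names "A"/"B" and the sample rows are invented):

```python
from collections import defaultdict

authors = [(1, "Alice"), (2, "Bob")]        # (AuthorID, AuthorName)
books = [(1, 1975), (1, 1990), (2, 1960)]   # (AuthorID, Date)

# Map: tag each value with its source table so the reducer can join.
pairs = [(aid, ("A", name)) for aid, name in authors]
pairs += [(aid, ("B", date)) for aid, date in books]

# Shuffle: group all values by AuthorID.
groups = defaultdict(list)
for key, value in pairs:
    groups[key].append(value)

# Reduce: output the author's name if any of their books is dated after 1980.
for author_id, values in groups.items():
    name = next(v for tag, v in values if tag == "A")
    if any(tag == "B" and v > 1980 for tag, v in values):
        print(name)  # prints: Alice
```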
Where do we store data¶
Distributed storage, because:
- I/O is a bottleneck
- Building a high-end supercomputer is very costly
- Storing all data in one place adds the risk of hardware failures
Target environment¶
- Thousands of computers
- Distributed
- Failures are the norm
- Files are huge, but not many
  - Greater than 100 MB, usually multi-gigabyte
- Read/write characteristics (write-once, read-many)
  - Files are mutated by appending
  - Once written, files are typically only read
  - Large streaming reads and small random reads are typical
  - I/O bandwidth is more important than latency
GFS design decisions¶
- Files are stored as chunks
- Reliability through replication
- A single master coordinates access and keeps metadata
HDFS¶
Hadoop is an open-source implementation of Google's MapReduce and GFS; HDFS is its distributed file system. It offers:
- A clean and simple programming abstraction
- Automatic parallelization and distribution
- Fault tolerance and automatic recovery
Hadoop architecture¶
Single master node, many slave nodes:
- Distributed file system (HDFS)
- Execution engine (MapReduce)
HDFS details¶
NameNode: maintains metadata about files
- Maps a filename to a set of blocks
- Maps a block to the DataNodes where it resides
- Runs the replication engine for blocks

DataNode: stores data
- Files are divided into blocks
- Communicates with the NameNode through periodic heartbeats

Replication engine:
- Upon detecting a DataNode failure, chooses new DataNodes for replicas
- Balances disk usage
- Balances communication traffic to DataNodes
With m data cells and n parity cells:
Storage efficiency = \(\frac{m}{m+n}\)
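For example, under a Reed-Solomon RS(6, 3) layout (a common policy for HDFS erasure coding), \(m = 6\) and \(n = 3\) give a storage efficiency of \(\frac{6}{6+3} = \frac{2}{3} \approx 67\%\), compared with \(\frac{1}{3} \approx 33\%\) for 3-way replication.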
NameNode failure¶
1) What is the single point of failure in Hadoop v1?
- The single point of failure in Hadoop v1 is the NameNode. If the NameNode fails, the whole Hadoop cluster stops working. There is no data loss, but the cluster becomes unavailable, because the NameNode is the only point of contact for all DataNodes; if it fails, all communication stops.
2) What are the available solutions to handle the single point of failure in Hadoop 1?
- We can run a setup that backs up the NameNode metadata. If the primary NameNode fails, the setup switches to the secondary (backup) NameNode, and the Hadoop cluster never has to shut down.
3) How has the single point of failure been addressed in Hadoop 2?
- Hadoop 2 introduces HDFS High Availability for the NameNode. Two separate machines are configured as NameNodes: one is always active, the other is on standby. The active NameNode handles all client requests in the cluster, while the standby acts as a slave, maintaining enough state to provide fast failover if the active NameNode fails.
On worker failure¶
- Detect failure via periodic heartbeats
  - Workers send heartbeat messages (pings) periodically to the master node
- Re-execute completed and in-progress map tasks (map output lives on the failed worker's local disk)
- Re-execute in-progress reduce tasks
- Task completion is committed through the master (see the sketch after this list)
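A minimal sketch of timeout-based failure detection on the master (illustrative only; real Hadoop uses RPC heartbeats with configurable intervals, and the timeout value here is an assumption):

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds without a ping before a worker is presumed dead

last_heartbeat = {}  # worker id -> time of last ping

def on_heartbeat(worker_id):
    # Called whenever a worker pings the master.
    last_heartbeat[worker_id] = time.monotonic()

def dead_workers():
    # Workers whose last ping is older than the timeout are presumed failed;
    # their map tasks must be rescheduled, including completed ones, because
    # map output is stored on the failed worker's local disk.
    now = time.monotonic()
    return [w for w, t in last_heartbeat.items() if now - t > HEARTBEAT_TIMEOUT]

on_heartbeat("worker-1")
on_heartbeat("worker-2")
print(dead_workers())  # [] -- both pinged recently
```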
MapReduce: A major step backwards¶
- MapReduce may be a good idea for writing certain types of computations
- A giant step backward in the programming paradigm for large-scale data-intensive applications
- A sub-optimal implementation, in that it uses brute force instead of indexing
- Missing most of the features that are routinely included in current DBMSs
- Not novel at all