Big Data Definition

Big Data is data whose scale, complexity, and speed require new architecture, techniques, algorithms, and analytics to manage it and extract value and hidden knowledge from it.

  • Old model: Few companies are generating data, all others are consuming data

  • New model: all of us are generating and consuming data at the same time

    • Social media
    • Scientific instruments
    • Mobile devices
    • Sensors

The 3 Vs

Since improvements in CPU technology have slowed in recent years, the only way to scale up is distributed computation.

Volume

Data volume is increasing exponentially

Velocity (streaming data)

  • Data is being generated fast and needs to be processed fast
  • From static data to streaming data

  • Online Data Analytics

  • Late decisions 🡺 missing opportunities

Variety

  • Various formats, types, and structures
  • Numerical, text, images, audio, video, sequences, time series, social media data, multi-dim arrays, etc…
  • A single application can be generating/collecting many types of data

The frustration of parallel programming

  • Race conditions
  • Deadlock
  • Hard to debug: Race conditions and deadlocks are nondeterministic (see the sketch after this list)

  • Most programming languages are low-level

    • The programmer needs to manage shared memory and/or communication
    • OpenMP is a good step forward, but still difficult for most programmers
  • Programs written for multi-cores do not easily carry over to clusters
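As a minimal sketch of why these bugs are nondeterministic (plain Python threads; not from the original notes, and the `time.sleep(0)` call is only there to widen the race window so the lost updates show up reliably):

```python
import threading
import time

counter = 0  # shared state, deliberately left unprotected by any lock

def work(n):
    """Increment the shared counter n times without synchronization."""
    global counter
    for _ in range(n):
        tmp = counter      # read
        time.sleep(0)      # yield to other threads, widening the race window
        counter = tmp + 1  # write back a possibly stale value

threads = [threading.Thread(target=work, args=(5000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Expected 20000, but lost updates make the result smaller, and it changes
# from run to run -- the failure is nondeterministic and hard to reproduce.
print(counter)
```

With a `threading.Lock` around the read-modify-write, the count is always 20000; that is the "use locks" fix mentioned later, at the cost of the coordination burden these notes describe.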

The structure spectrum

  • Structured (schema-first): relational databases
  • Semi-structured (schema-later): documents, XML
  • Unstructured data (MongoDB)

Structured data

Modifying rows is easy, but modifying columns is costly (as sketched after this list).

  • The relational data model is the most used data model

  • Every relation has a schema defining each column's type

  • The programmer must statically specify the schema
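A minimal schema-first sketch using Python's built-in sqlite3 module (the table and its columns are invented for the example):

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Schema-first: column names and types must be declared before any data goes in.
conn.execute("CREATE TABLE Books (BookID INTEGER, AuthorID INTEGER, Date INTEGER)")

# Adding rows is cheap and routine...
conn.execute("INSERT INTO Books VALUES (1, 42, 1997)")

# ...but changing the columns means changing the schema that every existing
# and future row must follow.
conn.execute("ALTER TABLE Books ADD COLUMN Title TEXT")

print(conn.execute("SELECT * FROM Books").fetchall())  # [(1, 42, 1997, None)]
```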

Semi-structured Data

A JSON object consists of a collection of name/value pairs, separated by commas.

Each value can be:

  • A string
  • A number
  • A boolean
  • null
  • An array
  • A JSON object
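For illustration, a small JSON object whose values cover every case above (the field names and values are made up), parsed with Python's standard json module:

```python
import json

# A JSON object: name/value pairs separated by commas; the values below are
# a string, a number, a boolean, null, an array, and a nested JSON object.
text = """
{
  "title": "Big Data notes",
  "year": 2023,
  "published": true,
  "editor": null,
  "keywords": ["mapreduce", "hdfs", "gfs"],
  "author": {"name": "Alice", "id": 42}
}
"""

doc = json.loads(text)        # JSON object -> Python dict
print(doc["keywords"][0])     # JSON arrays -> Python lists
print(doc["author"]["name"])  # nested objects -> nested dicts
```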

How to handle big data

The frustrations of parallel programming come up again:

  • Race conditions (use locks)
  • Hard to debug
  • How to migrate from one architecture to another

Cloud computing

  • Dynamic provisioning
  • Scalability
  • Elasticity

MapReduce

Map: takes raw input and produces (key, value) pairs

Reduce: takes all data with the same key and produces outputs

Shuffling and sorting

  • Hidden phase between mappers and reducers
  • Groups all (key, value) pairs by key

Example: Inverted index

Map

For each (url, doc) pair

Emit (keyword, url) for each keyword in doc

Reduce

For each keyword, output (keyword, list of urls)
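A minimal in-memory sketch of this job in plain Python; the function names and the dictionary-based shuffle are illustrative, not Hadoop's actual API:

```python
from collections import defaultdict

def map_fn(url, doc):
    """Map: emit (keyword, url) for each keyword in the document."""
    for keyword in set(doc.split()):
        yield keyword, url

def reduce_fn(keyword, urls):
    """Reduce: output (keyword, list of urls)."""
    return keyword, sorted(set(urls))

def run(pairs):
    # Shuffle and sort: group all intermediate (key, value) pairs by key.
    groups = defaultdict(list)
    for url, doc in pairs:
        for key, value in map_fn(url, doc):
            groups[key].append(value)
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

corpus = [("a.com", "big data hadoop"), ("b.com", "big ideas")]
print(run(corpus))
# [('big', ['a.com', 'b.com']), ('data', ['a.com']), ('hadoop', ['a.com']), ('ideas', ['b.com'])]
```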

Example: translate this SQL query into MapReduce

SELECT AuthorName FROM Authors, Books WHERE Authors.AuthorID=Books.AuthorID AND Books.Date>1980

Answer:

  • For each record in the ‘Authors’ table:

    • Map: Emit (AuthorID, AuthorName)
  • For each record in the ‘Books’ table:

    • Map: Emit (AuthorID, Date)

Reduce:

  • For each AuthorID, if Date>1980, output AuthorName
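A minimal in-memory sketch of this reduce-side join in plain Python. One detail the outline above leaves implicit: the reducer has to tell author names apart from dates, so each mapper tags its values. The sample records are made up:

```python
from collections import defaultdict

def map_author(record):
    """Authors table: emit (AuthorID, tagged AuthorName)."""
    yield record["AuthorID"], ("name", record["AuthorName"])

def map_book(record):
    """Books table: emit (AuthorID, tagged Date)."""
    yield record["AuthorID"], ("date", record["Date"])

def reduce_join(author_id, values):
    """For each AuthorID, output the AuthorName if any of their books has Date > 1980."""
    names = [v for tag, v in values if tag == "name"]
    dates = [v for tag, v in values if tag == "date"]
    if names and any(d > 1980 for d in dates):
        yield names[0]

def run(authors, books):
    groups = defaultdict(list)  # shuffle: group tagged values by AuthorID
    for rec in authors:
        for k, v in map_author(rec):
            groups[k].append(v)
    for rec in books:
        for k, v in map_book(rec):
            groups[k].append(v)
    return [name for k, vals in groups.items() for name in reduce_join(k, vals)]

authors = [{"AuthorID": 1, "AuthorName": "Knuth"}, {"AuthorID": 2, "AuthorName": "Codd"}]
books = [{"AuthorID": 1, "Date": 1997}, {"AuthorID": 2, "Date": 1970}]
print(run(authors, books))  # ['Knuth']
```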

Where do we store the data?

Use a distributed store:

  • I/O is a bottleneck
  • Building a high-end supercomputer is very costly
  • Storing all data in one place adds the risk of hardware failures

Target environment

  • Thousands of computers

  • Distributed

  • Failures are the norm

  • Files are huge, but not many

    • larger than 100 MB, usually multi-gigabyte
  • Read/write characteristics (write-once, read-many)

    • Files are mutated by appending
    • Once written, files are typically only read

    • Large streaming reads and small random reads are typical

  • I/O bandwidth is more important than latency

GFS design decisions

  • Files are stored as chunks

  • Reliability through replication

  • Single master to coordinate access, keep metadata

HDFS

The distributed file system of Hadoop, an open-source implementation of Google's MapReduce and GFS

  • Clean and simple programming abstraction

  • Automatic parallelization & distribution

  • Fault tolerance and automatic recovery

Hadoop architecture

Single master node, many slave nodes

  1. Distributed file system (HDFS)

  2. Execution engine (MapReduce)

HDFS details

NameNode: maintains metadata about files

  • Maps a filename to a set of blocks
  • Maps a block to the DataNodes where it resides (sketched below)
  • Acts as the replication engine for blocks
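A conceptual sketch of those two mappings as plain Python dictionaries (file names, block IDs, and node names are invented; this is not HDFS's actual data structure):

```python
# filename -> ordered list of block IDs
file_to_blocks = {
    "/logs/2023-01-01.log": ["blk_001", "blk_002"],
}

# block ID -> DataNodes currently holding a replica
block_to_datanodes = {
    "blk_001": {"datanode-3", "datanode-7", "datanode-12"},
    "blk_002": {"datanode-1", "datanode-7", "datanode-9"},
}

def locate(path):
    """Resolve a file to the DataNodes that serve each of its blocks."""
    return [(blk, sorted(block_to_datanodes[blk])) for blk in file_to_blocks[path]]

print(locate("/logs/2023-01-01.log"))
```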

DataNode

  • Stores data
  • Files are divided into blocks
  • Communicates with the NameNode through periodic heartbeats

Replication engine

  • Upon detecting a DataNode failure, chooses new DataNodes for the affected replicas
  • Balances disk usage
  • Balances communication traffic to DataNodes

m data cells and n parity cells

Storage efficiency = \(\frac{m}{m+n}\)
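As a worked example (the particular configuration below is a common erasure-coding choice, not from the notes): with \(m = 6\) data cells and \(n = 3\) parity cells, storage efficiency is \(\frac{6}{6+3} = \frac{2}{3} \approx 67\%\), whereas plain 3-way replication stores three full copies and achieves only \(\frac{1}{3} \approx 33\%\).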

NameNode failure

1) What is the single point of failure in Hadoop v1?

  • The single point of failure in Hadoop v1 is the NameNode. If the NameNode fails, the whole Hadoop cluster stops working. There is no data loss, but the cluster shuts down, because the NameNode is the only point of contact for all DataNodes; if it fails, all communication stops.

2) What are the available solutions to handle the single point of failure in Hadoop 1?

  • To handle the single point of failure, we can use a setup that backs up the NameNode metadata. If the primary NameNode fails, the setup can switch to the secondary (backup) NameNode, so the Hadoop cluster does not have to shut down.

3) How has the single point of failure been addressed in Hadoop 2?

  • HDFS NameNode High Availability was introduced with Hadoop 2. Two separate machines are configured as NameNodes: one is always in the active (working) state and the other is in standby. The active NameNode handles all client requests in the cluster, while the standby acts as a slave and maintains enough state to provide a fast failover if the active NameNode fails.

On worker failure

  • Detect failure via periodic heartbeats (a conceptual sketch follows this list)

    • Workers send heartbeat messages (pings) periodically to the master node
  • Re-execute completed and in-progress map tasks

  • Re-execute in-progress reduce tasks
  • Task completion is committed through the master
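A conceptual sketch of heartbeat-based failure detection on the master (the timeout, variable names, and bookkeeping are invented for illustration, not Hadoop's actual implementation):

```python
import time

HEARTBEAT_TIMEOUT = 10.0  # seconds of silence before a worker is presumed dead

last_seen = {}   # worker id -> time of last heartbeat
assigned = {}    # worker id -> set of task ids currently running on that worker
pending = set()  # tasks waiting to be (re)scheduled on a healthy worker

def on_heartbeat(worker_id):
    """Called by the master whenever a worker pings it."""
    last_seen[worker_id] = time.monotonic()

def check_failures():
    """Mark silent workers as failed and put their tasks back in the queue."""
    now = time.monotonic()
    for worker_id, seen in list(last_seen.items()):
        if now - seen > HEARTBEAT_TIMEOUT:
            # Re-execute the failed worker's tasks elsewhere (for map tasks,
            # even completed ones, since their output lived on the dead node).
            pending.update(assigned.pop(worker_id, set()))
            del last_seen[worker_id]
```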

MapReduce: a major step backwards

  • MapReduce may be a good idea for writing certain types of computations

  • A giant step backward in the programming paradigm for large-scale data-intensive applications

  • A sub-optimal implementation, in that it uses brute force instead of indexing

  • Missing most of the features that are routinely included in current DBMSs

  • Not novel at all