Friday, April 19, 2013

MapReduce - Part I


MapReduce is a paradigm for distributed computing: it provides a framework for large-scale, parallel data processing and can be applied to many large-scale computing problems. The name MapReduce is inspired by the map and reduce functions of the LISP programming language, but it is not an implementation of those LISP functions.

From a user's perspective, there are two basic operations in MapReduce: Map and Reduce.

Typical problems solved by MapReduce:
Read a lot of data
Map: extract something you care about from each record.
Shuffle and sort - the MapReduce framework reads the output of the Map phase and sorts and groups it by key.

Reduce: this phase aggregates, summarizes, filters, or transforms the grouped data and writes the results to a file (see the word-count sketch below).
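As a concrete example, here is a minimal word-count sketch of the two user-written functions. This is illustrative Python, not the API of any particular MapReduce implementation; the names map_fn and reduce_fn are my own.

def map_fn(record):
    # Map: emit a (word, 1) pair for every word in one input record (a line of text).
    for word in record.split():
        yield (word.lower(), 1)

def reduce_fn(key, values):
    # Reduce: sum all the counts that were emitted for one word.
    yield (key, sum(values))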


Why do we need distributed computing like MapReduce?

Because some problems are simply too big to solve on a single machine.

Example: 

20+ billion web pages x 20 KB = 400+ terabytes
- One computer can read 30-35 MB/sec from disk: ~four months just to read the web
- ~1,000 hard drives just to store the web
- Even more time to actually do something with the data
 
Using MapReduce, the same problem can be solved with ~1,000 machines in under 3 hours.
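A quick back-of-envelope check of those numbers (a Python sketch; 35 MB/sec is the upper end of the read rate quoted above, and the figures are assumptions, not measurements):

pages = 20e9               # 20+ billion web pages
page_size = 20e3           # 20 KB per page, in bytes
read_rate = 35e6           # ~35 MB/sec from a single disk

total_bytes = pages * page_size        # 4e14 bytes
seconds = total_bytes / read_rate      # ~1.1e7 seconds
print(total_bytes / 1e12, "TB")        # -> 400.0 TB
print(seconds / 86400, "days")         # -> ~132 days, roughly four months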


The Map function reads a stream of data and parses it into intermediate (key, value) pairs. When that is complete, the Reduce function is called once for each unique key that was generated by Map, and is given as parameters the key and a list of all values that were generated for that key. The keys are presented in sorted order.
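To make that contract concrete, here is a tiny single-machine driver (an illustrative sketch, not any framework's real API) that runs the map_fn/reduce_fn pair from the word-count sketch above: it applies Map to every record, groups the intermediate pairs by key, and then calls Reduce once per key in sorted order.

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    # Map phase: collect intermediate (key, value) pairs.
    intermediate = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            intermediate[key].append(value)
    # Shuffle/sort + Reduce phase: one Reduce call per key, keys in sorted order.
    results = []
    for key in sorted(intermediate):
        results.extend(reduce_fn(key, intermediate[key]))
    return results

# run_mapreduce(["the cat", "the dog"], map_fn, reduce_fn)
# -> [('cat', 1), ('dog', 1), ('the', 2)]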

Programmers get a simple API and do not have to deal with issues of parallelization, remote execution, data distribution, load balancing, or fault tolerance. The framework makes it easy for one to use thousands of processors to process huge amounts of data (e.g., terabytes and petabytes). 

MapReduce is not a general-purpose framework for all forms of parallel programming. Rather, it is designed specifically for problems that can be expressed in the map-reduce paradigm.

Credit:
Much of this information is from the following source:
Distributed Systems course - www.cs.rutgers.edu/~pxk/417/notes

Saturday, April 13, 2013

The Google File System (GFS)

A brief overview of GFS.

GFS development was motivated by the need for a scalable distributed file system that supports large-scale data-processing workloads on commodity hardware. In GFS, files are divided into fixed-size chunks (64 MB in the original design) that are replicated across chunkservers to deliver aggregate performance and fault tolerance. Each chunk is identified by a unique 64-bit chunk handle.
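For instance, finding which chunk holds a given byte of a file is a simple calculation on the client side. A sketch, assuming the 64 MB chunk size from the paper (chunk_index is a name I made up for illustration):

CHUNK_SIZE = 64 * 1024 * 1024   # fixed 64 MB chunks

def chunk_index(byte_offset):
    # Every byte in [i * CHUNK_SIZE, (i + 1) * CHUNK_SIZE) belongs to chunk i;
    # the master maps (file, chunk index) to the chunk's 64-bit handle.
    return byte_offset // CHUNK_SIZE

# chunk_index(200 * 1024 * 1024) == 3, i.e. the fourth chunk of the file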

GFS has a single master (for simplicity) and multiple chunkservers holding the replicas. The master and chunkservers coordinate using periodic heartbeat messages. GFS is fault tolerant and scales to many terabytes of storage.

Here is the architecture diagram from the GFS paper.



In the diagram above, the GFS client contacts the GFS master to obtain chunk locations, and then contacts one of the chunkservers to read the data.
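A rough sketch of that read path, building on the chunk_index helper from the earlier sketch. ask_master and read_from are hypothetical helper names standing in for the real RPCs, and a real client also caches chunk locations instead of asking the master every time:

def gfs_read(filename, byte_offset, length):
    index = chunk_index(byte_offset)                 # computed locally by the client
    handle, replicas = ask_master(filename, index)   # step 1: ask the master for the
                                                     # chunk handle and replica locations
    offset_in_chunk = byte_offset % CHUNK_SIZE
    return read_from(replicas[0], handle,            # step 2: fetch the data from one
                     offset_in_chunk, length)        # of the chunkservers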

Reference:
The Google File System - Ghemawat, Gobioff, and Leung (SOSP 2003)