Mapreduce is a paradigm for distributed computing. It provides a framework for parallel computing. MapReduce enables large scale distributed data processing. MapReduce can be applied to many large scale computing problems. The name MapReduce is inspired from Map and Reduce functions in LISP programming language.But it is not an implementation of this lisp functions.
From a user's perspective, there are two basic operations in MapReduce: Map and Reduce.
Typical problems solved by MapReduce:
Read a lot of data
Map: extract something you care about from each record.
Shuffle and sort - MapReduce framework will read output from Map phase, and perform sort and grouping.
Reduce: This phase aggregate, summarize, filter or transform data and write the results to file.
Why do we need distributed computing like MapReduce
Otherwise some problems are too big to solve
Example:
20+ billion web pages x 20KB = 400+ terabytes- One computer can read 30-35 MB/sec from disk
~four months to read the web
~1,000 hard drives just to store the web
Even more to do something with the data
Using MapReduce: same problem can be solved with 1000 machines, < 3 hours
The Map function reads a stream of data and parses it into intermediate (key, value) pairs. When that is complete, the Reduce function is called once for each unique key that was generated by Map and is given the key and a list of all values that were generated for that key as a parameter. The keys are presented in sorted order.
Programmers get a simple API and do not have to deal with issues of parallelization, remote execution, data distribution, load balancing, or fault tolerance. The framework makes it easy for one to use thousands of processors to process huge amounts of data (e.g., terabytes and petabytes).
MapReduce is not a general-purpose framework for all forms of parallel programming. Rather, it is designed specifically for problems that can be broken up into the the map-reduce paradigm.
Credit:
Much of this information is from below articles:
Distributed Systems course - www.cs.rutgers.edu/~pxk/417/notesMuch of this information is from below articles: