Appendix B. Understanding MapReduce
In December 2004, Google published the paper “MapReduce: Simplified Data Processing on Large Clusters,” by Jeffrey Dean and Sanjay Ghemawat, summarizing the authors’ solution to Google’s urgent need to simplify cluster computing. Dean and Ghemawat settled on a paradigm in which parts of a job are mapped (dispatched) to all nodes in a cluster. Each node produces a slice of the intermediate result set, and all those slices are then reduced (aggregated) into the final result.
The MapReduce paper (http://mng.bz/8s06) addresses three main problems:
- Parallelization: how to parallelize the computation
- Distribution: how to distribute the data
- Fault tolerance: how to handle component failure
The core of MapReduce deals with the programmatic resolution of those three problems, which effectively hides most of the complexity of dealing with large-scale distributed systems. This allows MapReduce to expose a minimal API consisting of only two functions: (wait for it ...) map and reduce.
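To make the shape of that two-function API concrete, the control flow it drives can be modeled as a single-process sketch. This is only an illustrative model in Python, not the paper’s actual interface; the function name `map_reduce` and its parameters are our own assumptions. Conceptually, map takes a (k1, v1) record and emits a list of intermediate (k2, v2) pairs, and reduce takes a k2 with all of its values and aggregates them:

```python
from collections import defaultdict

def map_reduce(inputs, map_fn, reduce_fn):
    """Sequential, in-memory model of the MapReduce control flow.

    inputs:    iterable of (k1, v1) records
    map_fn:    (k1, v1) -> list of intermediate (k2, v2) pairs
    reduce_fn: (k2, [v2, ...]) -> aggregated result for that key
    """
    # Map phase: every input record yields intermediate key/value pairs,
    # which are grouped by key (the role the shuffle plays in a cluster).
    intermediate = defaultdict(list)
    for k1, v1 in inputs:
        for k2, v2 in map_fn(k1, v1):
            intermediate[k2].append(v2)
    # Reduce phase: each key's values are aggregated into a final result.
    return {k2: reduce_fn(k2, values) for k2, values in intermediate.items()}
```

In a real cluster, the map calls run in parallel on the nodes holding the input data and the grouped values are shipped to reducer nodes; the sequential loop above only models the data flow, not the distribution or fault tolerance.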
One of the key insights behind MapReduce is that you shouldn’t have to move data in order to process it; instead, the program is sent to where the data resides. That is a key differentiator compared with traditional data warehouse systems and relational databases: there’s simply too much data to be moved around.
We’ll explain how MapReduce works with a simple example: the good old word count, big data’s “Hello World.”
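As a preview, the two word-count functions roughly mirror the pseudo-code in Dean and Ghemawat’s paper: map emits a (word, 1) pair for every word it sees, and reduce sums the partial counts for each word. This Python rendering (including the `wc_map`/`wc_reduce` names and the naive whitespace tokenization) is our own sketch:

```python
def wc_map(document_name, text):
    """Map: emit (word, 1) for every word in the document's text."""
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    """Reduce: sum all the partial counts emitted for one word."""
    return sum(counts)
```

The framework does the rest: it runs `wc_map` on every document, groups the emitted pairs by word during the shuffle, and hands each word with its list of counts to `wc_reduce`.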