Chapter 4. Writing basic MapReduce programs

 

This chapter covers

  • Patent data as an example data set to process with Hadoop
  • Skeleton of a MapReduce program
  • Basic MapReduce programs to count statistics
  • Hadoop’s Streaming API for writing MapReduce programs using scripting languages
  • Combiner to improve performance

The MapReduce programming model is unlike most programming models you may have learned. It’ll take some time and practice to gain familiarity. To help develop your proficiency, we go through many example programs in the next couple of chapters. These examples illustrate various MapReduce programming techniques. By applying MapReduce in multiple ways you’ll start to develop an intuition and a habit of “MapReduce thinking.” The examples range from simple tasks to advanced uses. In one of the advanced applications we introduce the Bloom filter, a data structure not normally taught in the standard computer science curriculum. You’ll see that processing large data sets, whether you’re using Hadoop or not, often requires rethinking the underlying algorithms.

We assume you already have a basic grasp of Hadoop: you can set up Hadoop, and you have compiled and run an example program, such as the word counting example from chapter 1. Let’s work through examples based on a real-world data set.

4.1. Getting the patent data set

 
 
 

4.2. Constructing the basic template of a MapReduce program

 
 

4.3. Counting things

 
 
 
 

4.4. Adapting for Hadoop’s API changes

 
 
 

4.5. Streaming in Hadoop

 
 
 

4.6. Improving performance with combiners

 
 
 

4.7. Exercising what you’ve learned

 
 
 
 

4.8. Summary

 
 
 
 

4.9. Further resources

 
 
 