Chapter 7. Cookbook
This chapter covers
- Passing custom parameters to tasks
- Retrieving task-specific information
- Creating multiple outputs
- Interfacing with relational databases
- Making output globally sorted
So far this book has covered the core techniques for writing a MapReduce program. Hadoop is a large framework that supports far more functionality than those core techniques. In this age of Bing and Google, you can look up specialized MapReduce techniques easily enough, so we don’t try to be an encyclopedic reference. In our own work and in discussions with other Hadoop users, we’ve found a number of techniques to be generally useful, such as using a standard relational database as the input to or output of a MapReduce job. We’ve collected some of our favorite “recipes” in this cookbook chapter.
In writing your Mapper and Reducer, you often want to make certain aspects configurable. For example, our joining program in chapter 5 is hardcoded to take the first data column as the join key. The program would be more generally applicable if the user could specify the join-key column at run time. Hadoop itself uses a configuration object to store all the configuration properties of a job, and you can use that same object to pass parameters to your Mapper and Reducer, as the sketch below shows.
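As a minimal sketch, here is how a driver might pass a join-key column through the configuration and how a mapper reads it back in `setup()`. This assumes the newer `org.apache.hadoop.mapreduce` API; the property name `join.key.column` and the class names are hypothetical, not taken from the chapter 5 program.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper that reads the join-key column index from the job
// configuration rather than hardcoding column 0.
public class ConfigurableJoinMapper
        extends Mapper<Object, Text, Text, Text> {

    private int joinKeyColumn;

    @Override
    protected void setup(Context context) {
        // Read the user-supplied property; fall back to the first column.
        Configuration conf = context.getConfiguration();
        joinKeyColumn = conf.getInt("join.key.column", 0);
    }

    @Override
    protected void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        // Split the record on commas and emit the chosen column as the key.
        String[] columns = value.toString().split(",");
        if (joinKeyColumn < columns.length) {
            context.write(new Text(columns[joinKeyColumn]), value);
        }
    }
}
```

The driver sets the property on the configuration before submitting the job:

```java
Configuration conf = new Configuration();
conf.setInt("join.key.column", 2);  // join on the third column (zero-based)
Job job = Job.getInstance(conf, "configurable join");
job.setMapperClass(ConfigurableJoinMapper.class);
```

If the driver runs through ToolRunner, GenericOptionsParser also lets users set the same property from the command line with `-D join.key.column=2`, so changing the join key requires no recompilation.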