Chapter 6. Applying MapReduce patterns to big data
This chapter covers
- Learning how to join data with map-side and reduce-side joins
- Understanding how a secondary sort works
- Discovering how partitioning works and how to globally sort data
With your data safely in HDFS, it’s time to learn how to work with that data in MapReduce. Previous chapters showed you some MapReduce snippets in action when working with data serialization. In this chapter we’ll look at how to work effectively with big data in MapReduce to solve common problems.
MapReduce basics
If you want to understand the mechanics of MapReduce and how to write basic MapReduce programs, it’s worth your time to read Hadoop in Action by Chuck Lam (Manning, 2010).
MapReduce contains many powerful features, but in this chapter we’ll focus on joining, sorting, and sampling. These three patterns are important because they’re natural operations you’ll want to perform on your big data, and you’ll want to squeeze as much performance as possible out of your MapReduce jobs.
The ability to join disparate and sparse data is a powerful MapReduce feature, but an awkward one in practice, so we’ll also look at advanced techniques for optimizing join operations over large datasets. Examples of joins include combining log files with reference data from a database and computing inbound links on web graphs.
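To make the join idea concrete before we dig into the details, here’s a minimal sketch of how a reduce-side join works, written in plain Java without Hadoop so it can run standalone. The data (user records and log records, keyed by a shared user ID) is hypothetical. Each “map” output tags its value with the dataset it came from, the “shuffle” groups values by join key, and the “reduce” step pairs values from the two sources:

```java
import java.util.*;

// A sketch of the reduce-side join pattern: tag values by source dataset in
// the map phase, group by join key in the shuffle, pair them in the reduce.
public class ReduceSideJoinSketch {

    public static List<String> join(String[] users, String[] logs) {
        // Simulated map + shuffle: emit (key, tagged value), grouped by key.
        // The tag ("U:" or "L:") records which dataset a value came from,
        // because the reducer sees values from both datasets mixed together.
        Map<String, List<String>> shuffled = new TreeMap<>();
        for (String u : users) {
            String[] parts = u.split(",");
            shuffled.computeIfAbsent(parts[0], k -> new ArrayList<>())
                    .add("U:" + parts[1]);
        }
        for (String l : logs) {
            String[] parts = l.split(",");
            shuffled.computeIfAbsent(parts[0], k -> new ArrayList<>())
                    .add("L:" + parts[1]);
        }

        // Simulated reduce: for each key, separate values by tag, then
        // cross user values with log values to produce joined records.
        List<String> joined = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : shuffled.entrySet()) {
            List<String> userVals = new ArrayList<>();
            List<String> logVals = new ArrayList<>();
            for (String v : e.getValue()) {
                if (v.startsWith("U:")) userVals.add(v.substring(2));
                else logVals.add(v.substring(2));
            }
            for (String uv : userVals)
                for (String lv : logVals)
                    joined.add(e.getKey() + "," + uv + "," + lv);
        }
        return joined;
    }

    public static void main(String[] args) {
        String[] users = {"u1,alice", "u2,bob"};
        String[] logs  = {"u1,login", "u1,purchase", "u2,login"};
        System.out.println(join(users, logs));
    }
}
```

In a real Hadoop job the tagging happens in two Mapper classes (one per input), the framework performs the shuffle, and the pairing happens in a Reducer; we’ll see the production-grade versions of this pattern, and when a map-side join avoids the shuffle entirely, later in the chapter.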