Chapter 5. Advanced MapReduce

This chapter covers

Chaining multiple MapReduce jobs
Performing joins of multiple data sets
Creating Bloom filters

As your data processing becomes more complex, you’ll want to exploit more of Hadoop’s features. This chapter focuses on some of these more advanced techniques.

When handling advanced data processing, you’ll often find that you can’t program the process into a single MapReduce job. Hadoop supports chaining MapReduce programs together to form a bigger job. You’ll also find that advanced data processing often involves more than one data set. We’ll explore various joining techniques in Hadoop for simultaneously processing multiple data sets. You can code certain data processing tasks more efficiently when processing a group of records at a time. We’ve seen how Streaming natively supports the ability to process a whole split at a time, and the Streaming implementation of the maximum function takes advantage of this ability. We’ll see that the same is true for Java programs. We’ll discover the Bloom filter and implement it with a mapper that keeps state information across records.
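As a preview of the last technique, here is a minimal sketch of a Bloom filter in plain Java, with no Hadoop dependencies. The class name `SimpleBloomFilter`, its method names, and the double-hashing scheme are illustrative assumptions, not the implementation developed later in the chapter. The sketch shows the kind of state a mapper could accumulate across the records of its split, and how filters built separately could be merged.

```java
import java.util.BitSet;

// Hypothetical illustration: a small Bloom filter over String keys.
public class SimpleBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public SimpleBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th bit position from two base hashes (double hashing).
    // This hashing scheme is an assumption chosen for brevity.
    private int index(String key, int i) {
        int h1 = key.hashCode();
        int h2 = (h1 * 0x9E3779B9) | 1; // second hash, forced odd
        return Math.floorMod(h1 + i * h2, numBits);
    }

    // Record a key by setting all of its bit positions.
    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    // Returns false only if the key was definitely never added;
    // true means "possibly added" (false positives can occur).
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    // Merge another filter of the same shape with a bitwise OR.
    public void union(SimpleBloomFilter other) {
        if (other.numBits != numBits || other.numHashes != numHashes) {
            throw new IllegalArgumentException("filters must have the same shape");
        }
        bits.or(other.bits);
    }
}
```

Under these assumptions, each mapper would call `add` once per record and hold on to the filter until its split is exhausted, and a later step would combine the per-split filters with `union`. A filter never reports a key it has seen as absent, but it may report an unseen key as possibly present.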

5.1. Chaining MapReduce jobs

5.2. Joining data from different sources

5.3. Creating a Bloom filter

5.4. Exercising what you’ve learned

5.5. Summary

5.6. Further resources
