Chapter 12. MapReduce in the cloud with Amazon’s Elastic MapReduce

 

This chapter covers

  • Launching and configuring cloud compute clusters with Elastic MapReduce
  • Running Hadoop jobs in the cloud with mrjob
  • Distributed cloud machine learning with Spark

Throughout this book, we’ve been talking about how to scale code up. We started by parallelizing code locally, then moved on to distributed computing frameworks, and finally, in chapter 11, introduced cloud computing technologies. In this chapter, we’ll look at techniques we can use to work with data of any scale. We’ll take the Hadoop and Spark frameworks we covered in the middle of the book (chapters 7 and 8 for Hadoop; chapters 7, 9, and 10 for Spark) and bring them into the cloud with Amazon Elastic MapReduce (EMR). We’ll start by bringing Hadoop into the cloud with mrjob, the framework for Hadoop and Python that we introduced in chapter 8. Then we’ll bring Spark and its machine learning capabilities into the cloud.

12.1. Running Hadoop on EMR with mrjob

In chapter 8, we reviewed two methods of working with Hadoop:

  1. Hadoop Streaming, which uses Python scripts for its mappers and reducers
  2. mrjob, which lets us write and run Hadoop jobs using only Python code
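
As a reminder of what both methods have in common, here is a minimal sketch of the mapper/reducer contract, written in plain Python so it runs without a cluster. The word-count task and the names `mapper`, `reducer`, and `run_local` are illustrative assumptions, not code from chapter 8, and `sorted()` only stands in for Hadoop’s shuffle-and-sort step.

```python
from itertools import groupby


def mapper(lines):
    """Emit tab-separated "word<TAB>1" records, as a streaming mapper would print them."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts for each key.

    Assumes records arrive sorted by key, which is what Hadoop
    provides between the map and reduce phases.
    """
    parsed = (record.split("\t") for record in sorted_records)
    for word, group in groupby(parsed, key=lambda pair: pair[0]):
        yield word, sum(int(count) for _, count in group)


def run_local(lines):
    """Run map, sort, and reduce in one process (hypothetical local stand-in for a cluster)."""
    return dict(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    print(run_local(["the cat sat", "The dog sat"]))
```

With Hadoop Streaming, the mapper and reducer would typically live in separate scripts that read from standard input and write to standard output. With mrjob, the same two steps become methods on a single job class.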

12.2. Machine learning in the cloud with Spark on EMR

 
 
 

12.3. Exercises

 

Summary

 
 
 
 