This chapter covers
- Launching and configuring cloud compute clusters with Elastic MapReduce
- Running Hadoop jobs in the cloud with mrjob
- Distributed cloud machine learning with Spark
Throughout this book, we’ve been talking about the ability to scale code up. We started by looking at how to parallelize code locally; then we moved on to distributed computing frameworks; and finally, in chapter 11, we introduced cloud computing technologies. In this chapter, we’ll look at techniques we can use to work with data of any scale. We’ll see how to take the Hadoop and Spark frameworks we covered in the middle of the book (chapters 7 and 8 for Hadoop; chapters 7, 9, and 10 for Spark) and bring them into the cloud with Amazon Elastic MapReduce. We’ll start by looking at how to bring Hadoop into the cloud with mrjob—a framework for Hadoop and Python that we introduced in chapter 8. Then, we’ll look at bringing Spark and its machine learning capabilities into the cloud.
In chapter 8, we reviewed two methods of working with Hadoop:
- Hadoop Streaming, which runs standalone Python scripts as its mappers and reducers (see the first sketch after this list)
- mrjob, which lets us write entire Hadoop jobs using only Python code (see the second sketch)
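As a quick refresher, a Hadoop Streaming job is just a pair of executables that read from stdin and write tab-separated key-value pairs to stdout. The following word-count sketch is a minimal example of that pattern; the file names mapper.py and reducer.py are placeholders of our own choosing:

```python
# mapper.py -- emit one (word, 1) pair per word, tab-separated on stdout
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
# reducer.py -- Hadoop sorts map output by key, so identical words arrive together
import sys
from itertools import groupby

def parsed(lines):
    for line in lines:
        word, count = line.rstrip("\n").split("\t", 1)
        yield word, int(count)

for word, pairs in groupby(parsed(sys.stdin), key=lambda pair: pair[0]):
    print(f"{word}\t{sum(count for _, count in pairs)}")
```

On a cluster, Hadoop wires these scripts together through the hadoop-streaming JAR's -mapper and -reducer options.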
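With mrjob, the same job collapses into a single Python class. This minimal sketch follows mrjob's standard mapper/reducer pattern; the module name mr_word_count.py is our own:

```python
# mr_word_count.py -- a complete Hadoop job defined as one mrjob class
from mrjob.job import MRJob

class MRWordCount(MRJob):

    def mapper(self, _, line):
        # The input key is ignored; each value is one line of text.
        for word in line.split():
            yield word, 1

    def reducer(self, word, counts):
        # mrjob groups values by key before calling the reducer.
        yield word, sum(counts)

if __name__ == "__main__":
    MRWordCount.run()
```

Run it locally with python mr_word_count.py input.txt; the same file, unchanged, can target a real cluster by switching runners.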
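That runner switch is what makes mrjob a natural bridge to Elastic MapReduce. The sketch below is a minimal example of selecting the EMR runner with -r emr, assuming AWS credentials are already configured for mrjob; the S3 paths are placeholders:

```python
# run_on_emr.py -- launch the word-count job on EMR programmatically
# (assumes AWS credentials are configured; S3 paths are placeholders)
from mr_word_count import MRWordCount

job = MRWordCount(args=[
    "-r", "emr",                                # use mrjob's Elastic MapReduce runner
    "s3://my-bucket/input/",                    # read input from S3
    "--output-dir", "s3://my-bucket/output/",   # write results back to S3
])

with job.make_runner() as runner:
    runner.run()  # provisions a cluster, runs the job, and waits for it to finish
    for word, count in job.parse_output(runner.cat_output()):
        print(word, count)
```

The same class can also be launched straight from the command line with python mr_word_count.py -r emr s3://my-bucket/input/.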