Chapter 15. Big data and MapReduce


This chapter covers

  • MapReduce
  • Using Python with Hadoop Streaming
  • Automating MapReduce with mrjob
  • Training support vector machines in parallel with the Pegasos algorithm

I often hear “Your examples are nice, but my data is big, man!” I have no doubt that you work with data sets larger than the examples used in this book. With so many devices connected to the internet, and so many people interested in making data-driven decisions, the amount of data we’re collecting has outpaced our ability to process it on a single machine. Fortunately, a number of open source projects allow us to process large amounts of data. One of these, Hadoop, is a Java framework for distributing data processing across multiple machines.

Imagine for a second that you work for a store that sells items on the internet, and you get many visitors. Some purchase items, while others leave without buying anything. You’d like to identify the visitors who make purchases. How do you do this? You can look at the web server logs and see what pages each person visited. If other actions are recorded as well, you can train a classifier on them. The only problem is that this data set may be huge, and training the classifier on a single machine may take multiple days. This chapter will show you some tools you can use to solve a problem like this: Hadoop and some Python tools built on top of Hadoop.

15.1. MapReduce: a framework for distributed computing

15.2. Hadoop Streaming

15.3. Running Hadoop jobs on Amazon Web Services

15.4. Machine learning in MapReduce

15.5. Using mrjob to automate MapReduce in Python

15.6. Example: the Pegasos algorithm for distributed SVMs

15.7. Do you really need MapReduce?

15.8. Summary