7 Processing truly big datasets with Hadoop and Spark

 

This chapter covers

  • Recognizing when a problem calls for distributed computing
  • Batch processing big datasets with Hadoop MapReduce and Hadoop Streaming
  • Using Spark for interactive analytics and machine learning workflows
  • Mixing Python and Spark with PySpark

In the previous chapters of this book, we focused on developing a foundational set of programming patterns, in the map and reduce style, that let us scale our programs. The techniques we've covered so far make the most of our laptop's hardware: I've shown you how to work on large datasets using map (chapter 2), parallelism (chapter 2), lazy programming (chapter 4), and reduce (chapter 5). In this chapter, we begin to look at working with datasets too big for a single laptop.
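As a quick recap before we scale beyond one machine, the sketch below combines those patterns on a single laptop: a parallel map over the inputs followed by an N-to-1 reduce over the results. The square and add helpers and the input range are hypothetical stand-ins, not examples from earlier chapters.

# recap.py: a single-machine map + parallelism + reduce sketch (hypothetical helpers)
from functools import reduce
from multiprocessing import Pool

def square(x):
    # The "map" step: transform each item independently.
    return x * x

def add(acc, value):
    # The "reduce" step: accumulate the transformed items into one result.
    return acc + value

if __name__ == "__main__":
    values = range(1_000_000)
    with Pool() as pool:
        squared = pool.map(square, values)   # parallel map across CPU cores (chapter 2)
    total = reduce(add, squared, 0)          # N-to-1 reduction (chapter 5)
    print(total)

Everything in this sketch still runs on one machine; the rest of the chapter is about what to do when even that isn't enough.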

In this chapter we introduce distributed computing, that is, computing that occurs on more than one computer, along with two technologies we'll use to do it: Apache Hadoop and Apache Spark. Hadoop is a set of tools that supports distributed map- and reduce-style programming through Hadoop MapReduce. Spark is an analytics toolkit designed to modernize Hadoop. We'll use Hadoop for batch processing of big datasets and apply Spark to analytics and machine learning use cases.
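To preview the map and reduce style we'll use with Hadoop Streaming in section 7.3, here is a minimal, hypothetical pair of Python scripts that count words; the chapter's own example scores words rather than counting them, and the script names are placeholders. Hadoop Streaming runs programs like these across the cluster, feeding each one data on standard input and collecting whatever it writes to standard output.

# mapper.py (hypothetical): emit one "word<TAB>1" pair per word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")

# reducer.py (hypothetical): sum the counts for each word.
# Hadoop sorts the mapper output by key, so all pairs for a word arrive together.
import sys

current_word = None
current_count = 0
for line in sys.stdin:
    word, count = line.rstrip("\n").split("\t")
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word = word
        current_count = int(count)
if current_word is not None:
    print(f"{current_word}\t{current_count}")

For comparison, the same count fits in a few lines of PySpark, which we'll come back to in sections 7.4 and 7.5; the application name and file paths below are placeholders.

# word_count_spark.py (hypothetical): the same count expressed with PySpark
from pyspark import SparkContext

sc = SparkContext(appName="word_count_preview")   # placeholder app name
counts = (sc.textFile("input.txt")                # placeholder input path
            .flatMap(lambda line: line.split())
            .map(lambda word: (word, 1))
            .reduceByKey(lambda a, b: a + b))
counts.saveAsTextFile("word_counts")              # placeholder output directory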

7.1   Distributed computing

7.2   Hadoop for batch processing

7.2.1   Getting to know the four Hadoop modules

7.3   Using Hadoop to find high scoring words

7.3.1   MapReduce jobs using Python and Hadoop Streaming

7.3.2   Scoring words using Hadoop Streaming

7.4   Spark for interactive workflows

7.4.1   Big datasets in memory with Spark

7.4.2   PySpark for mixing Python and Spark

7.4.3   Enterprise data analytics with Spark SQL

7.4.4   Columns of data with Spark DataFrame

7.5   Document word scores in Spark

7.5.1   Setting up Spark

7.5.2   MapReduce Spark jobs with spark-submit

7.6   Summary