This chapter covers
- Understanding distributed computing across more than one computer
- Using Apache Hadoop for map and reduce style batch processing of big datasets
- Using Apache Spark for analytics and machine learning use cases
In the previous chapters of this book, we've focused on developing a foundational set of programming patterns, in the map and reduce style, that let us scale our programs. The techniques we've covered so far let us make the most of our laptop's hardware: I've shown you how to work on large datasets using techniques like map (chapter 2), reduce (chapter 5), parallelism (chapter 2), and lazy programming (chapter 4). In this chapter, we begin to look at working with big datasets beyond our laptop.
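Before we leave the laptop, here's a minimal sketch that ties those patterns together; the numbers, the chunk size, and the square function are made up for illustration rather than taken from earlier chapters. It uses a lazy sequence, a parallel map, and a reduce to boil everything down to one value:

```python
from functools import reduce
from multiprocessing import Pool

def square(x):
    # Stand-in for whatever per-record work a real job would do
    return x * x

if __name__ == "__main__":
    numbers = range(1_000_000)  # lazy sequence: values are produced on demand
    with Pool() as pool:
        # Parallel, lazy map across all of the laptop's cores
        squares = pool.imap(square, numbers, chunksize=10_000)
        # Reduce the mapped values to a single summary statistic
        total = reduce(lambda acc, x: acc + x, squares, 0)
    print(total)
```

Everything here still runs on a single machine's cores, which is exactly the ceiling this chapter moves past.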
In this chapter, we introduce distributed computing, that is, computing that occurs on more than one computer, along with two technologies we'll use to do it: Apache Hadoop and Apache Spark. Hadoop is a set of tools that supports distributed map and reduce style programming through Hadoop MapReduce. Spark is an analytics toolkit designed to modernize Hadoop. We'll focus on Hadoop for batch processing of big datasets and on applying Spark to analytics and machine learning use cases.
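As a first taste of that style, here's a hedged sketch of a word count written against Spark's Python API (PySpark), run locally rather than on a cluster; the file name sample.txt and the application name are placeholders, not part of this chapter's worked examples:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "preview-word-count")  # local mode, using all cores

counts = (
    sc.textFile("sample.txt")              # one record per line of text
      .flatMap(lambda line: line.split())  # map each line to its words
      .map(lambda word: (word, 1))         # pair each word with a count of 1
      .reduceByKey(lambda a, b: a + b)     # reduce: sum the counts per word
)

print(counts.take(10))  # peek at ten (word, count) pairs
sc.stop()
```

The flatMap, map, and reduceByKey steps are the same map and reduce thinking we've been using all along; Spark's job is to spread that work across many machines instead of one.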