Chapter 7. Processing truly big datasets with Hadoop and Spark

 

This chapter covers

  • Understanding distributed computing and when a dataset outgrows a single machine
  • Batch processing big datasets with Hadoop MapReduce
  • Using Hadoop to find high-scoring words
  • Running interactive analytics and machine learning workflows with Spark

In the previous chapters of the book, we’ve focused on developing a foundational set of programming patterns, in the map and reduce style, that allow us to scale our programs. We can use the techniques we’ve covered so far to make the most of our laptop’s hardware. I’ve shown you how to work on large datasets using techniques like map (chapter 2), reduce (chapter 5), parallelism (chapter 2), and lazy programming (chapter 4). In this chapter, we begin to look at working on big datasets beyond what our laptop can handle.
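
To see those patterns working together in one place, here is a minimal recap sketch (an illustration, not code from the earlier chapters): a parallel map with multiprocessing.Pool, lazy evaluation through the iterator that pool.imap returns, and a final reduce that collapses the results to a single value.

from functools import reduce
from multiprocessing import Pool

def square(x):
    return x * x

def add(accumulator, value):
    return accumulator + value

if __name__ == "__main__":
    with Pool() as pool:
        # pool.imap applies square across worker processes and
        # yields results lazily, one chunk at a time
        lazy_squares = pool.imap(square, range(1_000_000), chunksize=10_000)
        # reduce collapses the million results down to one sum
        total = reduce(add, lazy_squares)
    print(total)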

In this chapter, we introduce distributed computing (that is, computing that occurs on more than one computer) and two technologies we’ll use to do it: Apache Hadoop and Apache Spark. Hadoop is a set of tools that supports distributed map- and reduce-style programming through Hadoop MapReduce. Spark is an analytics toolkit designed to modernize Hadoop. We’ll use Hadoop for batch processing of big datasets and Spark for interactive analytics and machine learning use cases.
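
Before we dig into the details, it helps to see the shape of a MapReduce program. What follows is a minimal sketch of a word count written for Hadoop Streaming, which lets us supply the map and reduce steps as ordinary Python scripts that read from stdin and write tab-separated key-value pairs to stdout. The file names mapper.py and reducer.py are placeholders of my own choosing.

#!/usr/bin/env python3
# mapper.py: emit a (word, 1) pair for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")

#!/usr/bin/env python3
# reducer.py: sum the counts for each word.
# Hadoop sorts the mapper output by key before it reaches the reducer,
# so all the lines for a given word arrive together.
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rsplit("\t", 1)
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{current_count}")
        current_word, current_count = word, 0
    current_count += int(count)

if current_word is not None:
    print(f"{current_word}\t{current_count}")

You can test the pair locally with a shell pipeline that mimics Hadoop’s sort phase, cat input.txt | python3 mapper.py | sort | python3 reducer.py, and submit it to a cluster with the Hadoop Streaming jar using flags along the lines of -files mapper.py,reducer.py -mapper mapper.py -reducer reducer.py -input <in> -output <out>.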

7.1. Distributed computing

 
 
 

7.2. Hadoop for batch processing

 
 
 

7.3. Using Hadoop to find high-scoring words

 
 

7.4. Spark for interactive workflows

 
 
 

7.5. Document word scores in Spark
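
Because Spark exposes map and reduce operations directly on its distributed collections (RDDs), this kind of computation is far more compact than its Hadoop Streaming equivalent. The sketch below shows the general shape in PySpark; the score function and the input path documents.txt are hypothetical stand-ins, not the chapter’s actual scoring scheme.

from pyspark.sql import SparkSession

def score(word):
    # Hypothetical scoring function; word length stands in for a real scheme.
    return len(word)

spark = SparkSession.builder.appName("word-scores").getOrCreate()
sc = spark.sparkContext

# documents.txt is a placeholder input with one document per line.
word_scores = (
    sc.textFile("documents.txt")
      .flatMap(lambda line: line.split())      # each document becomes its words
      .map(lambda word: (word, score(word)))   # pair every word with a score
      .reduceByKey(lambda a, b: a + b)         # sum the scores for each word
)

# Pull the ten highest-scoring words back to the driver.
for word, total in word_scores.takeOrdered(10, key=lambda pair: -pair[1]):
    print(word, total)

spark.stop()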

 
 
 

7.6. Exercises

 
 

Summary

 
 