Chapter 8. Best practices for large data with Hadoop Streaming and mrjob

 

This chapter covers

  • Using JSON to transfer complex data structures between Hadoop Streaming steps
  • Writing mrjob scripts that interact with Hadoop without Hadoop Streaming boilerplate
  • Thinking about mappers and reducers as key-value consumers and producers
  • Analyzing web traffic logs and tennis match logs with Apache Hadoop

In chapter 7, we learned about two distributed frameworks for processing large datasets: Hadoop and Spark. In this chapter, we’ll dive deep into Hadoop, the Java-based framework for processing large datasets. As we touched on in the last chapter, Hadoop has a lot of benefits. We can use Hadoop to process

  • lots of data fast—distributed parallelization
  • data that’s important—low data loss
  • absolutely enormous amounts of data—petabyte scale

Unfortunately, we also saw some drawbacks to working with Hadoop:

  • To use Hadoop with Python, we need to use the Hadoop Streaming utility.
  • We need to repeatedly read and parse raw strings from stdin.
  • The error messages Java produces are not very helpful.
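The first two drawbacks can be sketched concretely. The snippet below is a minimal illustration, not code from this chapter: the `host path` log format, the function names, and the hit-count logic are all made up for the example. The key idea it shows is the one from the chapter-covers list: because Hadoop Streaming only moves strings over stdin and stdout, we JSON-encode values so complex data structures survive the trip between steps.

```python
import json

def map_line(line):
    # Hadoop Streaming delivers raw strings on stdin, so the mapper must
    # parse each line itself. The "host path" log format here is hypothetical.
    host, path = line.strip().split(" ", 1)
    # JSON-encode the value so structured data survives the text-only interface.
    return host, json.dumps({"path": path, "hits": 1})

def run_mapper(lines):
    # Emit tab-separated key-value pairs, one per input line, the way a
    # Streaming mapper writes them to stdout.
    return ["\t".join(map_line(line)) for line in lines]

def reduce_values(key, json_values):
    # The reducer gets the values back as strings and must decode the JSON.
    total = sum(json.loads(v)["hits"] for v in json_values)
    return key, total
```

On a real cluster the mapper would loop over `sys.stdin` and `print` each pair; `run_mapper` takes an iterable instead so the same logic is easy to test locally.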

In this chapter, we’ll look at how we can deal with those issues by working through some scenarios. We’ll analyze web traffic logs, and then we’ll analyze the skill of tennis players over time to find the most talented players in the sport.

8.1. Unstructured data: Logs and documents

 
 

8.2. Tennis analytics with Hadoop

 
 

8.3. mrjob for Pythonic Hadoop streaming

 

8.4. Tennis match analysis with mrjob

 
 
 

8.5. Exercises

 
 
 
 

Summary

 
 
 