8 Best practices for large data with Hadoop Streaming and mrjob

 

This chapter covers

  • Using JSON to transfer complex data structures between Hadoop Streaming steps
  • Writing mrjob scripts to interact with Hadoop without writing Hadoop Streaming scripts by hand
  • Thinking about mappers and reducers as key-value consumers and producers
  • Analyzing web traffic logs and tennis match logs with Apache Hadoop

In chapter 7, we learned about two distributed frameworks for processing large datasets: Hadoop and Spark. In this chapter, we’ll dive deeper into Hadoop, the Java-based framework for processing large datasets. As we touched on in the last chapter, Hadoop has a lot of benefits.

  • We can use Hadoop to process lots of data fast, thanks to distributed parallelization
  • We can use Hadoop to process important data safely, with a low risk of data loss
  • We can use Hadoop to process absolutely enormous amounts of data, at petabyte scale

Unfortunately, in the last chapter we also saw some drawbacks to working with Hadoop:

  • To use Hadoop with Python, we need to use the Hadoop Streaming utility
  • We need to repeatedly read in and parse raw strings from stdin (a minimal sketch of this pattern follows this list)
  • The Java error messages are not very helpful
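
To see what those first two drawbacks look like in practice, here is a minimal sketch of a hand-written Hadoop Streaming mapper in Python. The comma-separated log format and the field names are hypothetical; the point is that every step has to parse raw strings from stdin itself and serialize its output (here as JSON) for the step that follows.

#!/usr/bin/env python3
"""A minimal Hadoop Streaming mapper sketch (hypothetical log format).

Hadoop Streaming pipes each input line to this script on stdin; we emit
tab-separated key-value pairs on stdout for the reducer that follows.
"""
import json
import sys

for line in sys.stdin:
    fields = line.strip().split(",")    # parse the raw string ourselves
    if len(fields) < 3:
        continue                        # skip malformed lines
    page, status, response_ms = fields[:3]
    # Serializing the value as JSON lets us hand a structured record
    # to the next step instead of a flat string
    value = json.dumps({"status": status, "ms": float(response_ms)})
    print(page + "\t" + value)

Every mapper and reducer in a plain Streaming job repeats this kind of stdin/stdout boilerplate, which is exactly what mrjob will tidy up for us later in the chapter.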

In this chapter, we’ll look at how we can deal with those issues by working through two scenarios. In the first scenario, we’ll analyze the skill of tennis players over time and find the most talented players in the sport. In the second scenario, we’ll analyze web traffic to find out when visitors come to our website most often and which pages are causing problems for them.
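
Before we get into those scenarios, here is a minimal sketch of the mrjob style we’ll be working toward. The class name, the comma-separated input format, and the choice of counting by the first field are all hypothetical; what matters is that the mapper and reducer are ordinary Python methods that consume and produce key-value pairs, while mrjob takes care of reading stdin and serializing values between steps.

from mrjob.job import MRJob

class MRCountByKey(MRJob):
    """A minimal mrjob sketch: count occurrences of the first field per line."""

    def mapper(self, _, line):
        # mrjob hands the mapper one raw input line at a time;
        # we yield key-value pairs rather than printing to stdout
        fields = line.strip().split(",")
        if fields and fields[0]:
            yield fields[0], 1

    def reducer(self, key, counts):
        # all values for a given key arrive together; we yield a summary pair
        yield key, sum(counts)

if __name__ == "__main__":
    MRCountByKey.run()

A job like this can be run locally (for example, python count_by_key.py some_input.txt, with hypothetical file names) or sent to a Hadoop cluster with mrjob’s -r hadoop runner option.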

8.1   Unstructured data: logs and documents

8.2   Tennis analytics with Hadoop

8.2.1   A mapper for reading match data

8.2.2   A reducer for calculating tennis player ratings

8.3   mrjob for Pythonic Hadoop streaming

8.3.1   The Pythonic structure of a mrjob job

8.3.2   Counting errors with mrjob

8.4   Tennis match analysis with mrjob

8.4.1   Counting Serena’s dominance by court type

8.4.2   Sibling rivalry for the ages

8.5   Summary