8 Best practices for large data with Hadoop Streaming and mrjob
This chapter covers
- Using JSON to transfer complex data structures between Hadoop Streaming steps (sketched briefly after this list)
- Writing mrjob scripts that interact with Hadoop without handling Hadoop Streaming directly
- Thinking about mappers and reducers as consumers and producers of key-value pairs
- Analyzing web traffic logs and tennis match logs with Apache Hadoop
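As a quick preview of the first point above, here is a minimal sketch of how a Hadoop Streaming mapper written in Python can emit a complex value as JSON so that a later step can parse it back. The input format and field names (player, tournament, rating) are hypothetical, chosen only for illustration.

```python
import json
import sys

# Hypothetical Hadoop Streaming mapper: Hadoop pipes each input record to us
# as a line on stdin and expects tab-separated key/value pairs on stdout.
# Serializing the value as JSON lets the next step recover the full structure.
for line in sys.stdin:
    player, tournament, rating = line.strip().split(",")
    value = {"tournament": tournament, "rating": float(rating)}
    print(f"{player}\t{json.dumps(value)}")
```

A downstream reducer can call json.loads on the value half of each line to get the dictionary back, instead of re-parsing an ad hoc string format.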
In chapter 7, we learned about two distributed frameworks for processing large datasets: Hadoop and Spark. In this chapter, we’ll dive deeper into Hadoop, the Java-based large dataset processing framework. As we touched on in the last chapter, Hadoop has a lot of benefits:
- We can use Hadoop to process lots of data fast, thanks to distributed parallelization
- We can use Hadoop to process data that’s important, because it offers a low risk of data loss
- We can use Hadoop to process absolutely enormous amounts of data, up to petabyte scale
Unfortunately, in the last chapter we also saw some drawbacks to working with Hadoop:
- To use Hadoop with Python, we need to go through the Hadoop Streaming utility
- We need to repeatedly read strings in from stdin and parse them ourselves (the sketch after this list shows how mrjob hides that boilerplate)
- The Java error messages Hadoop produces aren’t very helpful when our Python code fails
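To make these drawbacks concrete, here is a minimal mrjob sketch, with a made-up job name and a generic sum-by-key task rather than anything from this chapter’s scenarios. mrjob lets us write the mapper and reducer as methods that yield key-value pairs, and it handles the stdin parsing and Hadoop Streaming plumbing for us.

```python
from mrjob.job import MRJob

# A minimal mrjob job that sums a numeric value per key. mrjob reads the
# lines from stdin, groups values by key, and talks to Hadoop Streaming,
# so the mapper and reducer only deal with key-value pairs.
class MRTotalByKey(MRJob):

    def mapper(self, _, line):
        # Assumes each input line looks like "key<TAB>number" (illustrative).
        key, value = line.split("\t", 1)
        yield key, float(value)

    def reducer(self, key, values):
        yield key, sum(values)


if __name__ == "__main__":
    MRTotalByKey.run()
```

Run without any extra flags (for example, python total_by_key.py input.txt), mrjob executes the job locally, which makes debugging far friendlier than sifting through Java stack traces; adding -r hadoop sends the same script to a Hadoop cluster.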
In this chapter, we’ll look at how we can deal with those issues by working through two scenarios. In the first scenario, we’ll analyze the skill of tennis players over time and find the most talented players in the sport. In the second scenario, we’ll analyze web traffic logs to find out when visitors come to our website most often and which pages are causing problems for them.