This chapter covers
- Using JSON to transfer complex data structures between Hadoop Streaming steps
- Writing mrjob scripts to interact with Hadoop without using Hadoop Streaming directly
- Thinking about mappers and reducers as key-value consumers and producers
- Analyzing web traffic logs and tennis match logs with Apache Hadoop
In chapter 7, we learned about two distributed frameworks for processing large datasets: Hadoop and Spark. In this chapter, we'll dive deeper into Hadoop, the Java-based framework of the two. As we touched on in the last chapter, Hadoop has a lot of benefits. We can use Hadoop to process
- lots of data fast—distributed parallelization
- data that’s important—low data loss
- absolutely enormous amounts of data—petabyte scale
Hadoop also has some pain points when we use it from Python:
- To use Hadoop with Python, we need to use the Hadoop Streaming utility.
- We need to repeatedly read in strings from stdin, as the sketch after this list shows.
- The error messages we get back from Java are not super helpful.
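To make those pain points concrete, here's a minimal sketch (not this chapter's actual example) of a Hadoop Streaming mapper and reducer pair in Python. The log format and the `hits` field are hypothetical; the point is that everything crosses process boundaries as strings on stdin and stdout, so anything more structured than a string has to be serialized by hand, for example as JSON.

```python
# mapper.py -- a minimal Hadoop Streaming mapper sketch.
# Hadoop Streaming hands us raw text on stdin, one record per line,
# and expects tab-separated key-value pairs on stdout.
import json
import sys

for line in sys.stdin:
    line = line.strip()
    if not line:
        continue
    # Hypothetical log format: the first field is a page path.
    page = line.split()[0]
    # Values richer than a string must be serialized by hand;
    # JSON lets the reducer recover the structure with json.loads.
    value = {"hits": 1}
    print(f"{page}\t{json.dumps(value)}")
```

```python
# reducer.py -- the matching Hadoop Streaming reducer sketch.
# Hadoop sorts the mapper output by key, so lines for the same
# page arrive together; we decode each JSON value and accumulate.
import json
import sys

current_page, total = None, 0
for line in sys.stdin:
    page, value = line.rstrip("\n").split("\t", 1)
    hits = json.loads(value)["hits"]
    if page != current_page:
        if current_page is not None:
            print(f"{current_page}\t{total}")
        current_page, total = page, 0
    total += hits
if current_page is not None:
    print(f"{current_page}\t{total}")
```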
In this chapter, we’ll look at how we can deal with those issues by working through some scenarios. We’ll analyze web traffic logs, and we’ll track the skill of tennis players over time to find the most talented players in the sport.
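For comparison, here's a minimal sketch of the same page-counting job written with mrjob, the library from the chapter-opening list. The class name and log format are illustrative, not taken from this chapter's examples; mrjob generates the Hadoop Streaming plumbing for us and, by default, serializes key-value pairs between steps as JSON.

```python
# A minimal mrjob sketch: we write the mapper and reducer as
# methods that consume and produce key-value pairs directly,
# instead of parsing stdin ourselves.
from mrjob.job import MRJob


class MRPageHits(MRJob):
    def mapper(self, _, line):
        # Hypothetical log format: the first field is a page path.
        page = line.split()[0]
        yield page, 1

    def reducer(self, page, hits):
        yield page, sum(hits)


if __name__ == "__main__":
    MRPageHits.run()
```

A job like this can be tested locally with `python mr_page_hits.py access.log` and pointed at a cluster with mrjob's `-r hadoop` runner flag.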