This chapter covers
- Options for parallel map and reduce routines in PySpark
- Convenience methods of PySpark’s RDD class for common operations
- Implementing the historic PageRank algorithm in PySpark
In chapter 7, we learned about Hadoop and Spark, two frameworks for distributed computing. In chapter 8, we dove into the weeds of Hadoop, taking a close look at how we might use it to parallelize our Python work for large datasets. In this chapter, we'll become familiar with PySpark: the Python API to Spark, a Scala-based, in-memory framework for processing large datasets.
As mentioned in chapter 7, Spark has some advantages:
- Spark can be very fast, because it keeps intermediate data in memory rather than writing it to disk between steps the way Hadoop MapReduce does.
- Spark programs use all the same map and reduce techniques we learned about in chapters 2 through 6.
- We can code our Spark programs entirely in Python, taking advantage of the thorough PySpark API, as the short sketch after this list illustrates.
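As a taste of that last point, here's a minimal sketch of a self-contained PySpark program that maps and reduces over some numbers. The app name and the `local[*]` master (which runs Spark locally on all cores) are illustrative choices, not required settings:

```python
from pyspark import SparkContext

# Ordinary Python end to end: the cluster details hide behind the
# SparkContext, and the map/reduce logic is plain Python lambdas.
sc = SparkContext("local[*]", "SumOfSquares")

numbers = sc.parallelize(range(1_000_000))      # distribute the data
total = (numbers
         .map(lambda x: x * x)                  # parallel map
         .reduce(lambda a, b: a + b))           # parallel reduce
print(total)

sc.stop()
```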
In this chapter, we'll take a look at how we can make the most of PySpark by focusing on its foundational class: the RDD, or Resilient Distributed Dataset. We'll explore the map- and reduce-like methods of the RDD that we can use to perform familiar map and reduce workflows in parallel. We'll learn about some of the RDD class's convenience methods that make our lives easier. And we'll learn all this by implementing the PageRank algorithm: the simple but elegant ranking algorithm that once formed the backbone of Google's search engine.
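To preview what those methods look like in practice, here's a hedged sketch of a word count built from the RDD's map- and reduce-like methods, plus one convenience method. The sample lines and app name are made up for illustration:

```python
from pyspark import SparkContext

sc = SparkContext("local[*]", "WordCountSketch")

lines = sc.parallelize([
    "the quick brown fox",
    "the lazy dog",
])

# flatMap, map, and reduceByKey are the RDD's map- and reduce-like
# methods; takeOrdered is a convenience method that spares us a
# manual sort-then-slice.
counts = (lines
          .flatMap(lambda line: line.split())    # one record per word
          .map(lambda word: (word, 1))           # (word, count) pairs
          .reduceByKey(lambda a, b: a + b))      # sum counts per word

print(counts.takeOrdered(2, key=lambda kv: -kv[1]))  # two most common words

sc.stop()
```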