8 Extending PySpark with Python: RDD and UDFs


This chapter covers

  • Using the RDD as a low-level, flexible data container
  • Manipulating data in the RDD with the higher-order functions map(), filter(), and reduce()
  • Promoting regular Python functions to UDFs so they run in a distributed fashion
  • Applying UDFs to local data to ease debugging

Our journey with PySpark has shown it to be a powerful and versatile data-processing tool. So far, we’ve explored many of its out-of-the-box functions and methods for manipulating data in a data frame. Recall from chapter 1 that PySpark translates our data frame manipulation code into an optimized query plan. This makes our data jobs efficient, consistent, and predictable, just like coloring within the lines. What if we need to go off-script and manipulate our data according to our own rules?

8.1 PySpark, freestyle: The RDD

8.1.1 Manipulating data the RDD way: map(), filter(), and reduce()

8.2 Using Python to extend PySpark via UDFs

8.2.1 It all starts with plain Python: Using typed Python functions
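Before any Spark machinery enters the picture, a UDF-to-be is just a Python function with type hints. As a sketch (the function name and logic here are illustrative assumptions), a typed function can be written and tested locally, exactly as the chapter bullet about easing debugging suggests:

```python
from typing import Optional

def safe_ratio(numerator: float, denominator: float) -> Optional[float]:
    """Return numerator / denominator, or None when the division is undefined."""
    if denominator == 0:
        return None
    return numerator / denominator

# Because this is plain Python, we can exercise it on local data
# before ever promoting it to a UDF.
assert safe_ratio(1.0, 4.0) == 0.25
assert safe_ratio(1.0, 0.0) is None
```

The type hints document the contract the eventual UDF must honor, including the possibility of returning None for null-like results.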

8.2.2 From Python function to UDFs using udf()

Summary

Additional exercises

Exercise 8.3

Exercise 8.4

Exercise 8.5

Exercise 8.6