This chapter covers:
- How to use the RDD as a low-level, flexible data container.
- How to promote regular Python functions to UDFs so they run in a distributed fashion.
- How to use scalar UDFs as an alternative to Python UDFs, using pandas' API.
- How to use grouped map and grouped aggregate UDFs on GroupedData objects to split data frame computation into manageable chunks.
- How to apply UDFs on local data to ease debugging.
Our journey so far has proven that PySpark is a powerful and versatile data processing tool. We’ve explored many out-of-the-box functions and methods to manipulate data in a data frame. PySpark’s data frame manipulation functionality takes our Python code and applies an optimized query plan, as introduced in Chapter 1. This makes our data jobs efficient, consistent, and predictable, just like coloring within the lines. What if we need to go off-script and manipulate our data according to our own rules?
In this chapter, I cover how we can build Python functions and scale them in PySpark. I start by introducing the resilient distributed dataset (or RDD), a more primitive and lower-level structure compared to the data frame. I explain how you manipulate data in an RDD and how its element- (or row-) major nature complements the data frame's column-major approach.
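To make that contrast concrete before we dive in, here is a minimal sketch of element-major manipulation on an RDD. It assumes a local `SparkSession` bound to a variable named `spark`; the variable name and sample values are illustrative, not from the chapter.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# parallelize() promotes a local Python list into an RDD.
numbers = spark.sparkContext.parallelize([1, 2, 4, 7, 9])

# map() applies a plain Python function to each element independently:
# we reason about one value at a time, not about a whole column.
doubled = numbers.map(lambda x: x * 2)

print(doubled.collect())  # [2, 4, 8, 14, 18]
```

Where a data frame operation would express "multiply this column by two," the RDD version expresses "apply this function to each element," which is exactly the flexibility we lean on when no built-in column function fits.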