This chapter approaches the distributed nature of PySpark a little differently. Think about what happens when we read data into a data frame: Spark distributes the records across partitions on the nodes of the cluster. What if we could operate directly on those partitions as if each one were a single-node data frame? More interestingly, what if we could control how those single-node partitions are created and processed using a tool we already know? What about pandas?
PySpark’s interoperability with pandas (colloquially known as pandas UDFs) is a huge selling point when performing data analysis at scale. pandas is the dominant in-memory Python data manipulation library, while PySpark is the dominant distributed one. Combining the two unlocks additional possibilities. In this chapter, we start by scaling some basic pandas data manipulation functionality. We then look at operations on GroupedData and how PySpark plus pandas implement the split-apply-combine pattern common to data analysis. We finish with the ultimate interaction between pandas and PySpark: treating a PySpark data frame like a small collection of pandas DataFrames.
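As a preview of where the chapter is headed, here is a minimal sketch of the two flavors of pandas interoperability just mentioned: a Series-to-Series pandas UDF applied column-wise, and split-apply-combine via `applyInPandas` on grouped data. The data frame, column names, and the `times_two`/`demean` functions are illustrative choices, not examples from the chapter itself.

```python
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# A tiny illustrative data frame with a grouping column and a numeric column.
df = spark.createDataFrame(
    [("a", 1.0), ("a", 2.0), ("b", 3.0)], ["group", "value"]
)

# Series-to-Series pandas UDF: Spark hands each batch of the column to the
# function as a pandas Series, and expects a Series of the same length back.
@F.pandas_udf("double")
def times_two(values: pd.Series) -> pd.Series:
    return values * 2.0

df.withColumn("doubled", times_two("value")).show()

# Split-apply-combine: each group arrives as a small pandas DataFrame,
# is transformed with plain pandas code, and Spark reassembles the results.
def demean(pdf: pd.DataFrame) -> pd.DataFrame:
    return pdf.assign(value=pdf["value"] - pdf["value"].mean())

df.groupby("group").applyInPandas(
    demean, schema="group string, value double"
).show()
```

Both patterns, along with the mapping of whole partitions to pandas DataFrames, are covered in detail in the sections that follow.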