
9 Big data is just a lot of small data: using pandas UDFs


This chapter covers

  • Using pandas Series UDFs to accelerate column transformations compared to Python UDFs.
  • Addressing the cold start of some UDFs using the Iterator of Series UDF.
  • Controlling batch composition in a split-apply-combine programming pattern.
  • Confidently choosing the best type of pandas UDF for the job.

This chapter approaches the distributed nature of PySpark a little differently. If we take a few seconds to think about it, when we read data into a data frame, Spark distributes the data across partitions on nodes. What if we could operate directly on those partitions as if they were single-node data frames? More interestingly, what if we could control how those single-node partitions are created and used, using a tool we already know? What about pandas?
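
To preview what this looks like in practice, here is a minimal sketch using mapInPandas (available since Spark 3.0), which hands us a distributed data frame as an iterator of plain pandas data frames. The data frame df and its value column are hypothetical stand-ins, not part of the chapter's data set.

from typing import Iterator

import pandas as pd

def double_values(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    # Each batch arrives as a regular pandas DataFrame that we can
    # manipulate with the pandas API we already know.
    for batch in batches:
        batch["value"] = batch["value"] * 2
        yield batch

# Spark applies our function to each batch and reassembles the results
# into a distributed data frame with the schema we provide.
# result = df.mapInPandas(double_values, schema=df.schema)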

PySpark’s interoperability with pandas (colloquially called pandas UDFs) is a huge selling point when performing data analysis at scale. pandas is the dominant in-memory Python data manipulation library, whereas PySpark is the dominant distributed one. Combining the two unlocks additional possibilities. In this chapter, we start by scaling some basic pandas data manipulation functionality. We then look at operations on GroupedData and how PySpark and pandas implement the split-apply-combine pattern common in data analysis. We finish with the ultimate interaction between pandas and PySpark: treating a PySpark data frame as a collection of small pandas data frames.
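
As a taste of what section 9.1.2 covers, here is a minimal sketch of a Series-to-Series pandas UDF. The data frame and its temperature_c column are hypothetical stand-ins: we write an ordinary function over pandas Series, and Spark applies it to batches of the distributed column.

import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

@F.pandas_udf(T.DoubleType())
def c_to_f(degrees: pd.Series) -> pd.Series:
    # Spark passes batches of the column as pandas Series; the vectorized
    # arithmetic runs on a whole batch at once instead of row by row.
    return degrees * 9 / 5 + 32

# df.select(c_to_f(F.col("temperature_c"))).show()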

9.1 Column transformations with pandas: using Series UDFs

9.1.1 Connecting Spark to Google’s BigQuery

9.1.2 Series to Series UDF: column functions, but with pandas

9.1.3 Scalar UDF + cold start = Iterator of Series UDF

9.2 UDF on grouped data: aggregate and apply

9.2.1 Group aggregate UDF

9.2.2 Grouped map UDF

9.3 What to use when?

9.4 Summary

9.5 Additional exercises

9.5.1 Exercise 9.2

9.5.2 Exercise 9.3

9.5.3 Exercise 9.4

9.5.4 Exercise 9.5