
9 Big data is just a lot of small data: using pandas UDFs

This chapter covers

  • Using pandas Series UDFs to accelerate column transformations compared to regular Python UDFs.
  • Addressing the cold start of some UDFs with the Iterator of Series UDF.
  • Controlling batch composition in a split-apply-combine programming pattern.
  • Confidently deciding which type of pandas UDF to use.

This chapter approaches the distributed nature of PySpark a little differently. If we take a few seconds to think about it, we read data into a data frame, and Spark distributes the data across partitions on the nodes. What if we could operate directly on those partitions as if they were single-node data frames? More interestingly, what if we could control how those single-node partitions were created and used, using a tool we already know? What about pandas?

PySpark’s interoperability with pandas (colloquially called pandas UDFs) is a huge selling point when performing data analysis at scale. pandas is the dominant in-memory Python data manipulation library, while PySpark is the dominant distributed one. Combining the two unlocks additional possibilities. In this chapter, we start by scaling some basic pandas data manipulation functionality. We then look into operations on GroupedData and how PySpark and pandas implement the split-apply-combine pattern common to data analysis. We finish with the ultimate interaction between pandas and PySpark: treating a PySpark data frame like a collection of small pandas data frames.
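
To make this concrete before we dive in, here is a minimal sketch of the simplest flavor, the Series to Series UDF covered in section 9.1.2. The function receives each batch of a column as a pandas Series and returns a Series of the same length; Spark takes care of splitting the column and reassembling the results. The pandas_udf decorator and the type-hinted signature are the actual PySpark API (Spark 3.0 and later, with PyArrow installed); the data frame df and its fahrenheit column are hypothetical, used only for illustration.

import pandas as pd
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType

# Series to Series UDF: each batch of the column arrives as a
# pandas Series; we return a Series of the same length.
@F.pandas_udf(DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
    """Convert a Series of Fahrenheit temperatures to Celsius."""
    return (degrees - 32) * 5.0 / 9.0

# Hypothetical usage: assuming df has a numeric fahrenheit column.
# df.select(f_to_c(F.col("fahrenheit")).alias("celsius"))

Because the function body is plain pandas, we keep the vectorized speed of pandas within each batch while Spark handles the distribution across partitions.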

9.1 Column transformations with pandas: using Series UDFs

9.1.1 Connecting Spark to Google’s BigQuery

9.1.2 Series to Series UDF: column functions, but with pandas

9.1.3 Scalar UDF + cold start = Iterator of Series UDF

9.2 UDF on grouped data: aggregate and apply

9.2.1 Group aggregate UDF

9.2.2 Grouped map UDF

9.3 What to use when?

9.4 Summary

9.4.1 Exercise 9.1

9.4.2 Exercise 9.2

9.4.3 Exercise 9.3

9.4.4 Exercise 9.4