This chapter covers
- Using pandas Series UDFs to accelerate column transformations compared to Python UDFs.
- Addressing the cold start of some UDFs with the Iterator of Series UDF.
- Controlling batch composition in the split-apply-combine programming pattern.
- Confidently choosing the best pandas UDF for the job.
This chapter approaches the distributed nature of PySpark a little differently. If we take a few seconds to think about it: we read data into a data frame, and Spark distributes the data across partitions on nodes. What if we could operate directly on those partitions as if they were single-node data frames? More interestingly, what if we could control how those single-node partitions were created and used, with a tool we already know? What about pandas?
PySpark’s interoperability with pandas (also colloquially called pandas UDFs) is a huge selling point when performing data analysis at scale. pandas is the dominant in-memory Python data manipulation library, while PySpark is the dominant distributed one. Combining the two unlocks additional possibilities: in this chapter, we start by scaling some basic pandas data manipulation functionality. We then look into operations on GroupedData and how PySpark and pandas implement the split-apply-combine pattern common to data analysis. We finish with the ultimate interaction between pandas and PySpark: treating a PySpark data frame like a small collection of pandas data frames.
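As a first taste of what this looks like, here is a minimal sketch of a Series-to-Series pandas UDF. It assumes Spark 3.x and an active SparkSession named `spark`; the temperature-conversion function and column names are illustrative placeholders, not examples from this chapter.

```python
import pandas as pd
import pyspark.sql.functions as F
import pyspark.sql.types as T

# A Series-to-Series pandas UDF: Spark hands the function a pandas
# Series for each batch of rows and expects a Series of the same
# length back.
@F.pandas_udf(T.DoubleType())
def f_to_c(degrees: pd.Series) -> pd.Series:
    """Convert a Series of temperatures from Fahrenheit to Celsius."""
    return (degrees - 32) * 5.0 / 9.0

# Hypothetical data frame, just to show the UDF applied like any
# other column function.
df = spark.createDataFrame([(32.0,), (212.0,)], ["temp_f"])
df.select(f_to_c(F.col("temp_f")).alias("temp_c")).show()
```

Note that the function body is plain pandas code: Spark handles splitting the column into batches, shipping them to the workers, and reassembling the results into a distributed column.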