concept Python UDF in category pyspark


This is an excerpt from Manning's book Data Analysis with Python and PySpark MEAP V07.

To support our new makeshift fraction type, we create a few functions that provide basic functionality. This is a perfect job for Python UDFs, and I take the opportunity to introduce the two ways PySpark enables their creation.
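As a minimal sketch of what such a function could look like, here is a plain Python function that reduces a fraction, assuming the fraction is represented as a tuple of two integers (the exact representation and function name are illustrative assumptions, not necessarily the book's code):

```python
from fractions import Fraction
from typing import Optional, Tuple

Frac = Tuple[int, int]  # assumed representation: (numerator, denominator)

def py_reduce_fraction(frac: Frac) -> Optional[Frac]:
    """Reduce a fraction represented as a 2-tuple of integers."""
    num, denom = frac
    if denom:
        answer = Fraction(num, denom)
        return answer.numerator, answer.denominator
    return None  # a zero denominator has no valid reduced form

# Registering it as a Python UDF would look like this (sketch):
# import pyspark.sql.functions as F
# import pyspark.sql.types as T
# reduce_fraction = F.udf(py_reduce_fraction, T.ArrayType(T.LongType()))
```

Returning `None` for an invalid input maps naturally to a SQL `null` once the function is wrapped as a UDF.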

Scalar UDFs are the most common type of pandas UDF. As their name indicates, they work on scalar values: for each record passed in, they return one record. They behave just like regular Python UDFs, with one key difference: Python UDFs work on one record at a time, and you express your logic through regular Python code; scalar UDFs work on one Series at a time, and you express your logic through pandas code. The difference is subtle, and it’s easier to explain visually.

In a Python UDF, when you pass Column objects to your UDF, PySpark unpacks each value, performs the computation, and then returns the value for each record in a Column object. In a scalar UDF, depicted in figure 8.5, PySpark serializes (through PyArrow) each partitioned column into a pandas Series object (pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html). You then perform the operations on the Series object directly, returning a Series of the same dimension from your UDF. From an end-user perspective, they are functionally the same. In Chapter 9, I discuss the performance implications of Python UDFs vs. scalar UDFs.

Figure 8.5. Comparing a Python UDF to a pandas scalar UDF. The former splits a column into individual records, whereas the latter breaks it into Series.
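The contrast can be sketched with the same piece of logic written both ways. The Fahrenheit-to-Celsius conversion below is an illustrative stand-in (not the book's example): the record-at-a-time version takes and returns plain Python floats, while the Series-at-a-time version takes and returns a pandas Series of the same length.

```python
import pandas as pd

# Python UDF style: one record at a time, plain Python code.
def py_fahrenheit_to_celsius(degrees: float) -> float:
    return (degrees - 32) * 5.0 / 9.0

# Scalar pandas UDF style: one Series at a time, pandas code.
# The returned Series must have the same dimension as the input.
def pd_fahrenheit_to_celsius(degrees: pd.Series) -> pd.Series:
    return (degrees - 32) * 5.0 / 9.0

# In PySpark, only the registration would differ (sketch):
# import pyspark.sql.functions as F
# import pyspark.sql.types as T
# python_udf = F.udf(py_fahrenheit_to_celsius, T.DoubleType())
# scalar_udf = F.pandas_udf(pd_fahrenheit_to_celsius, T.DoubleType())
```

Because pandas broadcasts the arithmetic over the whole Series, the two function bodies happen to be identical here; the difference lies in what PySpark hands to each one.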