This chapter covers
- Using the RDD as a low-level, flexible data container.
- Manipulating data in the RDD using higher-order functions.
- Promoting regular Python functions to user-defined functions (UDFs) so they run in a distributed fashion.
- Applying UDFs to local data to ease debugging.
Our journey with PySpark so far has shown that it is a powerful and versatile data processing tool. We’ve explored many out-of-the-box functions and methods for manipulating data in a data frame. Recall from chapter 1 that PySpark’s data frame manipulation functionality takes our Python code and applies an optimized query plan when executing it. This makes our data jobs efficient, consistent, and predictable, just like coloring within the lines. But what if we need to go off-script and manipulate our data according to our own rules?