With two different kinds of programs under your belt, it’s time to expand your horizons. Part 2 is about diversifying your set of tools so that no data set will hold any secrets from you.
Chapter 6 breaks the rows-and-columns mold to go multidimensional. Through JSON data, we build data frames that themselves contain data frames. This tool catapults the versatility of the Spark data frame to completely new heights.
Chapter 7 introduces PySpark and SQL side by side. Together, they unlock a new level of expressiveness and succinctness in your code, let you scale SQL workflows at record speed, and provide a new way to reason about your analyses.
Chapters 8 and 9 cover going full Python with your PySpark code. From the resilient distributed dataset (RDD), a flexible and scalable data structure, to two flavors of user-defined functions (UDFs) using Python and pandas, you’ll turbocharge your capabilities with full confidence.