Part 2. Get proficient: Translate your ideas into code

With two different kinds of programs under your belt, it’s time to expand your horizons. Part 2 is about diversifying your set of tools so that no data set will hold any secrets from you.

Chapter 6 breaks the rows-and-columns mold to go multidimensional. Through JSON data, we build data frames that contain data frames themselves. This tool extends the versatility of the Spark data frame far beyond flat, tabular data.

Chapter 7 brings PySpark and SQL together. Combined, they unlock a new level of expressiveness and succinctness in your code, allow you to scale SQL workflows at record speed, and provide a new way to reason about your analyses.

Chapters 8 and 9 cover going full Python with your PySpark code. From the resilient distributed data set, a flexible and scalable data structure, to two flavors of UDF using Python and pandas, you’ll turbocharge your capabilities with full confidence.

Chapter 10 provides a new angle on your data through the introduction of window functions. Window functions are one of those things that make ordered data so much easier to work with that you’ll wonder how anyone can do without them.

Finally, chapter 11 takes a break from all that coding to reflect on Spark’s execution model. You’ll check under the hood through the Spark UI and better understand how your instructions are being processed by the engine.