Spark is clearly about transforming data, and I barely scratched the surface in the first two parts. It’s about time to do some heavy data lifting.
You’ll start working with SQL in chapter 11. SQL is not only the de facto standard for manipulating data, but also the lingua franca of data engineers and data scientists. It seems that SQL has always been around, and it will likely be around for a long time. Adding SQL support was a smart move by the creators of Spark. Let’s dive into it and understand how it works.
Chapter 12 will teach you how to perform transformations. After an explanation of what transformations are, you’ll start by performing record-level transformations and learning the classic process of data discovery, data mapping, application engineering, execution, and review. I believe that innovation comes from the intersection of culture, science, and art. The same applies to data: it is by joining datasets that you will gain deeper insights. Joins will therefore be an important part of this chapter.