
Part 3. Transforming your data

Spark is clearly about transforming data, and I barely scratched the surface in the first two parts. It’s about time to do some heavy data lifting.

You’ll start working with SQL in chapter 11. SQL is not only the de facto standard for manipulating data, but also the lingua franca of all data engineers and data scientists. It seems that SQL has always been around and will clearly be around for a long time. Adding SQL support was a smart move by the creators of Spark. Let’s dive into it and understand how it works.

Chapter 12 will teach you how to perform transformations. After an explanation of what transformations are, you’ll start by performing record transformations and understanding the classic process of data discovery, data mapping, application engineering, execution, and review. I believe that innovation comes from the intersection of culture, science, and art. The same applies to data: it is by joining datasets that you will gain deeper insight. Joining datasets will be an important part of chapter 12.

Just as chapter 12 covers individual records, chapter 13 explores transformation at the document level. You will also build nested documents. Static functions are very helpful in all transformations, and you will learn more about them.

Chapter 14 is all about extending Spark by using user-defined functions. Spark is not a finite system; it can be extended, and you can leverage your existing work.