In chapter 4, we looked at how we can transform a data frame by selecting, dropping, creating, renaming, reordering, and summarizing columns. Those operations form the foundation for working with a data frame in PySpark. In this chapter, I complete the review of the most common data frame operations: linking (or joining) data frames, and grouping data (and performing operations on the resulting GroupedData object). We conclude the chapter by wrapping our exploratory program into a single script we can submit, just as we did in chapter 3. The skills covered here round out the set of fundamental operations you will use in your day-to-day work transforming data.
We pick up the same logs data frames we left off with in chapter 4. In practical terms, this chapter's code enriches our table with the relevant information contained in the link tables and then summarizes it into relevant groups, using what can be considered a more powerful version of the describe() method I showed in chapter 4. If you want to catch up with a minimal amount of fuss, I provide a checkpoint.py script in the code/Ch04-05 directory.