This chapter covers
- Joining two data frames together.
- Selecting the right type of join for your use case.
- Grouping data and understanding the GroupedData transitional object.
- Breaking the GroupedData with an aggregation method and getting a summarized data frame.
- Filling null values in your data frame.
In chapter 4, we looked at how we can transform a data frame through selecting, dropping, creating, renaming, re-ordering, and summarizing columns. Those operations constitute the foundation of working with a data frame in PySpark. In this chapter, I complete the review of the most common operations you will perform on a data frame: linking or joining data frames together, as well as grouping data (and performing operations on the GroupedData object). We conclude the chapter by wrapping our exploratory program into a single script we can submit, just as we did in chapter 3. The skills learned in this chapter complete the set of fundamental operations you will use in your day-to-day work transforming data.
We use the same logs table that we left off with in chapter 4. In practical terms, this chapter's code enriches our table with the relevant information contained in the link tables and then summarizes it into relevant groups, using what can be considered a more powerful version of the describe() method I showed in chapter 4. If you want to catch up with a minimal amount of fuss, I provide an end_of_chapter.py script in the src/Ch04 directory.