This chapter covers
- Joining two data frames together.
- Selecting the right type of join for your use case.
- Grouping data and understanding the GroupedData transitional object.
- Breaking the GroupedData with an aggregation method and getting a summarized data frame.
- Filling null values in your data frame.
In chapter 4, we looked at how we can transform a data frame through selecting, dropping, creating, renaming, re-ordering, and summarizing columns. Those operations constitute the foundation of working with a data frame in PySpark. In this chapter, I complete the review of the most common operations you will perform on a data frame: linking or joining data frames together, as well as grouping data (and performing operations on the GroupedData object). We conclude the chapter by wrapping our exploratory program into a single script we can submit, just as we did in chapter 3. The skills learned in this chapter complete the set of fundamental operations you will use in your day-to-day work transforming data.
We use the same logs table that we left off with in chapter 4. In practical terms, this chapter's code enriches our table with the relevant information contained in the link tables and then summarizes it into relevant groups, using what can be considered a more powerful version of the describe() method I showed in chapter 4. If you want to catch up with a minimal amount of fuss, I provide an end_of_chapter.py script in the src/Ch04 directory.