In chapter 4, we looked at how we can transform a data frame by selecting, dropping, creating, renaming, reordering, and summarizing columns. Those operations form the foundation for working with a data frame in PySpark. In this chapter, I complete the review of the most common data frame operations: linking (or joining) data frames, and grouping data (and performing operations on the resulting GroupedData object). We conclude the chapter by wrapping our exploratory program into a single script we can submit, just as we did in chapter 3. The skills covered here round out the set of fundamental operations you will use in your day-to-day work transforming data.
We pick up the same logs data frames we left off with in chapter 4. In practical terms, this chapter's code enriches our table with the relevant information contained in the link tables and then summarizes it into relevant groups, using what can be considered a more powerful version of the describe() method I showed in chapter 4. If you want to catch up with a minimal amount of fuss, I provide a checkpoint.py script in the code/Ch04-05 directory.