Chapter 4. Analyzing data with pyspark.sql

This chapter covers:

Reading delimited data into a PySpark data frame
Ingesting and exploring tabular or relational data
Selecting, manipulating, renaming and deleting columns in a data frame
Performing simple joins between two data frames
Grouping data and computing summaries on a data frame

So far, in Chapters 2 and 3, we’ve dealt with textual data, which is unstructured. Through a chain of transformations, we extracted some information in order to get the most common words in the text. This chapter will go a little deeper into data manipulation using structured data, which is data that follows a set format. More specifically, we will work with tabular data, which follows the classical rows-and-columns layout. Just like in the two previous chapters, we’ll take a data set and answer a simple question by exploring and processing the data.

We’ll use some public Canadian television schedule data to identify commercials and measure their proportion of the total programming. The data used is typical of what you see in mainstream relational databases. We’ll build on our prior knowledge, push the envelope a little further, and by the end, you’ll know how to wrangle the most common type of data to answer your own questions.

4.1 What is tabular data?

4.1.1 How does PySpark represent tabular data?

4.2 PySpark for analyzing and processing data

4.3 Reading delimited data in PySpark

4.3.1 Customizing the `SparkReader` object to read DSV data

4.3.2 Exploring the shape of our data

4.4 Manipulating structured data

4.4.1 Selecting columns

4.4.2 Summarizing your data frame

4.4.3 Keeping what we need: deleting columns

4.4.4 Creating new columns with `withColumn()`

4.4.5 Renaming and re-ordering columns

4.5 From many to one: joining data

4.5.1 The blueprint of a simple join

4.6 Summarizing the data via groupby and GroupedData

4.6.1 A simple groupby blueprint

4.6.2 A column is a column: using agg with custom column definitions

4.6.3 Cache & Persist: Saving your progress

4.6.4 How caching works

4.6.5 Balancing caching and processing