4 Analyzing tabular data with pyspark.sql

 

This chapter covers

  • Reading delimited data into a PySpark data frame
  • Understanding how PySpark represents tabular data in a data frame
  • Ingesting and exploring tabular or relational data
  • Selecting, manipulating, renaming, and deleting columns in a data frame
  • Summarizing data frames for quick exploration

Our first example, in chapters 2 and 3, worked with unstructured textual data. Each line of text was mapped to a record in a data frame and, through a series of transformations, we counted word frequencies from one (and then multiple) text files. This chapter goes deeper into data transformation, this time using structured data. Data comes in many shapes and forms: we start with relational (or tabular,1 or rows-and-columns) data, one of the most common formats, popularized by SQL and Excel. This chapter and the next follow the same blueprint as our first data analysis. We use the public Canadian television schedule data to identify and measure the proportion of commercials in the total programming.
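As a preview of where this chapter is headed, here is a minimal sketch of the kind of code we will build up, step by step, in the sections that follow: reading a delimited file into a data frame, then selecting, renaming, dropping, and summarizing columns. The file path, separator, and column names below are assumptions for illustration only; the chapter introduces the actual data set and walks through each operation in detail.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Ch04 - tabular data").getOrCreate()

# Read a pipe-delimited file, using the first row as column headers and
# letting Spark infer column types from a sample of the data.
logs = spark.read.csv(
    "./data/broadcast_logs/BroadcastLogs_2018_Q3_M8.CSV",  # hypothetical path
    sep="|",
    header=True,
    inferSchema=True,
)

# Keep a handful of columns, rename one, and drop another.
logs = (
    logs.select("BroadcastLogID", "LogServiceID", "LogDate", "Duration")
    .withColumnRenamed("Duration", "duration")
    .drop("BroadcastLogID")
)

# Quick numerical summary of the remaining columns.
logs.describe().show()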

4.1 What is tabular data?

4.1.1 How does PySpark represent tabular data?

4.2 PySpark for analyzing and processing tabular data

4.3 Reading and assessing delimited data in PySpark

4.3.1 A first pass at the SparkReader specialized for CSV files

4.3.2 Customizing the SparkReader object to read CSV data files

4.3.3 Exploring the shape of our data universe

4.4 The basics of data manipulation: Selecting, dropping, renaming, ordering, diagnosing

4.4.1 Knowing what we want: Selecting columns

4.4.2 Keeping what we need: Deleting columns

4.4.3 Creating what’s not there: New columns with withColumn()

4.4.4 Tidying our data frame: Renaming and reordering columns

4.4.5 Diagnosing a data frame with describe() and summary()

Summary

Additional exercises

Exercise 4.3

Exercise 4.4