chapter four

4 Analyzing tabular data with pyspark.sql

This chapter covers

Reading delimited data into a PySpark data frame
Understanding how PySpark represents tabular data in a data frame
Ingesting and exploring tabular or relational data
Selecting, manipulating, renaming and deleting columns in a data frame
Summarizing data frames for quick exploration

Our first example in chapters 2 and 3 worked with unstructured textual data. Each line of text was mapped to a record in a data frame and, through a series of transformations, we counted word frequencies from one (or several) text files. This chapter goes deeper into data transformation, this time using structured data. Data comes in many shapes and forms: we start with relational (or _tabular_footnote::[If we are being very picky, tabular and relational data are not exactly the same. In chapter 5, when joining multiple data frames together, the differences will matter. When working with a single table, we can lump those two concepts together.], or rows and columns) data, one of the most common formats, popularized by SQL and Excel. This chapter and the next follow the same blueprint as our first data analysis. We use some public Canadian television schedule data to identify and measure the proportion of commercials in the total programming.

4.1 What is tabular data?

4.1.1 How does PySpark represent tabular data?

4.2 PySpark for analyzing and processing tabular data

4.3 Reading and assessing delimited data in PySpark

4.3.1 A first pass at the `SparkReader` specialized for CSV

4.3.2 Customizing the `SparkReader` object to read CSV data files

4.3.3 Exploring the shape of our data universe

4.4 The basics of data manipulation: diagnosing our center table

4.4.1 Knowing what we want: selecting columns

4.4.2 Keeping what we need: deleting columns

4.4.3 Creating what’s not there: new columns with `withColumn()`

4.4.4 Tidying our data frame: renaming and re-ordering columns

4.4.5 Summarizing your data frame: `describe()` and `summary()`

4.5 Summary