
4 Analyzing tabular data with pyspark.sql


This chapter covers:

  • Reading delimited data into a PySpark data frame
  • Understanding how PySpark represents tabular data in a data frame
  • Ingesting and exploring tabular or relational data
  • Selecting, manipulating, renaming, and deleting columns in a data frame
  • Summarizing data frames for quick exploration

So far, in chapters 2 and 3, we’ve dealt with textual data, which is unstructured: through a chain of transformations, we extracted the most common words from the text. This chapter goes a little deeper into data manipulation using structured data, which is data that follows a set format. More specifically, we will work with tabular data, which follows the classic rows-and-columns layout. Just as in the two previous chapters, we’ll take a data set and answer a simple question by exploring and processing the data.
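
To preview the kind of work this chapter covers, here is a minimal sketch that reads a delimited file into a data frame and applies a few of the column manipulations listed above. The file name and column names used here (sample.csv, old_name, value, and so on) are hypothetical placeholders, not the chapter’s actual data set.

  from pyspark.sql import SparkSession
  import pyspark.sql.functions as F

  spark = SparkSession.builder.getOrCreate()

  # Read a comma-delimited file, using the first row as column names.
  # "sample.csv" and the column names below are placeholders.
  df = spark.read.csv("sample.csv", header=True, inferSchema=True)

  df = (
      df.withColumnRenamed("old_name", "name")            # rename a column
        .withColumn("value_doubled", F.col("value") * 2)  # create a new column
        .drop("unused_column")                            # delete a column
  )

  df.select("name", "value_doubled").show(5)  # keep only the columns we want
  df.describe().show()                        # quick summary statistics

Each of these methods returns a new data frame rather than modifying the one it is called on, which is why we chain the calls and reassign the result; we will see each of them in detail over the course of the chapter.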

4.1  What is tabular data?

4.1.1  How does PySpark represent tabular data?

4.2  PySpark for analyzing and processing tabular data

4.3  Reading delimited data in PySpark

4.3.1  Customizing the SparkReader object to read CSV data files

4.3.2  Exploring the shape of our data universe

4.4  The basics of data manipulation: diagnosing our centre table

4.4.1  Knowing what we want: selecting columns

4.4.2  Keeping what we need: deleting columns

4.4.3  Creating what’s not there: new columns with withColumn()

4.4.4  Tidying our data frame: renaming and re-ordering columns

4.4.5  Summarizing your data frame: describe() and summary()

4.5  Summary