chapter four

4 Analyzing tabular data with pyspark.sql

 

This chapter covers

  • Reading delimited data into a PySpark data frame
  • Understanding how PySpark represents tabular data in a data frame
  • Ingesting and exploring tabular or relational data
  • Selecting, manipulating, renaming and deleting columns in a data frame
  • Summarizing data frames for quick exploration

Our first example, in chapters 2 and 3, worked with unstructured textual data. Each line of text was mapped to a record in a data frame and, through a series of transformations, we counted word frequencies from one or more text files. This chapter goes deeper into data transformation, this time using structured data. Data comes in many shapes and forms: we start with relational (or _tabular_, or rows-and-columns) data, one of the most common formats, popularized by SQL and Excel. (If we are being very picky, tabular and relational data are not exactly the same. In chapter 5, when we join multiple data frames together, the differences will matter. When working with a single table, we can lump the two concepts together.) This chapter and the next follow the same blueprint as our first data analysis. We use public Canadian television schedule data to identify and measure the proportion of commercials over the total programming.

4.1 What is tabular data?

4.1.1 How does PySpark represent tabular data?

4.2 PySpark for analyzing and processing tabular data

4.3 Reading and assessing delimited data in PySpark

4.3.1 A first pass at the SparkReader specialized for CSV

4.3.2 Customizing the SparkReader object to read CSV data files

4.3.3 Exploring the shape of our data universe

4.4 The basics of data manipulation: diagnosing our center table

4.4.1 Knowing what we want: selecting columns

4.4.2 Keeping what we need: deleting columns

4.4.3 Creating what’s not there: new columns with withColumn()

4.4.4 Tidying our data frame: renaming and re-ordering columns

4.4.5 Summarizing your data frame: describe() and summary()

4.5 Summary