
4 Analyzing tabular data with pyspark.sql


This chapter covers:

  • Reading delimited data into a PySpark data frame
  • Understanding how PySpark represents tabular data in a data frame
  • Ingesting and exploring tabular or relational data
  • Selecting, manipulating, renaming, and deleting columns in a data frame
  • Summarizing data frames for quick exploration

So far, in chapters 2 and 3, we’ve dealt with textual data, which is unstructured: through a chain of transformations, we extracted the most common words from the text. This chapter goes a little deeper into data manipulation using structured data, which is data that follows a set format. More specifically, we will work with tabular data, which follows the classic rows-and-columns layout. Just as in the two previous chapters, we’ll take a data set and answer a simple question by exploring and processing the data.
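
To preview the kind of work this chapter covers, here is a minimal sketch that reads a delimited file into a data frame and applies a few of the column manipulations listed above. The file name and column names used here (sample.csv, old_name, value, and so on) are hypothetical placeholders, not the chapter’s actual data set.

  from pyspark.sql import SparkSession
  import pyspark.sql.functions as F

  spark = SparkSession.builder.getOrCreate()

  # Read a comma-delimited file, using the first row as column names.
  # "sample.csv" and the column names below are placeholders.
  df = spark.read.csv("sample.csv", header=True, inferSchema=True)

  df = (
      df.withColumnRenamed("old_name", "name")            # rename a column
        .withColumn("value_doubled", F.col("value") * 2)  # create a new column
        .drop("unused_column")                            # delete a column
  )

  df.select("name", "value_doubled").show(5)  # keep only the columns we want
  df.describe().show()                        # quick summary statistics

Each of these methods returns a new data frame rather than modifying the one it is called on, which is why we chain the calls and reassign the result; we will see each of them in detail over the course of the chapter.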

4.1  What is tabular data?

4.1.1  How does PySpark represent tabular data?

4.2  PySpark for analyzing and processing tabular data

4.3  Reading delimited data in PySpark

4.3.1  Customizing the SparkReader object to read CSV data files

4.3.2  Exploring the shape of our data universe

4.4  The basics of data manipulation: diagnosing our centre table

4.4.1  Knowing what we want: selecting columns

4.4.2  Keeping what we need: deleting columns

4.4.3  Creating what’s not there: new columns with withColumn()

4.4.4  Tidying our data frame: renaming and re-ordering columns

4.4.5  Summarizing your data frame: describe() and summary()

4.5  Summary