Chapter 4. Analyzing data with pyspark.sql
This chapter covers:
- Reading delimited data into a PySpark data frame
- Ingesting and exploring tabular or relational data
- Selecting, manipulating, renaming and deleting columns in a data frame
- Performing simple joins between two data frames
- Grouping data and computing summary statistics on a data frame
So far, in Chapters 2 and 3, we’ve dealt with textual data, which is unstructured. Through a chain of transformations, we extracted the most common words in the text. This chapter goes a little deeper into data manipulation, using structured data: data that follows a set format. More specifically, we will work with tabular data, which follows the classic rows-and-columns layout. Just like in the two previous chapters, we’ll take a data set and answer a simple question by exploring and processing the data.
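As a preview of the first step, here is a minimal sketch of reading a delimited file into a data frame. The file name `sample.csv` and the option choices are placeholders rather than the chapter’s actual data set:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Ch04 - tabular data").getOrCreate()

# Read a comma-delimited file with a header row into a data frame,
# asking Spark to infer a type for each column.
# "sample.csv" is a placeholder path, not the chapter's data set.
df = spark.read.csv("sample.csv", header=True, inferSchema=True)

df.printSchema()  # Display the inferred column names and types
df.show(5)        # Peek at the first five records
```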
We’ll use some public Canadian television schedule data to identify commercials and measure their share of total programming time. The data is typical of what you’d get out of a mainstream relational database. We’ll build on our prior knowledge, push the envelope a little further, and by the end of the chapter you’ll know how to wrangle the most common type of data to answer your own questions.
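To give a flavor of where we’re headed, here is a hedged sketch of the kind of aggregation we’ll build up to: computing the share of air time taken by commercials. It reuses the `df` from the previous sketch; the column names `duration_seconds` and `program_class`, and the code `"COM"` for commercials, are invented for illustration, not the data set’s real schema:

```python
import pyspark.sql.functions as F

# Hypothetical columns: program_class flags the kind of segment
# ("COM" for a commercial) and duration_seconds gives its length.
# Divide the seconds spent on commercials by the total seconds aired.
ratio = df.agg(
    (
        F.sum(
            F.when(
                F.col("program_class") == "COM", F.col("duration_seconds")
            ).otherwise(0)
        )
        / F.sum(F.col("duration_seconds"))
    ).alias("commercial_ratio")
)

ratio.show()  # A single-row data frame holding the proportion
```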