This chapter covers
- Reading delimited data into a PySpark data frame
- Understanding how PySpark represents tabular data in a data frame
- Ingesting and exploring tabular or relational data
- Selecting, manipulating, renaming, and deleting columns in a data frame
- Summarizing data frames for quick exploration
Our first example, in chapters 2 and 3, worked with unstructured textual data. Each line of text was mapped to a record in a data frame and, through a series of transformations, we counted word frequencies from one (and multiple) text files.

This chapter goes deeper into data transformation, this time using structured data. Data comes in many shapes and forms: we start with relational (or _tabular_footnote::[If we are being very picky, tabular and relational data are not exactly the same. In chapter 5, when we join multiple data frames together, the differences will matter. When working with a single table, we can lump the two concepts together.], or rows and columns) data, one of the most common formats, popularized by SQL and Excel. This chapter and the next follow the same blueprint as our first data analysis: we use public Canadian television schedule data to identify and measure the proportion of commercials relative to the total programming.
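As a preview of where this chapter is headed, here is a minimal sketch of reading a delimited file into a data frame. The file path, delimiter, and options are placeholder assumptions for illustration; the chapter introduces the actual data set and each option step by step.

[source,python]
----
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tabular_data").getOrCreate()

# Read a delimited file into a data frame. The path and delimiter
# here are hypothetical: adapt them to your own data set.
logs = spark.read.csv(
    "data/broadcast_logs.csv",
    sep="|",           # the character separating each column
    header=True,       # the first row contains the column names
    inferSchema=True,  # let Spark guess the type of each column
)

logs.printSchema()  # display the column names and inferred types
logs.show(5)        # peek at the first five records
----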