Chapter four
This chapter covers:
- Reading delimited data into a PySpark data frame
- Understanding how PySpark represents tabular data in a data frame
- Ingesting and exploring tabular or relational data
- Selecting, manipulating, renaming and deleting columns in a data frame
- Summarizing data frames for quick exploration
So far, in chapters 2 and 3, we’ve dealt with textual data, which is unstructured. Through a chain of transformations, we extracted some information to get the most common words in the text. This chapter goes a little deeper into data manipulation using structured data, which is data that follows a set format. More specifically, we will work with tabular data, which follows the classical rows-and-columns layout. Just as in the two previous chapters, we’ll take a data set and answer a simple question by exploring and processing the data.
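To make the rows-and-columns idea concrete before we bring PySpark in, here is a minimal sketch in plain Python (standard library only, not PySpark). The sample records and the column names `name`, `city`, and `age` are invented for illustration; they are not the data set used in this chapter.

```python
import csv
import io

# Hypothetical delimited data, made up for this illustration only.
raw = "name,city,age\nAda,London,36\nGrace,New York,45\n"

reader = csv.reader(io.StringIO(raw))
header = next(reader)   # the first record holds the column names
rows = list(reader)     # each remaining record is one row

# Because every row follows the same format, a whole column
# can be addressed by its name.
age_index = header.index("age")
ages = [int(row[age_index]) for row in rows]

print(header)
print(rows)
print(ages)
```

Every row has the same number of fields in the same order, which is what lets us talk about a column such as `age` as a unit. Unstructured text offers no such guarantee.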