This chapter covers
- Launching and using the pyspark shell for interactive development
- Reading and ingesting data into a data frame
- Exploring data using the DataFrame structure
- Selecting columns using the select() method
- Reshaping single-nested data into distinct records using explode()
- Applying simple functions to your columns to modify the data they contain
- Filtering columns using the where() method
Data-driven applications, no matter how complex, all boil down to what we can think of as three meta steps, which are easy to distinguish in a program (a short sketch follows this list):
- We start by loading or reading the data we wish to work with.
- We transform the data, either via a few simple instructions or a very complex machine learning model.
- We then export (or sink) the resulting data, either into a file or by summarizing our findings into a visualization.
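
To make these three steps concrete, here is a minimal PySpark sketch of a complete load-transform-export program. It uses the methods this chapter covers (select(), explode(), where()); the file name books.csv, the title column, and the output path are hypothetical placeholders, not part of any real data set.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("three_meta_steps").getOrCreate()

# Step 1. Load: read the source data into a data frame.
# (Assumes a hypothetical books.csv with a "title" column.)
books = spark.read.csv("books.csv", header=True, inferSchema=True)

# Step 2. Transform: split each title into words, explode each list of
# words into distinct records, and filter out empty strings.
words = (
    books.select(F.split(F.col("title"), " ").alias("words"))
    .select(F.explode(F.col("words")).alias("word"))
    .where(F.col("word") != "")
)

# Step 3. Export (sink): write the resulting records to disk.
words.write.mode("overwrite").csv("words_output")
```

Even in a program this small, the three meta steps are easy to tell apart: one statement reads, a chained block transforms, and one statement writes. The rest of this chapter fleshes out each of these pieces in turn.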