This chapter covers:
- Launching and using the `pyspark` shell for interactive development
- Reading and ingesting data into a data frame
- Exploring data using the `DataFrame` structure
- Selecting columns using the `select()` method
- Filtering rows using the `where()` method
- Applying simple functions to your columns to modify the data they contain
- Reshaping singly-nested data into distinct records using `explode()`
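To make these items a little more concrete before we dive in, here is a minimal preview sketch of what they look like in PySpark code. The data frame, column names, and values are made up purely for illustration; we will build a real program step by step over the rest of the chapter.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# A tiny, made-up data frame so the sketch runs without any external file.
df = spark.createDataFrame(
    [("the quick brown fox",), ("jumps over the lazy dog",)], ["sentence"]
)

# select(): keep (and transform) a column, here splitting each sentence into words.
words = df.select(F.split(F.col("sentence"), " ").alias("words"))

# explode(): turn each element of the nested array into its own record.
exploded = words.select(F.explode(F.col("words")).alias("word"))

# where(): keep only the records that satisfy a condition.
exploded.where(F.col("word") != "the").show()
```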
Data-driven applications, no matter how complex, all boil down to what I like to call three meta-steps, which are easy to distinguish in a program:
- We start by ingesting or reading the data we wish to work with.
- We transform the data, whether through a few simple instructions or a very complex machine learning model.
- We then export the resulting data, either into a file to be fed into another application or by summarizing our findings into a visualization, as sketched below.
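As a rough illustration, here is what these three meta-steps can look like in PySpark. This is a sketch under made-up assumptions: the input file, the `name` column, and the output directory are placeholders, not data used later in the book.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Ingest: read the data into a data frame (placeholder CSV path).
raw = spark.read.csv("./data/sample.csv", header=True, inferSchema=True)

# 2. Transform: anything from a single instruction to a full ML pipeline.
#    Here, a trivial example: upper-case a hypothetical "name" column.
transformed = raw.select(F.upper(F.col("name")).alias("name_upper"))

# 3. Export: write the result somewhere another application (or a
#    visualization step) can pick it up.
transformed.write.mode("overwrite").csv("./output/names_upper")
```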
The next two chapters will introduce a basic PySpark workflow through the creation of a simple ETL program (Extract, Transform and Load, which is a more business-speak way of saying Ingest, Transform and Export). We will spend most of our time in the `pyspark` shell, interactively building our program one step at a time. Just like regular Python development, using the shell or REPL (I’ll use the terms interchangeably) provides rapid feedback and quick iteration. Once we are comfortable with the results, we will wrap our program so we can submit it in batch mode.
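When we move from the shell to batch mode, the main difference to account for is the entry point: the `pyspark` shell pre-creates a `SparkSession` named `spark`, while a standalone script has to build its own before it can be handed to `spark-submit`. A minimal skeleton (the application name here is an arbitrary placeholder) might look like this:

```python
from pyspark.sql import SparkSession

# In the pyspark shell, `spark` already exists; in a batch script we create it.
spark = SparkSession.builder.appName("my_first_etl").getOrCreate()

# ... the ingest/transform/export steps developed interactively go here ...

spark.stop()
```

Saved as a file, such a script could then be run in batch mode with `spark-submit my_first_etl.py`.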