This chapter covers:
- Launching and using the `pyspark` shell for interactive development
- Reading and ingesting data into a data frame
- Exploring data using the `DataFrame` structure
- Selecting columns using the `select()` method
- Filtering rows using the `where()` method
- Applying simple functions to your columns to modify the data they contain
- Reshaping singly-nested data into distinct records using `explode()`
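To make these items a little more concrete before we dive in, here is a minimal preview sketch of what they look like in PySpark code. The data frame, column names, and values are made up purely for illustration; we will build a real program step by step over the rest of the chapter.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# A tiny, made-up data frame so the sketch runs without any external file.
df = spark.createDataFrame(
    [("the quick brown fox",), ("jumps over the lazy dog",)], ["sentence"]
)

# select(): keep (and transform) a column, here splitting each sentence into words.
words = df.select(F.split(F.col("sentence"), " ").alias("words"))

# explode(): turn each element of the nested array into its own record.
exploded = words.select(F.explode(F.col("words")).alias("word"))

# where(): keep only the records that satisfy a condition.
exploded.where(F.col("word") != "the").show()
```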
Data-driven applications, no matter how complex, all boil down to what I like to call three meta-steps, which are easy to distinguish in a program:
- We start by ingesting or reading the data we wish to work with.
- We transform the data, whether through a few simple instructions or a very complex machine learning model.
- We then export the resulting data, either into a file to be fed into another application or by summarizing our findings into a visualization, as sketched below.
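As a rough illustration, here is what these three meta-steps can look like in PySpark. This is a sketch under made-up assumptions: the input file, the `name` column, and the output directory are placeholders, not data used later in the book.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.getOrCreate()

# 1. Ingest: read the data into a data frame (placeholder CSV path).
raw = spark.read.csv("./data/sample.csv", header=True, inferSchema=True)

# 2. Transform: anything from a single instruction to a full ML pipeline.
#    Here, a trivial example: upper-case a hypothetical "name" column.
transformed = raw.select(F.upper(F.col("name")).alias("name_upper"))

# 3. Export: write the result somewhere another application (or a
#    visualization step) can pick it up.
transformed.write.mode("overwrite").csv("./output/names_upper")
```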
The next two chapters will introduce a basic PySpark workflow through the creation of a simple ETL program (Extract, Transform and Load, which is a more business-speak way of saying Ingest, Transform and Export). We will spend most of our time in the `pyspark` shell, interactively building our program one step at a time. Just like regular Python development, using the shell or REPL (I’ll use the terms interchangeably) provides rapid feedback and quick iteration. Once we are comfortable with the results, we will wrap our program so we can submit it in batch mode.
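When we move from the shell to batch mode, the main difference to account for is the entry point: the `pyspark` shell pre-creates a `SparkSession` named `spark`, while a standalone script has to build its own before it can be handed to `spark-submit`. A minimal skeleton (the application name here is an arbitrary placeholder) might look like this:

```python
from pyspark.sql import SparkSession

# In the pyspark shell, `spark` already exists; in a batch script we create it.
spark = SparkSession.builder.appName("my_first_etl").getOrCreate()

# ... the ingest/transform/export steps developed interactively go here ...

spark.stop()
```

Saved as a file, such a script could then be run in batch mode with `spark-submit my_first_etl.py`.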