2 Your first data program in PySpark

 

This chapter covers:

  • Launching and using the pyspark shell for interactive development
  • Reading and ingesting data into a data frame
  • Exploring data using the DataFrame structure
  • Selecting columns using the select() method
  • Filtering rows using the where() method
  • Applying simple functions to your columns to modify the data they contain
  • Reshaping singly-nested data into distinct records using explode()

Data-driven applications, no matter how complex, all boil down to what I like to call three meta-steps, which are easy to distinguish in a program.

  1. We start by ingesting or reading the data we wish to work with.
  2. We transform the data, either via a few simple instructions or a very complex machine learning model.
  3. We then export the resulting data, either into a file to be fed into an app or by summarizing our findings into a visualization.
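To make the three meta-steps concrete before we touch Spark, here is a minimal sketch in plain Python. It is not PySpark code. The function names (`ingest`, `transform`, `export`) and the sample text are assumptions chosen for illustration, and the transformation loosely mirrors the steps this chapter walks through later (splitting sentences into words, one word per record, lowercasing, removing punctuation, filtering out empty values).

```python
import io
import string


def ingest(source):
    """Meta-step 1: read raw lines from any file-like object."""
    return [line.rstrip("\n") for line in source]


def transform(lines):
    """Meta-step 2: turn lines of text into clean, lowercase words."""
    words = []
    for line in lines:
        for token in line.split(" "):
            # Lowercase, then trim punctuation from both ends of the token.
            word = token.lower().strip(string.punctuation)
            if word:  # filter out tokens that are empty after cleaning
                words.append(word)  # one word per record
    return words


def export(words, sink):
    """Meta-step 3: write the result, one word per line."""
    for word in words:
        sink.write(word + "\n")


# Hypothetical in-memory input and output, standing in for real files.
raw = io.StringIO("Hello, world!\nHello again.\n")
out = io.StringIO()
export(transform(ingest(raw)), out)
print(out.getvalue())
```

Keeping each meta-step in its own function makes the boundaries easy to see. The PySpark program we build in this chapter follows the same shape, with a data frame flowing between the steps instead of Python lists.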

The next two chapters will introduce a basic workflow with PySpark via the creation of a simple ETL (Extract, Transform and Load, which is a more business-speak way of saying Ingest, Transform and Export). We will spend most of our time at the pyspark shell, interactively building our program one step at a time. Just as in regular Python development, working in the shell or REPL (I’ll use the terms interchangeably) provides rapid feedback and lets us progress quickly. Once we are comfortable with the results, we will wrap our program so we can submit it in batch mode.

2.1  Setting up the pyspark shell

 

2.1.1  The SparkSession entry-point

 
 
 
 

2.1.2  Configuring how chatty Spark is: the log level

 
 
 

2.2  Mapping our program

 
 
 

2.3  Reading and ingesting data into a data frame

 
 
 

2.4  Exploring data in the DataFrame structure

 

2.4.1  Peeking under the hood: the show() method

 
 
 
 

2.5  Moving from a sentence to a list of words

 
 

2.5.1  Selecting specific columns using select()

 
 
 

2.5.2  Transforming columns: splitting a string into a list of words

 
 

2.5.3  Renaming columns: alias and withColumnRenamed

 

2.6  Reshaping your data: exploding a list into rows

 
 
 

2.7  Working with words: changing case and removing punctuation

 
 
 
 

2.8  Filtering rows

 