2 Your first data program in PySpark

 

This chapter covers

  • Launching and using the pyspark shell for interactive development
  • Reading and ingesting data into a data frame
  • Exploring data using the DataFrame structure
  • Selecting columns using the select() method
  • Reshaping single-nested data into distinct records using explode()
  • Applying simple functions to your columns to modify the data they contain
  • Filtering rows using the where() method

Data-driven applications, no matter how complex, all boil down to what we can think of as three meta steps, which are easy to distinguish in a program (and which we sketch in code right after this list):

  1. We start by loading or reading the data we wish to work with.
  2. We transform the data, either via a few simple instructions or a very complex machine learning model.
  3. We then export (or sink) the resulting data, either into a file or by summarizing our findings into a visualization.
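
In PySpark, those three steps map directly onto data frame operations. The sketch below is illustrative only, not this chapter's example: the file names and the column name are placeholder assumptions.

  from pyspark.sql import SparkSession
  import pyspark.sql.functions as F

  spark = SparkSession.builder.getOrCreate()

  # 1. Read: load the data into a data frame ("input.csv" is hypothetical).
  df = spark.read.csv("input.csv", header=True)

  # 2. Transform: here, a single simple instruction (assumes a "name" column).
  df = df.select(F.upper(F.col("name")).alias("name"))

  # 3. Sink: export the resulting data ("output" is a hypothetical directory).
  df.write.csv("output")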

2.1 Setting up the PySpark shell
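
If PySpark is installed and on your PATH, launching the shell is a single command. A minimal sketch of a session follows; your startup banner and the memory address will differ.

  $ pyspark
  [... startup banner elided ...]
  >>> spark
  <pyspark.sql.session.SparkSession object at 0x...>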

 
 
 
 

2.1.1 The SparkSession entry point
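
The pyspark shell creates a SparkSession for us and binds it to the name spark. Outside the shell, in a standalone program, we build one ourselves. A minimal sketch (the application name here is an arbitrary assumption):

  from pyspark.sql import SparkSession

  spark = (SparkSession
           .builder
           .appName("My first data program")  # hypothetical name, shown in the Spark UI
           .getOrCreate())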

 
 
 

2.1.2 Configuring how chatty Spark is: The log level
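
As a quick illustration, the log level is set on the SparkContext attached to our session; it is a one-liner, assuming the spark entry point from section 2.1.1:

  # Valid levels include ALL, DEBUG, ERROR, FATAL, INFO, OFF, TRACE, and WARN.
  spark.sparkContext.setLogLevel("WARN")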

 
 
 

2.2 Mapping our program
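
The section titles ahead suggest the shape of the program we are about to build: turning lines of text into clean, individual words. A sketch of that plan as comments (our reading of the outline, not a listing from the chapter):

  # 2.3   Ingest:  spark.read.text(...)  -> one line of text per record
  # 2.4.2 Split:   split() each line into an array of words
  # 2.4.4 Explode: explode() each array into one word per record
  # 2.4.5 Clean:   lowercase each word and strip punctuation
  # 2.5   Filter:  where() to drop empty records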

 
 
 
 

2.3 Ingest and explore: Setting the stage for data transformation

 
 
 

2.3.1 Reading data into a data frame with spark.read
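
A minimal sketch of text ingestion, assuming the spark entry point from the shell; "sample.txt" is a placeholder path, not the chapter's data set:

  # Each line of the file becomes one record in a single "value" column.
  book = spark.read.text("sample.txt")

  book.printSchema()
  # root
  #  |-- value: string (nullable = true)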

 
 
 

2.3.2 From structure to content: Exploring our data frame with show()
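
A quick sketch, assuming the book data frame from the previous sketch; both parameters of show() are optional:

  # Display the first 5 records, truncating each cell after 50 characters.
  book.show(5, truncate=50)

  # Called with no arguments, show() prints 20 rows and truncates cells
  # at 20 characters.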

 
 
 
 

2.4 Simple column transformations: Moving from a sentence to a list of words

 
 
 

2.4.1 Selecting specific columns using select()
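
As a sketch (again assuming the book data frame), PySpark accepts several equivalent ways to refer to a column in select():

  from pyspark.sql.functions import col

  # All four expressions select the same single column.
  book.select("value")
  book.select(book.value)
  book.select(book["value"])
  book.select(col("value"))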

 
 
 
 

2.4.2 Transforming columns: Splitting a string into a list of words
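
A minimal sketch, assuming the book data frame with its value column; split() takes a column and a regular-expression pattern:

  from pyspark.sql.functions import col, split

  # Each record now holds an array of words instead of a single string.
  lines = book.select(split(col("value"), " ").alias("line"))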

 
 
 

2.4.3 Renaming columns: alias and withColumnRenamed
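
A sketch of the two renaming routes, reusing the lines data frame from the previous sketch; the new column name is an arbitrary choice:

  from pyspark.sql.functions import col, split

  # alias() renames a column inside an expression, at selection time...
  lines = book.select(split(col("value"), " ").alias("line"))

  # ...while withColumnRenamed() renames an existing column on a data frame.
  lines = lines.withColumnRenamed("line", "line_of_words")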

 
 

2.4.4 Reshaping your data: Exploding a list into rows
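
A minimal sketch, assuming the lines data frame in which each record holds an array of words:

  from pyspark.sql.functions import col, explode

  # Each element of the array becomes its own record: one word per row.
  words = lines.select(explode(col("line")).alias("word"))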

 
 
 

2.4.5 Working with words: Changing case and removing punctuation
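
A sketch of both clean-up steps, assuming the words data frame from section 2.4.4; the regular expression "[a-z]+" is an illustrative choice that keeps only runs of letters:

  from pyspark.sql.functions import col, lower, regexp_extract

  # Normalize case first...
  words_lower = words.select(lower(col("word")).alias("word_lower"))

  # ...then keep the first run of letters in each token, dropping punctuation.
  # Tokens with no letters at all come back as empty strings.
  words_clean = words_lower.select(
      regexp_extract(col("word_lower"), "[a-z]+", 0).alias("word")
  )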

 
 

2.5 Filtering rows
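
A minimal sketch, assuming the words_clean data frame from section 2.4.5: tokens that were pure punctuation are now empty strings, so we filter them out.

  from pyspark.sql.functions import col

  words_nonblank = words_clean.where(col("word") != "")

  # filter() is an exact alias of where(); either spelling works.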

 
 

Summary

 
 
 