This chapter covers
- Launching and using the pyspark shell for interactive development
- Reading and ingesting data into a data frame
- Exploring data using the DataFrame structure
- Selecting columns using the select() method
- Reshaping single-nested data into distinct records using explode()
- Applying simple functions to your columns to modify the data they contain
- Filtering columns using the where() method
Data-driven applications, no matter how complex, all boil down to what we can think of as three meta steps, which are easy to distinguish in a program (a short sketch follows this list):
- We start by loading or reading the data we wish to work with.
- We transform the data, either via a few simple instructions or a very complex machine learning model.
- We then export (or sink) the resulting data, either into a file or by summarizing our findings into a visualization.
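
To make these three steps concrete, here is a minimal PySpark sketch of a complete load-transform-export program. It uses the methods this chapter covers (select(), explode(), where()); the file name books.csv, the title column, and the output path are hypothetical placeholders, not part of any real data set.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("three_meta_steps").getOrCreate()

# Step 1. Load: read the source data into a data frame.
# (Assumes a hypothetical books.csv with a "title" column.)
books = spark.read.csv("books.csv", header=True, inferSchema=True)

# Step 2. Transform: split each title into words, explode each list of
# words into distinct records, and filter out empty strings.
words = (
    books.select(F.split(F.col("title"), " ").alias("words"))
    .select(F.explode(F.col("words")).alias("word"))
    .where(F.col("word") != "")
)

# Step 3. Export (sink): write the resulting records to disk.
words.write.mode("overwrite").csv("words_output")
```

Even in a program this small, the three meta steps are easy to tell apart: one statement reads, a chained block transforms, and one statement writes. The rest of this chapter fleshes out each of these pieces in turn.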