This chapter covers
- How Spark uses CPU, RAM, and hard drive resources
- Using memory resources more effectively to speed up (or avoid slowing down) computations
- Using the Spark UI to review useful information about your Spark installation
- How Spark splits a job into stages and how to profile and monitor those stages
- Classifying transformations as narrow or wide operations and reasoning about them
- Using caching judiciously and avoiding the performance drops that come with improper caching
Imagine the following scenario: you write a readable, well-thought-out PySpark program. You submit it to your Spark cluster, and it runs.
You wait.
How can we peek under the hood and see the progression of our program? How can we troubleshoot which step is taking so much time? This chapter is about understanding how we can access information about our Spark instance, such as its configuration and layout (CPU, memory, etc.). On top of this, we follow the execution of a program from raw Python code to optimized Spark instructions. This knowledge will remove a lot of the magic from your program: you’ll be in a position to know what’s happening at every stage of your PySpark program. If your program takes too long, this chapter will show you where (and how) to look for the relevant information.
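As a taste of what is to come, here is a minimal sketch (assuming a local SparkSession and a hypothetical application name) of how you can already peek at your Spark instance from PySpark: it prints the configuration entries Spark resolved for the application and the address of the Spark UI, both of which we explore in depth in this chapter.

```python
from pyspark.sql import SparkSession

# Assumption: a local SparkSession; the application name is illustrative.
spark = SparkSession.builder.appName("peek-under-the-hood").getOrCreate()

# Every configuration entry Spark resolved for this application
# (driver memory, number of cores, shuffle settings, and so on).
for key, value in spark.sparkContext.getConf().getAll():
    print(f"{key} = {value}")

# The address of the Spark UI for this application
# (usually http://localhost:4040 when running locally).
print(spark.sparkContext.uiWebUrl)
```

Pointing your browser at the printed URL opens the Spark UI, which we use throughout the chapter to inspect jobs, stages, and storage.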