
11 Faster PySpark: understanding Spark’s query planning

 

This chapter covers

  • How Spark uses CPU, RAM, and hard drive resources
  • Making better use of memory resources to speed up (or avoid slowing down) computations
  • Using the Spark UI to review useful information about your Spark installation
  • How Spark splits a job into stages, and how to profile and monitor those stages
  • Classifying transformations into narrow and wide operations, and how to reason about them
  • Using caching judiciously and avoiding the performance drops caused by improper caching

One of the best selling points of PySpark, as mentioned in chapter 1, is that you can treat your cluster like a single unit. Since the beginning of the book, we have focused on ingesting, processing, and harnessing the results of our data without caring much about where the processing happens. Spark provides a compelling abstraction over our cluster (or, when working locally, the single machine that acts as the driver).
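As a quick reminder of how little our code needs to know about the underlying hardware, here is a minimal sketch (the application name and the sample computation are arbitrary): the same session-building and query code runs unchanged locally or on a cluster, explain() surfaces the query plans discussed in section 11.1.5, and the Spark UI explored in section 11.1 is reachable through the driver.

from pyspark.sql import SparkSession

# A local session: Spark treats your machine as a one-node "cluster",
# so the same code runs unchanged when pointed at a real cluster later.
spark = SparkSession.builder.appName("chapter11").getOrCreate()

df = spark.range(100_000).withColumnRenamed("id", "value")

# explain(True) prints the parsed, analyzed, optimized, and physical plans
# that section 11.1.5 walks through.
df.groupBy((df.value % 10).alias("bucket")).count().explain(True)

# The Spark UI (usually http://localhost:4040 for a local session) exposes
# the Environment, Executors, and SQL tabs covered in section 11.1.
print(spark.sparkContext.uiWebUrl)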

11.1 Open sesame: navigating the Spark UI to understand the environment

11.1.1 Reviewing the configuration: the environment tab

11.1.2 Greater than the sum of its parts: the "Executors" tab and resource management

11.1.3 Look at what you’ve done: diagnosing a completed job via the Spark UI

11.1.4 Mapping the operations via Spark query plans: the SQL tab

11.1.5 The core of Spark: the parsed, analyzed, optimized, and physical plans

11.2 Thinking about performance: operations and memory

11.2.1 Narrow vs. wide operations

11.2.2 Caching a data frame: powerful, but often deadly (for performance)

11.3 Summary