11 Faster PySpark: Understanding Spark’s query planning

 

This chapter covers

  • How Spark uses CPU, RAM, and hard drive resources
  • Using memory resources better to speed up (or avoid slowing down) computations
  • Using the Spark UI to review useful information about your Spark installation
  • How Spark splits a job into stages and how to profile and monitor those stages
  • Classifying transformations into narrow and wide operations and how to reason about them
  • Using caching judiciously and avoiding the performance drops that improper caching causes

Imagine the following scenario: you write a readable, well-thought-out PySpark program. You submit it to your Spark cluster, and it runs. You wait.

How can we peek under the hood and see how our program is progressing? How can we find out which step is taking a lot of time? This chapter is about understanding how we can access information about our Spark instance, such as its configuration and layout (CPU, memory, etc.). We also follow the execution of a program from raw Python code to optimized Spark instructions. This knowledge will remove a lot of magic from your program; you’ll be in a position to know what’s happening at every stage of your PySpark job. If your program takes too long, this chapter will show you where (and how) to look for the relevant information.

11.1 Open sesame: Navigating the Spark UI to understand the environment

 
 
 

11.1.1 Reviewing the configuration: The environment tab

 
 
 

11.1.2 Greater than the sum of its parts: The Executors tab and resource management

 

11.1.3 Look at what you’ve done: Diagnosing a completed job via the Spark UI

 
 
 

11.1.4 Mapping the operations via Spark query plans: The SQL tab

 
 
 

11.1.5 The core of Spark: The parsed, analyzed, optimized, and physical plans

 
 

11.2 Thinking about performance: Operations and memory

 
 
 
 

11.2.1 Narrow vs. wide operations

 

11.2.2 Caching a data frame: Powerful, but often deadly (for perf)

 
 
 

Summary

 
 