This chapter covers
- How Spark uses CPU, RAM, and hard drive resources
- Using memory resources more effectively to speed up (or avoid slowing down) computations
- Using the Spark UI to review useful information about your Spark installation
- How Spark splits a job into stages and how to profile and monitor those stages
- Classifying transformations as narrow or wide operations and reasoning about them
- Using caching judiciously and avoiding the performance drops that come with improper caching
Imagine the following scenario: you write a readable, well-thought-out PySpark program. You submit it to your Spark cluster, and it runs.
You wait.
How can we peek under the hood and see the progression of our program? How can we troubleshoot which step is taking so much time? This chapter is about understanding how we can access information about our Spark instance, such as its configuration and layout (CPU, memory, etc.). On top of this, we follow the execution of a program from raw Python code to optimized Spark instructions. This knowledge will remove a lot of the magic from your program: you’ll be in a position to know what’s happening at every stage of your PySpark program. If your program takes too long, this chapter will show you where (and how) to look for the relevant information.
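As a taste of what is to come, here is a minimal sketch (assuming a local SparkSession and a hypothetical application name) of how you can already peek at your Spark instance from PySpark: it prints the configuration entries Spark resolved for the application and the address of the Spark UI, both of which we explore in depth in this chapter.

```python
from pyspark.sql import SparkSession

# Assumption: a local SparkSession; the application name is illustrative.
spark = SparkSession.builder.appName("peek-under-the-hood").getOrCreate()

# Every configuration entry Spark resolved for this application
# (driver memory, number of cores, shuffle settings, and so on).
for key, value in spark.sparkContext.getConf().getAll():
    print(f"{key} = {value}")

# The address of the Spark UI for this application
# (usually http://localhost:4040 when running locally).
print(spark.sparkContext.uiWebUrl)
```

Pointing your browser at the printed URL opens the Spark UI, which we use throughout the chapter to inspect jobs, stages, and storage.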