Chapter 11
This chapter covers
- How Spark uses CPU, RAM, and hard drive resources
- Making better use of memory resources to speed up (or avoid slowing down) computations
- Using the Spark UI to review useful information about your Spark installation
- How Spark splits a job into stages and how to profile and monitor those stages
- Classifying transformations into narrow and wide operations and reasoning about them
- Using caching judiciously and avoiding the performance drops caused by improper caching
One of the best selling points of PySpark, as mentioned in chapter 1, is that you can treat your cluster like a single unit. Since the beginning of the book, we have focused on ingesting, processing, and harnessing the results of our data without caring much about where the processing happens. Spark provides a compelling abstraction over our cluster (or, when working locally, our local driver).
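As a quick reminder of what that abstraction looks like in practice, here is a minimal sketch (the application name and the bucketing column are arbitrary choices for illustration): the same PySpark code runs unchanged whether the SparkSession points at a single local driver or a full cluster.

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

# Build (or reuse) a SparkSession. Locally, everything runs on the driver;
# on a cluster, the same code fans out across the executors.
spark = SparkSession.builder.appName("cluster-as-one-unit").getOrCreate()

# A small data frame and an aggregation: nowhere do we say *where*
# each record lives or which machine performs the work.
df = spark.range(1_000_000).withColumn("bucket", F.col("id") % 10)
df.groupBy("bucket").count().show()
```

That convenience is exactly why it pays to understand what Spark does with our CPU, memory, and disk behind the scenes, which is the subject of this chapter.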