16 Cache and checkpoint: enhancing Spark's performance
This chapter covers
- Caching and checkpointing to enhance Spark’s performance.
- Choosing the right method to enhance performance.
- Collecting performance information.
- Picking the right spot to use a cache or checkpoint.
- Using collect() and collectAsList() wisely.
Spark is fast: it processes data with ease, whether across multiple nodes in a cluster or on your laptop. Spark also loves memory; in-memory processing is a key element of its design and a major contributor to its performance. However, as your datasets grow from the samples you use to develop your applications to full production datasets, you may notice that performance degrades.
In this chapter, you get some foundational knowledge about how Spark uses memory. This knowledge will help you optimize your applications.
You will first use caching and checkpointing in an application with dummy data. This first lab will help you better understand the different optimization techniques you can apply to your applications.
You will then switch to a real-life example with real-life data. In this second lab, you will run some analytical operations against a dataset containing economic information from Brazil.
Finally, you will read about other considerations to keep in mind when optimizing certain workloads. I also share hints for increasing performance, as well as pointers for going further.
Lab
Examples from this chapter are available in GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch16.