16 Cache and checkpoint: enhancing Spark's performance
This chapter covers
- Caching and checkpointing to enhance Spark’s performance.
- Choosing the right method to enhance performance.
- Collecting performance information.
- Picking the right spot to use a cache or checkpoint.
- Using collect() and collectAsList() wisely.
Spark is fast: it processes data with ease, whether across multiple nodes in a cluster or on your laptop. Spark also loves memory; in-memory processing is a key element of its design and a major contributor to its performance. However, as your datasets grow from the samples you use to develop your applications to full production datasets, you may notice that performance degrades.
In this chapter, you get some foundational knowledge about how Spark uses memory. This knowledge will help you optimize your applications.
You will first use caching and checkpointing in an application with dummy data. This first lab will help you better understand the different optimization techniques you can apply to your applications.
You will then switch to a real-life example with real-life data. In this second lab, you will run some analytical operations against a dataset containing economic information from Brazil.
Finally, you will read about other considerations to keep in mind when optimizing certain workloads. I also share hints for increasing performance, as well as pointers for going further.
Lab
Examples from this chapter are available in GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch16.