16. Cache and checkpoint: Enhancing Spark’s performance


This chapter covers

  • Caching and checkpointing to enhance Spark’s performance
  • Choosing the right method to enhance performance
  • Collecting performance information
  • Picking the right spot to use a cache or checkpoint
  • Using collect() and collectAsList() wisely

Spark is fast. It processes data easily across multiple nodes in a cluster or on your laptop. Spark also loves memory: heavy use of memory is a key design choice behind Spark’s performance. However, as your datasets grow from the samples you use to develop applications to production-size datasets, you may notice that performance degrades.

In this chapter, you’ll get some foundational knowledge about how Spark uses memory. This knowledge will help you optimize your applications.

You will first use caching and checkpointing in an application with dummy data. This step will help you better understand the various modes you can use to optimize your applications.

You will then switch to a real-life example with real-life data. In this second lab, you will run analytical operations against a dataset containing economic information from Brazil.

Finally, you will read about other considerations when optimizing workloads. I will also share some hints on increasing performance, as well as pointers for going further.


Examples from this chapter are available on GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch16.

16.1 Caching and checkpointing can increase performance

16.1.1 The usefulness of Spark caching

16.1.2 The subtle effectiveness of Spark checkpointing

16.1.3 Using caching and checkpointing

16.2 Caching in action

16.3 Going further in performance optimization