This chapter covers
Spark is fast. It processes data easily across multiple nodes in a cluster or on your laptop. Spark also loves memory, and its heavy use of memory is a key design choice behind its performance. However, as your datasets grow from the samples you use during development to production-size datasets, you may notice performance degrading.
In this chapter, you’ll get some foundational knowledge about how Spark uses memory. This knowledge will help you optimize your applications.
You will first use caching and checkpointing in an application with dummy data. This step will help you better understand the various modes you can use to optimize your applications.
You will then switch to a real-life example with real-life data. In this second lab, you will run analytical operations against a dataset containing economic information from Brazil.
Finally, you will read about other considerations when optimizing workloads. I also share some hints on increasing performance as well as pointers to go further.
Lab
Examples from this chapter are available on GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch16.