Chapter 9. Performance and monitoring
This chapter covers
- Monitoring Spark applications
- Performance-related configuration options
- Tuning your application for maximum performance
- Using graph partitioning to boost large-scale processing
Most of the examples we’ve looked at so far have been small-scale: they run on a single machine and complete their processing without requiring a large amount of computing resources. But one of the main reasons to use Apache Spark is to take advantage of its distributed processing model. Spark’s ability to spread data and computation across a cluster of many machines is what allows it to run the kinds of processing we’ve discussed on large datasets.
Once you have a cluster with plenty of resources and have installed Apache Spark, getting your Spark application to run on a large dataset is still likely to require some planning, configuration, and possibly some troubleshooting. In this chapter, we take you through the steps needed to run your application successfully and show where to find troubleshooting information if things don’t go according to plan. Along the way, we’ll give you a deeper understanding of the Spark processing model, which is essential for knowing which of the many configuration “knobs” need twiddling.
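As a taste of the configuration work ahead, here is a minimal sketch of setting two well-known Spark properties programmatically through SparkConf. The application name and the specific values (such as 4g) are illustrative placeholders, not tuning recommendations.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative placeholders only; appropriate values depend on your cluster.
val conf = new SparkConf()
  .setAppName("perf-demo")                        // hypothetical app name
  .set("spark.executor.memory", "4g")             // memory per executor process
  .set("spark.serializer",
    "org.apache.spark.serializer.KryoSerializer") // faster serialization

val sc = new SparkContext(conf)
```

The same properties can also be supplied on the command line, for example spark-submit --conf spark.executor.memory=4g, which keeps tuning decisions out of the application code.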