Chapter 10. Running Spark
This chapter covers
- Spark runtime components
- Spark cluster types
- Job and resource scheduling
- Configuring Spark
- Spark web UI
- Running Spark on the local machine
In previous chapters, we mentioned different ways to run Spark. In this and the next two chapters, we’ll discuss how to set up a Spark cluster. A Spark cluster is a set of interconnected processes, usually running in a distributed manner on different machines. The main cluster types Spark runs on are YARN, Mesos, and Spark standalone. Two other runtime options, local mode and local cluster mode, are the easiest and quickest ways to set up Spark, but they’re used mainly for testing purposes. Local mode is a pseudo-cluster running in a single JVM on a single machine, and local cluster mode is a Spark standalone cluster that’s also confined to a single machine. If all this sounds confusing, don’t worry. We’ll explain these concepts in this chapter one step at a time.
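The master URL you give Spark is what selects the runtime mode, so the same application code can run in any of them. The following minimal Scala sketch shows how local mode and local cluster mode are chosen; the application name and the example computation are ours, and the local-cluster parameters (number of executors, cores, and memory per executor) are just illustrative values.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// A minimal sketch: the master URL selects the runtime mode.
object LocalModesExample {
  def main(args: Array[String]): Unit = {
    // Local mode: a pseudo-cluster inside a single JVM, using as many
    // worker threads as there are CPU cores.
    val conf = new SparkConf()
      .setAppName("local-modes-example")   // hypothetical app name
      .setMaster("local[*]")

    // Local cluster mode: a Spark standalone cluster confined to one machine.
    // Format: local-cluster[<executors>,<cores per executor>,<memory per executor in MB>]
    // .setMaster("local-cluster[2,1,1024]")

    val sc = new SparkContext(conf)
    println(sc.parallelize(1 to 100).sum())   // quick sanity check
    sc.stop()
  }
}
```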
In this chapter, we’ll also describe the elements of the Spark runtime architecture that are common to all Spark cluster types. For example, driver and executor processes, as well as the Spark context and scheduler objects, exist in every Spark runtime mode. Job and resource scheduling work similarly on all cluster types, as do the usage and configuration of the Spark web UI, which you use to monitor the execution of Spark jobs.
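Because those elements are shared, configuration also works the same way regardless of the cluster type. As a rough illustration (the values here are arbitrary examples, not recommendations), you could set a few common runtime options through SparkConf, and the same settings would apply whether the application later runs locally, on a standalone cluster, on YARN, or on Mesos:

```scala
import org.apache.spark.SparkConf

// A sketch of configuration options that behave the same on every cluster type.
// The values are illustrative only.
val conf = new SparkConf()
  .setAppName("common-config-example")     // hypothetical app name
  .set("spark.executor.memory", "2g")      // memory per executor
  .set("spark.ui.port", "4040")            // port of the Spark web UI (4040 is the default)
  .set("spark.scheduler.mode", "FAIR")     // job scheduling policy within the application
```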