Chapter 10. Running Spark

 

This chapter covers

  • Spark runtime components
  • Spark cluster types
  • Job and resource scheduling
  • Configuring Spark
  • Spark web UI
  • Running Spark on the local machine

In previous chapters, we mentioned different ways to run Spark. In this and the next two chapters, we'll discuss ways to set up a Spark cluster. A Spark cluster is a set of interconnected processes, usually running in a distributed manner on different machines. The main cluster types that Spark runs on are YARN, Mesos, and Spark standalone. Two other runtime options, local mode and local cluster mode, are the easiest and quickest ways to set up Spark, but they're used mainly for testing purposes. Local mode is a pseudo-cluster running on a single machine, and local cluster mode is a Spark standalone cluster that's also confined to a single machine. If all this sounds confusing, don't worry. We'll explain these concepts in this chapter one step at a time.
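Which of these runtime modes you get is ultimately determined by the master URL you give Spark. The following sketch shows, in a hedged way, the master URL forms corresponding to the modes just mentioned; the host names, port numbers, and resource figures are placeholder examples, not recommendations:

import org.apache.spark.{SparkConf, SparkContext}

// Each master URL selects a different runtime mode.
val conf = new SparkConf().setAppName("master-url-demo")

// Local mode: a pseudo-cluster inside a single JVM, using all available cores.
conf.setMaster("local[*]")

// Local cluster mode: a standalone-like cluster confined to one machine
// (here: 2 workers, 1 core and 1024 MB of memory per worker).
// conf.setMaster("local-cluster[2,1,1024]")

// Spark standalone cluster (placeholder host name).
// conf.setMaster("spark://master-host:7077")

// YARN: the cluster is located through the Hadoop configuration files.
// conf.setMaster("yarn")

// Mesos (placeholder host name).
// conf.setMaster("mesos://mesos-master:5050")

val sc = new SparkContext(conf)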

In this chapter, we'll also describe common elements of the Spark runtime architecture that apply to all Spark cluster types. For example, driver and executor processes, as well as Spark context and scheduler objects, are common to all Spark runtime modes. Job and resource scheduling also work similarly on all cluster types, as do the usage and configuration of the Spark web UI, which you use to monitor the execution of Spark jobs.
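As a minimal sketch of that common ground, the driver-side setup below is the same regardless of cluster type; only the master URL and the values of the spark.* properties change. The application name and property values shown here are arbitrary examples:

import org.apache.spark.{SparkConf, SparkContext}

// Driver-side setup shared by all runtime modes.
val conf = new SparkConf()
  .setAppName("runtime-demo")          // appears in the Spark web UI
  .set("spark.ui.port", "4040")        // the web UI's default port, set explicitly here
  .set("spark.scheduler.mode", "FIFO") // job-scheduling policy within this application

// Creating the SparkContext starts the scheduler and the web UI.
val sc = new SparkContext(conf)

// ... submit jobs through sc and monitor them at http://<driver-host>:4040 ...

sc.stop()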

10.1. An overview of Spark’s runtime architecture

10.2. Job and resource scheduling

10.3. Configuring Spark

10.4. Spark web UI

10.5. Running Spark on the local machine

10.6. Summary
