Chapter 2. Spark fundamentals
This chapter covers
- Exploring the spark-in-action VM
- Managing multiple Spark versions
- Getting to know Spark’s command line interface (spark-shell)
- Playing with simple examples in spark-shell
- Exploring RDD actions, transformations, and double functions
It’s finally time to get down to business. In this chapter, you’ll start using the VM we prepared for you and write your first Spark programs. All you need is a laptop or desktop machine with a working internet connection and the prerequisites described in chapter 1.
To avoid overwhelming you this early in the book with the various options for running Spark, for now you’ll use a so-called Spark standalone local cluster. Standalone means Spark uses its own cluster manager (rather than Mesos or Hadoop’s YARN); local means the whole system runs on a single machine, in this case your laptop or desktop. We’ll cover Spark’s running modes and deployment options extensively in the second part of the book. Strap in: things are about to get real!
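To make "local" a bit more concrete, here is a minimal sketch of starting the shell against a local master from a terminal. It assumes Spark's bin directory is on your PATH; the exact invocation used inside the spark-in-action VM is described later in the chapter. The `local[*]` master URL simply means "run everything on this machine, using all available CPU cores."

```
# Start spark-shell locally, using all available CPU cores
spark-shell --master local[*]
```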
Rest assured, we don’t assume any prior Spark or Scala knowledge; in this chapter, you’ll start slowly and progress step by step, tutorial style, through setting up the prerequisites, downloading and installing Spark, and playing with simple code examples in spark-shell (Spark’s interactive command-line shell).
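As a small taste of what those sessions look like, the following sketch shows a few lines you could type into spark-shell once it is running. Here `sc` is the SparkContext the shell creates for you; the values and names are purely illustrative, not the exact listings you’ll build later in the chapter.

```scala
val nums = sc.parallelize(1 to 100)   // create an RDD from a local Scala collection
val doubled = nums.map(_ * 2.0)       // transformation: lazily describes a new RDD
doubled.count()                       // action: triggers the computation and returns 100
doubled.mean()                        // double function: the average of the values, 101.0
```

Don’t worry about the terminology yet; transformations, actions, and double functions are exactly what this chapter explains.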