Chapter 3. Writing Spark applications

 

This chapter covers

  • Generating a new Spark project in Eclipse
  • Loading a sample dataset from the GitHub archive
  • Writing an application that analyzes GitHub logs
  • Working with DataFrames in Spark
  • Submitting your application to be executed

In this chapter, you’ll learn to write Spark applications. Most Spark programmers use an integrated development environment (IDE), such as IntelliJ IDEA or Eclipse. Plenty of online resources describe how to use IntelliJ IDEA with Spark, whereas Eclipse resources are still hard to come by. That is why this chapter teaches you how to write Spark programs in Eclipse. Nevertheless, if you prefer to stick with IntelliJ, you’ll still be able to follow along; the two IDEs offer similar sets of features.

You’ll start by downloading and configuring Eclipse and then installing the Eclipse plug-ins necessary for working with Scala. In this chapter you’ll use Apache Maven (a software project-management tool) to configure Spark application projects; the Spark project itself is built with Maven. We’ve prepared a Maven archetype (a template for quickly bootstrapping Maven projects) in the book’s GitHub repository at https://github.com/spark-in-action, which will help you generate a new Spark application project in just a few clicks.
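If you prefer the command line to an IDE wizard, a Maven archetype can also be applied directly with `mvn archetype:generate`. The sketch below uses placeholder coordinates, not the archetype's actual ones; substitute the group, artifact, and version values published in the book's repository:

```shell
# Generate a new project from a Maven archetype in batch (non-interactive) mode.
# All archetype* and project coordinates below are placeholders -- replace them
# with the values from the book's GitHub repository.
mvn archetype:generate -B \
    -DarchetypeGroupId=org.example \
    -DarchetypeArtifactId=spark-app-archetype \
    -DarchetypeVersion=1.0 \
    -DgroupId=com.mycompany \
    -DartifactId=my-spark-app \
    -Dversion=0.0.1-SNAPSHOT
```

Running this creates a `my-spark-app` directory containing a `pom.xml` and a source tree, which you can then import into Eclipse or IntelliJ as an existing Maven project.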

3.1. Generating a new Spark project in Eclipse

3.2. Developing the application

3.3. Submitting the application

3.4. Summary
