chapter four

4. Fundamentally lazy

This chapter covers

Using Spark’s efficient laziness to your benefit
Building a data application the traditional way vs. the Spark way
Building great data-centric applications using Spark
Learning more about transformations and actions
Using Catalyst, Spark’s built-in optimizer
Introducing directed acyclic graphs

This chapter is not only about celebrating laziness. It also teaches, through examples and experiments, the fundamental differences between building a data application the traditional way and building one with Spark.

There are at least two kinds of laziness: sleeping under the trees when you’ve committed to doing something else, and thinking ahead in order to do your job in the smartest possible way. Although, at this precise moment, my mind is on lying in the shade of a tree, largely inspired by Asterix in Corsica, in this chapter I will show how Spark makes your life easier by optimizing its workload. You will learn about the essential roles of transformations (each step of the data process) and actions (the trigger to get the work done).
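To make the split between transformations and actions concrete before touching Spark itself, here is a minimal sketch in plain Python. It is only an analogy: the `LazyPipeline` class and its `map` and `collect` methods are hypothetical names invented for this illustration, not Spark's API, and the sketch does no optimization at all. It shows one thing: recording a step costs nothing, and the work happens only when an action asks for a result.

```python
class LazyPipeline:
    """Toy stand-in for a lazily evaluated dataset (hypothetical, not Spark's API)."""

    def __init__(self, rows, steps=()):
        self._rows = rows
        self._steps = tuple(steps)

    def map(self, fn):
        # Transformation: remember the step and return a new pipeline.
        # No row is read or modified here.
        return LazyPipeline(self._rows, self._steps + (fn,))

    def collect(self):
        # Action: only now is every recorded step applied to every row.
        result = []
        for row in self._rows:
            for fn in self._steps:
                row = fn(row)
            result.append(row)
        return result


calls = []  # records each time real work is done


def double(x):
    calls.append(x)
    return x * 2


# Two transformations are declared; nothing has run yet.
pipeline = LazyPipeline([1, 2, 3]).map(double).map(lambda x: x + 1)
work_before_action = len(calls)

# The action triggers the whole recipe in one pass.
result = pipeline.collect()
work_after_action = len(calls)
```

In this sketch, `work_before_action` stays at zero because `map` only stores the function; `double` is first called inside `collect`. Having the full recipe in hand before any row is processed is what gives an engine the chance to think ahead about how to run it.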

You will work on a real dataset from the US National Center for Health Statistics. The application is designed to illustrate the reasoning that Spark goes through when it processes data. The chapter focuses on only one application, but it contains three execution modes, which correspond to three experiments that you will run to get a better sense of Spark’s “way of thinking.”

4.1 A real-life example of efficient laziness

4.2 A Spark example of efficient laziness

4.2.1 Looking at the results of transformations and actions

4.2.2 The transformation process, step by step

4.2.3 The code behind the transformation/action process

4.2.4 The mystery behind the creation of 7 million datapoints in 182 ms

4.2.5 The mystery behind the timing of actions

4.3 Comparing to RDBMS and traditional applications

4.3.1 Working with the teen birth rates dataset

4.3.2 Analyzing differences between a traditional app and a Spark app

4.4 Spark is amazing for data-focused applications

4.5 Catalyst is your app catalyzer

Summary