chapter one

1. So, what is Spark, anyway?

This chapter covers

What Apache Spark is and its use cases
Basics of distributed technology
The four pillars of Spark
Storage and APIs: love the dataframe

When I was a kid in the 1980s, discovering programming through Basic and my Atari, I could not understand why we could not automate basic law enforcement activities such as speed control, traffic-light violations, and parking meters. Everything seemed pretty easy: the book I had said that to be a good programmer, you should avoid GOTO statements. And that’s what I did, trying to structure my code from the age of 12. However, there was no way I could imagine the volume of data (and the booming Internet of Things, or IoT) while I was developing my Monopoly-like game. As my game fit into 64 KB of memory, I definitely had no clue that datasets would become bigger (by a ginormous factor) or that the data would have a speed, or velocity, as I was patiently waiting for my game to be saved on my Atari 1010 tape recorder.

1.1 The big picture: What Spark is and what it does

1.1.1 What is Spark?

1.1.2 The four pillars of mana

1.2 How can you use Spark?

1.2.1 Spark in a data processing/engineering scenario

1.2.2 Spark in a data science scenario

1.3 What can you do with Spark?

1.3.1 Spark predicts restaurant quality at NC eateries

1.3.2 Spark allows fast data transfer for Lumeris

1.3.3 Spark analyzes equipment logs for CERN

1.3.4 Other use cases

1.4 Why you will love the dataframe

1.4.1 The dataframe from a Java perspective

1.4.2 The dataframe from an RDBMS perspective

1.4.3 A graphical representation of the dataframe

1.5 Your first example

1.5.1 Recommended software

1.5.2 Downloading the code

1.5.3 Running your first application

1.5.4 Your first code

Summary