Part 1. The theory crippled by awesome examples

 

As with any technology, you need to understand a bit of the "boring" theory before you can dive deep into using it. I have managed to contain this part to six chapters, which will give you a good overview of the concepts, explained through examples.

Chapter 1 is an overall introduction with a simple example. You will learn why Spark is not just a simple set of tools, but a real distributed analytics operating system. After this first chapter, you will be able to run a simple data ingestion in Spark.

Chapter 2 will show you how Spark works at a high level. You'll build a mental model of Spark's components (a representation of your own thought process) step by step. This chapter's lab shows you how to export data into a database. The chapter contains a lot of illustrations, which should make your learning process easier than working from words and code alone!

Chapter 3 takes you to a whole new dimension: discovering the powerful dataframe, which combines both the API and storage capabilities of Spark. In this chapter's lab, you'll load two datasets and union them.

Chapter 4 celebrates laziness and explains why Spark uses lazy evaluation. You'll learn about the directed acyclic graph (DAG) and compare Spark with an RDBMS. The lab teaches you how to start manipulating data by using the dataframe API.