2. Architecture and flow

This chapter covers

  • Building a mental model of Spark for a typical use case
  • Understanding the associated Java code
  • Exploring the general architecture of a Spark application
  • Understanding the flow of data

In this chapter, you will build a mental model of Apache Spark. A mental model is an explanation of how something works in the real world, built from your own thought process and supported by diagrams. The goal of this chapter is to help you form your own ideas as I walk you through that thought process, using a lot of diagrams and some code. It would be extremely pretentious to claim there is a single definitive Spark mental model; the one in this chapter describes a typical scenario involving loading, processing, and saving data. You will also walk through the Java code for these operations.

The scenario you will follow involves the distributed loading of a CSV file, a small operation on its data, and saving the result in a PostgreSQL database (and in Apache Derby). You do not need to know or install PostgreSQL to understand the example. If you are familiar with other RDBMSs and with Java, you will easily adapt to this example. Appendix F provides additional help with relational databases (tips, installation, links, and more).
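
Before any Spark API enters the picture, the three steps of the scenario (ingest, transform, save) can be sketched in plain Java. What follows is a single-JVM analogy, not Spark code and not the chapter's application: the class name `FlowSketch`, the `lname` and `fname` columns, the sample rows, and the in-memory list standing in for the database table are all assumptions made for illustration. In Spark the loading is distributed; here everything runs in one thread.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java analogy of the ingest -> transform -> save flow. This is NOT Spark code. */
final class FlowSketch {

  /** Step 1, ingest: parse CSV text (header line first) into rows keyed by column name. */
  static List<Map<String, String>> ingest(String csv) {
    String[] lines = csv.strip().split("\\R");
    String[] header = lines[0].split(",");
    List<Map<String, String>> rows = new ArrayList<>();
    for (int i = 1; i < lines.length; i++) {
      // Naive split: quoted fields containing commas are not supported.
      String[] cells = lines[i].split(",", -1);
      Map<String, String> row = new LinkedHashMap<>();
      for (int c = 0; c < header.length; c++) {
        row.put(header[c].trim(), c < cells.length ? cells[c].trim() : "");
      }
      rows.add(row);
    }
    return rows;
  }

  /** Step 2, transform: add a "name" column built from the hypothetical "lname" and "fname". */
  static List<Map<String, String>> transform(List<Map<String, String>> rows) {
    List<Map<String, String>> out = new ArrayList<>();
    for (Map<String, String> row : rows) {
      Map<String, String> copy = new LinkedHashMap<>(row);
      copy.put("name", row.get("lname") + ", " + row.get("fname"));
      out.add(copy);
    }
    return out;
  }

  /** Step 3, save: append the rows to a list that stands in for a database table. */
  static void save(List<Map<String, String>> rows, List<Map<String, String>> table) {
    table.addAll(rows);
  }

  public static void main(String[] args) {
    // Illustrative sample data, not the chapter's CSV file.
    String csv = "lname,fname\nPascal,Blaise\nVoltaire,Francois\n";
    List<Map<String, String>> table = new ArrayList<>();
    save(transform(ingest(csv)), table);
    table.forEach(System.out::println);
  }
}
```

Running `main` prints one map per saved row. Keeping each step in its own method mirrors the way the scenario is described: data comes in, one small operation is applied, and the result is written out.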


Code and sample data are available on GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch02.

2.1 Building your mental model

2.2 Using Java code to build your mental model

2.3 Walking through your application

2.3.1 Connecting to a master

2.3.2 Loading, or ingesting, the CSV file

2.3.3 Transforming your data

2.3.4 Saving the work done in your dataframe to a database