In this chapter, you will build a mental model of Apache Spark. A mental model is an explanation of how something works in the real world, expressed through your own thought process and supported by diagrams. The goal of this chapter is to help you form your own ideas about the thought process I will walk you through. I will use a lot of diagrams and some code. It would be extremely pretentious to claim to build the one and only Spark mental model; instead, this model describes a typical scenario involving loading, processing, and saving data. You will also walk through the Java code for these operations.
The scenario you will follow involves loading a CSV file in a distributed way, performing a small operation on the data, and saving the result in a PostgreSQL database (and in Apache Derby). You do not need to know or install PostgreSQL to understand the example. If you are familiar with other RDBMSs and with Java, you will easily adapt to this example. Appendix F provides additional help with relational databases (tips, installation, links, and more).
Code and sample data are available on GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch02.