When I was a kid in the 1980s, discovering programming through Basic and my Atari, I could not understand why we could not automate basic law enforcement activities such as speed control, traffic-light violations, and parking meters. Everything seemed pretty easy: the book I had said that to be a good programmer, you should avoid GOTO
statements. And that’s what I did, trying to structure my code from the age of 12. However, there was no way I could imagine the volume of data (and the booming Internet of Things, or IoT) while I was developing my Monopoly-like game. As my game fit into 64 KB of memory, I definitely had no clue that datasets would become bigger (by a ginormous factor) or that the data would have a speed, or velocity, as I was patiently waiting for my game to be saved on my Atari 1010 tape recorder.
A short 35 years later, all those use cases I imagined seem accessible (and my game, futile). Data has been growing at a faster pace than the hardware technology to support it.1 A cluster of smaller computers can cost less than one big computer. Memory is cheaper by half compared to 2005, and memory in 2005 was five times cheaper than in 2000.2 Networks are many times faster, and modern datacenters offer speeds of up to 100 gigabits per second (Gbps), nearly 2,000 times faster than your home Wi-Fi from five years ago. These were some of the factors that drove people to ask this question: How can I use distributed memory computing to analyze large quantities of data?
When you read the literature or search the web for information about Apache Spark, you may find that it is a tool for big data, a successor to Hadoop, a platform for doing analytics, a cluster-computer framework, and more. Que nenni! 3
Lab
The lab in this chapter is available on GitHub at https://github.com/jgperrin/net.jgp.books.spark.ch01. This is lab #400. If you are not familiar with GitHub and Eclipse, appendixes A, B, C, and D provide guidance.

As the Little Prince would say to Antoine de Saint-Exupéry, Draw me a Spark. In this section, you will first look at what Spark is, and then at what Spark can do through several use cases. This first section concludes by describing how Spark is integrated as a software stack and used by data scientists.
Spark is more than just a software stack for data scientists. When you build applications, you build them on top of an operating system, as illustrated in figure 1.1. The operating system provides services to make your application development easier; in other words, you are not building a filesystem or network driver for each application you develop.
Figure 1.1 When you write applications, you use services offered by the operating system, which abstracts you from the hardware.

With the need for more computing power came an increased need for distributed computing. With the advent of distributed computing, a distributed application had to incorporate those distribution functions. Figure 1.2 shows the increased complexity of adding more components to your application.
Figure 1.2 One way to write distributed data-oriented applications is to embed all controls at the application level, using libraries or other artifacts. As a result, the applications become fatter and more difficult to maintain.

Having said all that, Apache Spark may appear like a complex system that requires you to have a lot of prior knowledge. I am convinced that you need only Java and relational database management system (RDBMS) skills to understand, use, build applications with, and extend Spark.
Applications have also become smarter, producing reports and performing data analysis (including data aggregation, linear regression, or simply displaying donut charts). Therefore, when you want to add such analytics capabilities to your application, you have to link libraries or build your own. All this makes your application bigger (or fatter , as in a fat client), harder to maintain, more complex, and, as a consequence, more expensive for the enterprise.
“So why wouldn’t you put those functionalities at the operating system level?” you may ask. The benefits of putting those features at a lower level, like the operating system, are numerous and include the following:
- Provides a standard way to deal with data (a bit like Structured Query Language, or SQL, for relational databases).
- Lowers the cost of development (and maintenance) of applications.
- Enables you to focus on understanding how to use the tool, not on how the tool works. (For example, Spark performs distributed ingestion, and you can learn how to benefit from that without having to fully grasp the way Spark accomplishes the task.)
And this is exactly what Spark has become for me: an analytics operating system . Figure 1.3 shows this simplified stack.
Figure 1.3 Apache Spark simplifies the development of analytics-oriented applications by offering services to applications, just as an operating system does.
In this chapter, you’ll discover a few use cases of Apache Spark for different industries and various project sizes. These examples will give you a small overview of what you can achieve.
I am a firm believer that, to get a better understanding of where we are, we should look at history. And this applies to information technology (IT) too: read appendix E if you want my take on it.
Now that the scene is set, you will dig into Spark. We will start from a global overview, have a look at storage and APIs, and, finally, work through your first example.
According to Polynesians, mana is the power of the elemental forces of nature embodied in an object or person. This definition fits the classic diagram you will find in all Spark documentation, showing four pillars bringing these elemental forces to Spark: Spark SQL, Spark Streaming, Spark MLlib (for machine learning), and GraphX sitting on top of Spark Core. Although this is an exact representation of the Spark stack, I find it limiting. The stack needs to be extended to show the hardware, the operating system, and your application, as in figure 1.4.
Figure 1.4 Your application, as well as other applications, are talking to Spark’s four pillars--SQL, streaming, machine learning, and graphs--via a unified API. Spark shields you from the operating system and the hardware constraints: you will not have to worry about where your application is running or if it has the right data. Spark will take care of that. However, your application can still access the operating system or hardware if it needs to.
Of course, the cluster(s) where Spark is running may not be used exclusively by your application, but your work will use the following:
- Spark SQL to run data operations, like traditional SQL jobs in an RDBMS. Spark SQL offers APIs and SQL to manipulate your data; a short code sketch after this list gives you a first taste. You will discover Spark SQL in chapter 11 and read more about it in most of the chapters after that. Spark SQL is a cornerstone of Spark.
- Spark Streaming, and specifically Spark structured streaming, to analyze streaming data. Spark’s unified API will help you process your data in a similar way, whether it is streamed data or batch data. You will learn the specifics about streaming in chapter 10.
- Spark MLlib for machine learning and recent extensions in deep learning. Machine learning, deep learning, and artificial intelligence deserve their own book.
- GraphX to exploit graph data structures. To learn more about GraphX, you can read Spark GraphX in Action by Michael Malak and Robin East (Manning, 2016).
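As announced in the Spark SQL bullet above, here is a minimal sketch of the first pillar seen through the unified API: the same data is manipulated once through the dataframe API and once through plain SQL. This is not one of the book’s labs; it simply reuses the books.csv file and the columns (authorId, title, releaseDate) that you will see in the output at the end of this chapter.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkSqlFirstTaste {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("A first taste of Spark SQL")
        .master("local")
        .getOrCreate();

    // Ingest a CSV file into a dataframe.
    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv");

    // The same data, manipulated through the dataframe API...
    df.filter(df.col("authorId").equalTo("1")).show();

    // ...and through plain SQL, after registering a temporary view.
    df.createOrReplaceTempView("books");
    spark.sql("SELECT title, releaseDate FROM books WHERE authorId = 1").show();

    spark.stop();
  }
}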

In this section, you’ll take a detailed look at how you can use Apache Spark by focusing on typical data processing scenarios as well as a data science scenario. Whether you are a data engineer or a data scientist, you will be able to use Apache Spark in your job.
Spark can process your data in a number of different ways. But it excels when it plays in a big data scenario, where you ingest data, clean it, transform it, and republish it.
I like to see data engineers as data preparers and data logisticians. They make sure the data is available, that the data quality rules are applied successfully, that the transformations are executed successfully, and that the data is available to other systems or departments, including business analysts and data scientists. Data engineers can also be the ones taking the work of the data scientists and industrializing it.
Spark is a perfect tool for data engineers. The four steps of a typical Spark (big data) scenario performed by data engineering are as follows (a short code sketch after the list shows how they fit together):
Figure 1.5 Spark in a typical data processing scenario. The first step is ingesting the data. At this stage, the data is raw; you may next want to apply some data quality (DQ). You are now ready to transform your data. Once you have transformed your data, it is richer. It is time to publish or share it so people in your organization can perform actions on it and make decisions based on it.

- Ingesting data --Spark can ingest data from a variety of sources (see chapters 7, 8, and 9 on ingestion). If you can’t find a supported format, you can build your own data sources. I call data at this stage raw data. You can also find this zone named the staging, landing, bronze, or even swamp zone.
- Improving data quality (DQ) --Before processing your data, you may want to check the quality of the data itself. An example of DQ is to ensure that all birth dates are in the past. As part of this process, you can also elect to obfuscate some data: if you are processing Social Security numbers (SSNs) in a health-care environment, you can make sure that the SSNs are not accessible to developers or nonauthorized personnel.4 After your data is refined, I call this stage the pure data zone. You may also find this zone called the refinery, silver, pond, sandbox, or exploration zone.
- Transforming data --The next step is to process your data. You can join it with other datasets, apply custom functions, perform aggregations, implement machine learning, and more. The goal of this step is to get rich data, the fruit of your analytics work. Most of the chapters discuss transformation. This zone may also be called the production, gold, refined, lagoon, or operationalization zone.
- Loading and publishing --As in an ETL process,5 you can finish by loading the data into a data warehouse, using a business intelligence (BI) tool, calling APIs, or saving the data in a file. The result is actionable data for your enterprise.
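The following minimal sketch (not one of the book’s labs) strings these four steps together in Java. The file name, column names, and the data quality rule are assumptions made purely for illustration.

import static org.apache.spark.sql.functions.avg;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FourStepsSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Ingest, DQ, transform, publish")
        .master("local")
        .getOrCreate();

    // Step 1 - Ingest: raw data lands in a dataframe (hypothetical inspections.csv).
    Dataset<Row> raw = spark.read().format("csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load("data/inspections.csv");

    // Step 2 - Data quality: keep only inspections whose date is in the past (pure data).
    Dataset<Row> pure = raw.filter("inspectionDate < current_date()");

    // Step 3 - Transform: aggregate to obtain rich data, here an average score per county.
    Dataset<Row> rich = pure.groupBy("county").agg(avg("score"));

    // Step 4 - Load and publish: save the result so other systems and people can use it.
    rich.write().mode("overwrite").format("parquet").save("output/county_scores");

    spark.stop();
  }
}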
Data scientists have a slightly different approach than software engineers or data engineers, as data scientists focus on the transformation part, in an interactive manner. For this purpose, data scientists use different tools, such as notebooks. Names of notebooks include Jupyter, Zeppelin, IBM Watson Studio, and Databricks Runtime.
How data scientists work will definitely matter to you, as data science projects will consume enterprise data, and therefore you may end up delivering data to the data scientists, off-loading their work (such as machine learning models) into enterprise data stores, or industrializing their findings.
Therefore, a UML-like sequence diagram, as in figure 1.6, will explain a little better how data scientists use Spark.
- PySpark in Action by Jonathan Rioux (Manning, 2020, www.manning.com/books/pyspark-in-action?a_aid=jgp).
- Mastering Large Datasets with Python by John T. Wolohan (Manning, 2020, www.manning.com/books/mastering-large-datasets-with-python?a_aid=jgp).
Figure 1.6 Sequence diagram for a data scientist using Spark: the user “talks” to the notebook, which calls Spark when needed. Spark directly handles ingestion. Each square represents a step, and each arrow represents a sequence. The diagram should be read chronologically, starting from the top.

In the use case described in figure 1.6, the data is loaded in Spark, and then the user will play with it, apply transformations, and display part of the data. Displaying the data is not the end of the process. The user will be able to continue in an interactive manner, as in a physical notebook, where you write recipes, take notes, and so on. At the end, the notebook user can save the data to files or databases, or produce (interactive) reports.

Spark is used in various kinds of projects, so let’s explore a handful of them. All use cases involve data that cannot fit or be processed on a single computer (aka big data) and therefore require a cluster of computers--hence the need for a distributed operating system, specialized in analytics.
The definition of big data has evolved over time, from data with characteristics known as the five Vs6 to “data that cannot fit on a single computer.” I dislike this definition; as you probably know, many RDBMSs split data over several servers. As with many concepts, you may have to make your own definition. This book will hopefully help you.
For me, big data is the collection of datasets, available everywhere in the enterprise, aggregated in a single location, on which you can run basic analytics to more advanced analytics, like machine and deep learning. Those bigger datasets can become the basis for artificial intelligence (AI). Technologies, size, or number of computers are irrelevant to this concept.
Spark, through its analytics features and natively distributed architecture, can address big data, whether or not you think it is big, or whether it fits in one or many computers. Simply remember that the traditional report output on a 132-column dot-matrix printer is not a typical use case for Spark. Let’s discover a few real-world examples.
In most of the United States, restaurants require inspections by local health departments in order to operate and are graded based on these inspections. A higher grade does not signify better food, but it might give an indication of whether you are going to die after you have BBQ in a certain shack on your trip to the South. Grades measure the cleanliness of the kitchen, how safely the food is stored, and many more criteria to (hopefully) avoid food-borne illnesses.
In North Carolina, restaurants are graded on a scale of 0 to 100. Each county offers access to the restaurant’s grade, but there is no central location for accessing the information statewide.
NCEatery.com is a consumer-oriented website that lists restaurants with their inspection grades over time. The ambition of NCEatery.com is to centralize this information and to run predictive analytics on restaurants to see if we can discover patterns in restaurant quality. Is this place I loved two years ago going downhill?
In the backend of the website, Apache Spark ingests datasets of restaurants, inspections, and violations data coming from different counties, crunches the data, and publishes a summary on the website. During the crunching phase, several data quality rules are applied, as well as machine learning to try to project inspections and scores. Spark processes 1.6 × 10²¹ datapoints and publishes about 2,500 pages every 18 hours using a small cluster. This ongoing project is in the process of onboarding more NC counties.
Lumeris is an information-based health-care services company, based in St. Louis, Missouri. It has traditionally helped health-care providers get more insights from their data. The company’s state-of-the-art IT system needed a boost to accommodate more customers and drive more powerful insights from the data it had.
At Lumeris, as part of the data engineering processes, Apache Spark ingests thousands of comma-separated values (CSV) files stored on Amazon Simple Storage Service (S3), builds health-care-compliant HL7 FHIR resources,7 and saves them in a specialized document store where they can be consumed by both the existing applications and a new generation of client applications.
This technology stack allows Lumeris to continue its growth, both in terms of processed data and applications. Down the road, with the help of this technology, Lumeris aims to save lives.
CERN, or the European Organization for Nuclear Research, was founded in 1954. It is home to the Large Hadron Collider (LHC), a 27-kilometer ring located 100 meters under the border between France and Switzerland, in Geneva.
The giant physics experiments run there generate 1 petabyte (PB) of data per second. After significant filtering, the data is reduced to 900 GB per day.
After experiments with Oracle, Impala, and Spark, the CERN team designed the Next CERN Accelerator Logging Service (NXCALS) around Spark on an on-premises cloud running OpenStack with up to 250,000 cores. The consumers of this impressive architecture are scientists (through custom applications and Jupyter notebooks), developers, and apps. The ambition for CERN is to onboard even more data and increase the overall velocity of data processing.
Other use cases include the following:
- Building interactive data-wrangling tools, such as IBM’s Watson Studio and Databricks’ notebooks
- Monitoring the quality of video feeds for TV channels like MTV or Nickelodeon8
- Monitoring online video game players for bad behavior and adjusting the player interaction in quasi real-time to maximize all players’ positive experiences, via the company Riot Games

In this section, my goal is to make you love the dataframe. You will learn just enough to want to discover more, which you will do as you explore deeper in chapter 3 and throughout the book. A dataframe is both a data container and an API.
The concept of a dataframe is essential to Spark. However, the concept is not difficult to understand. You will use dataframes all the time. In this section, you will look at what a dataframe is from a Java (software engineer) and an RDBMS (data engineer) perspective. Once you are familiar with these analogies, I will wrap up with a diagram.
In most literature, you will find a different spelling for dataframe: DataFrame. I decided to settle for the most English way of writing it, which, I concur, may be odd for a French native. Nevertheless, and despite its majestic grandeur, the dataframe remains a common noun, so no reason to play with uppercase letters here and there. It’s not a burger place!
If your background is in Java and you have some Java Database Connectivity (JDBC) experience, the dataframe will look like a ResultSet. It contains data; it has an API . . .
- You do not browse through it with a next() method.
- Its API is extensible through user-defined functions (UDFs). You can write or wrap existing code and add it to Spark. This code will then be accessible in a distributed mode. You will study UDFs in chapter 16.
- If you want to access the data, you first get the Row and then go through the columns of the row with getters (similar to a ResultSet).
- Metadata is fairly basic, as there are no primary or foreign keys or indexes in Spark.
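To make the getter-style access concrete, here is a minimal sketch; it is not part of the book’s repository and assumes the same books.csv file used in listing 1.1 later in this chapter.

import java.util.List;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class RowGetterSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Row getters")
        .master("local")
        .getOrCreate();

    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv");

    // There is no next(): you take (or collect) rows and iterate over them.
    List<Row> rows = df.takeAsList(3);
    for (Row row : rows) {
      // Columns are accessed by name (or index) through getters, much like a ResultSet.
      String title = row.getAs("title");
      System.out.println(title);
    }
    spark.stop();
  }
}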
If you come more from an RDBMS background, you may find that a dataframe is like a table. The following are differences (a short sketch after the list illustrates a couple of them):
- Data can be nested, as in a JSON or XML document. Chapter 7 describes ingestion of those documents, and you will use those nested constructs in chapter 13.
- You don’t update or delete entire rows; you create new dataframes.
- You can easily add or remove columns.
- There are no constraints, indices, primary or foreign keys, or triggers on the dataframe.
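Here is a small sketch, not part of the book’s repository, that illustrates a couple of these points: columns are easy to add or remove, and every operation produces a new dataframe rather than updating rows in place. The derived label column is purely illustrative.

import static org.apache.spark.sql.functions.concat;
import static org.apache.spark.sql.functions.lit;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class AddRemoveColumnsSketch {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Add and remove columns")
        .master("local")
        .getOrCreate();

    Dataset<Row> df = spark.read().format("csv")
        .option("header", "true")
        .load("data/books.csv");

    // Adding a column returns a new dataframe; the original df is left untouched.
    Dataset<Row> withLabel = df.withColumn("label",
        concat(df.col("title"), lit(" (author #"), df.col("authorId"), lit(")")));

    // Removing a column also produces a new dataframe.
    Dataset<Row> trimmed = withLabel.drop("link");

    trimmed.show(5);
    spark.stop();
  }
}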
The dataframe is a powerful tool you will use throughout the book and your journey with Spark. Its powerful API and storage capabilities make it the key element from which everything radiates. Figure 1.7 shows one way to imagine the API, implementation, and storage.
Figure 1.7 A graphical representation of the dataframe, its implementation in Java (Dataset<Row>), the schema, and partitioned storage. As a developer, you will use the dataset API, which will allow you to manipulate columns and rows. The storage in partitions can also be accessed, mainly for optimization; you will learn more about partitions in chapter 2.


It is time for your first example. Your goal is to run Spark with a simple application that will read a file, store its content in a dataframe, and display the result. You will learn how to set up your working environment, which you will be using throughout the book. You will also learn how to interact with Spark and do basic operations.
As you will discover, most chapters contain dedicated labs, which you can hack to experiment with the code. Each lab has a dataset (as often as possible, a real-life one) as well as one or more code listings.
For this first example, you will do the following:
- Install basic software, which you may already have: Git, Maven, Eclipse.
- Download the code by cloning it from GitHub.
- Execute the example, which will load a basic CSV file and display some rows.
This section provides the list of software you will be using throughout the book. Detailed installation instructions for the required software are in appendixes A and B.
- Apache Spark 3.0.0.
- Mainly macOS Catalina, but examples also run on Ubuntu 14 to 18 and on Windows 10.
- Java 8 (although you won’t use a lot of the constructs introduced in version 8, such as lambda functions). I know that Java 11 is available, but most enterprises are slow at adopting the newer version (and I found Oracle’s recent Java strategy a little confusing). As of now, only Spark v3 is certified on Java 11.
The examples will use either the command line or Eclipse. For the command line, you can use the following:
- Maven: the book uses version 3.5.2, but any recent version will work.
- Git version 2.13.6, but any recent version will work as well. On macOS, you can use the version packaged with Xcode. On Windows, you can download from https://git-scm.com/download/win . If you like graphical user interfaces (GUIs), I highly recommend Atlassian Sourcetree, which you can download from www.sourcetreeapp.com .
Projects use Maven’s pom.xml structure, which can be imported into or used directly by many integrated development environments (IDEs). However, all visual examples will use Eclipse. You could use a version of Eclipse earlier than 4.7.1a (Eclipse Oxygen), but Maven and Git integration were enhanced in the Oxygen release, so I highly recommend that you use at least Oxygen, which is pretty old by now.
The source code is in a public repository on GitHub. The URL of the repository is https://github.com/jgperrin/net.jgp.books.spark.ch01 . Appendix D describes in great detail how to use Git on the command line, and Eclipse to download the code.
You are now ready to run your application! If you have any issues running your first app, appendix R should have you covered.
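If you have not cloned the repository yet (appendix D covers this in detail), a typical command line session starts by cloning it and then moving into the project directory:
$ git clone https://github.com/jgperrin/net.jgp.books.spark.ch01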
$ cd net.jgp.books.spark.ch01
$ mvn clean install exec:exec
After importing the project (see appendix D), locate CsvToDataframeApp.java in the Project Explorer. Right-click the file and then select Run As > 2 Java Application, as shown in figure 1.8. Look at the result in the console.
+---+--------+--------------------+-----------+--------------------+
| id|authorId|               title|releaseDate|                link|
+---+--------+--------------------+-----------+--------------------+
|  1|       1|Fantastic Beasts ...|   11/18/16|http://amzn.to/2k...|
|  2|       1|Harry Potter and ...|    10/6/15|http://amzn.to/2l...|
|  3|       1|The Tales of Beed...|    12/4/08|http://amzn.to/2k...|
|  4|       1|Harry Potter and ...|    10/4/16|http://amzn.to/2k...|
|  5|       2|Informix 12.10 on...|    4/23/17|http://amzn.to/2i...|
+---+--------+--------------------+-----------+--------------------+
only showing top 5 rows
Finally, you are coding! In the previous section, you saw the output. It’s time you run your first application. It will acquire a session, ask Spark to load a CSV file, and then display five rows (at most) of the dataset. Listing 1.1 provides the full program.
When it comes to showing code, two schools of thought exist: one school prefers to show only excerpts, and the other presents all of the code. I am of the latter school: I like the example to be complete rather than partial, as I don’t want you to have to figure out the missing parts or the needed packages, even if they are obvious.
Listing 1.1 Ingesting CSV
package net.jgp.books.spark.ch01.lab100_csv_to_dataframe;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvToDataframeApp {

  public static void main(String[] args) {         // #1 main() is your application's entry point
    CsvToDataframeApp app = new CsvToDataframeApp();
    app.start();
  }

  private void start() {
    SparkSession spark = SparkSession.builder()    // #2 Creates a session on a local master
        .appName("CSV to Dataset")
        .master("local")
        .getOrCreate();

    Dataset<Row> df = spark.read().format("csv")   // #3 Reads a CSV file with a header row and stores it in a dataframe
        .option("header", "true")
        .load("data/books.csv");

    df.show(5);                                    // #4 Shows at most 5 rows of the dataframe
  }
}
So far, you have done the following:
- Installed all the components you need to work with Spark. (Yes, it is that easy!)
- Created a session where code can be executed.
- Loaded a CSV data file.
- Displayed five rows of this dataset.
You are now ready to get deeper into Apache Spark and understand a little more about what is under the hood.
- Spark is an analytics operating system; you can use it to process workloads and algorithms in a distributed way. And it’s not only good for analytics: you can use Spark for data transfer, massive data transformation, log analysis, and more.
- Spark supports SQL, Java, Scala, R, and Python as a programming interface, but in this book, we focus on Java (and sometimes Python).
- Spark’s internal main data storage is the dataframe. The dataframe combines storage capacity with an API.
- If you have experience with JDBC development, you will find similarities with a JDBC ResultSet.
- If you have experience with relational database development, you can compare a dataframe to a table with less metadata.
- In Java, a dataframe is implemented as a Dataset<Row>.
- You can quickly set up Spark with Maven and Eclipse. Spark does not need to be installed.
- Spark is not limited to the MapReduce algorithm: its API allows a lot of algorithms to be applied to data.
- Streaming is used more and more frequently in enterprises, as businesses want access to real-time analytics. Spark supports streaming.
- Analytics have evolved from simple joins and aggregations. Enterprises want computers to think for us; hence Spark supports machine learning and deep learning.
- Graphs are a special use case of analytics, but nevertheless, Spark supports them.
1. See “Intel Puts the Brakes on Moore’s Law” by Tom Simonite, MIT Technology Review, March 2016 (http://mng.bz/gVj8).
2. See “Memory Prices (1957-2017)” by John C. McCallum (https://jcmit.net/memoryprice.htm).
4. If you are outside the United States, you need to understand how important the SSN is. It governs your entire life. It has almost no connection to its original purpose: once an identifier for social benefits, it has become a tax identifier and financial shadow, following and tracking people. Identity thieves look for SSNs and other personal data so they can open bank accounts or access existing accounts.
7. Health Level Seven International (HL7) is a not-for-profit, ANSI-accredited standards developing organization dedicated to facilitating the exchange, integration, sharing, and retrieval of electronic health information. HL7 is supported by more than 1,600 members from over 50 countries. Fast Healthcare Interoperability Resources (FHIR) is one of the latest specification standards for exchanging health-care information.
8. See “How MTV and Nickelodeon Use Real-Time Big Data Analytics to Improve Customer Experience” by Bernard Marr, Forbes, January 2017 (http://bit.ly/2ynJvUt).