7. Ingestion from files
This chapter covers

  • Common behaviors of parsers
  • Ingesting from CSV, JSON, XML, and text files
  • Understanding the difference between one-line and multiline JSON records
  • Understanding the need for big data-specific file formats

Ingestion is the first step of your big data pipeline. You will have to onboard the data into your instance of Spark, whether it is running in local mode or cluster mode. As you know by now, data in Spark is transient, meaning that when you shut down Spark, it's all gone. You will learn how to import data from standard file formats, including CSV, JSON, XML, and text.

In this chapter, after learning about behaviors common to the various parsers, you'll work with made-up datasets that illustrate specific cases, as well as datasets from open data platforms. It will be tempting to start performing analytics with those datasets. As you see the data displayed onscreen, you will start thinking, "What happens if I join this dataset with this other one? What if I start aggregating this field . . . ?" You will learn how to perform those operations in chapters 11 through 15 and chapter 17, but first you need to get all that data into Spark!

The examples in this chapter are based on Spark v3.0. Parser behaviors have evolved from one Spark version to the next, especially when dealing with CSV files, so older versions may not behave exactly as shown here.

7.1 Common behaviors of parsers
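
Whatever the file format, every Spark parser is driven through the same DataFrameReader pattern: name a format, set format-specific options, optionally provide a schema, and load a path. The sketch below illustrates that shared skeleton in Java; the file name and the options are placeholders, not a specific dataset from this chapter.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class GenericIngestionApp {
  public static void main(String[] args) {
    // A local session; point master() at your cluster in cluster mode
    SparkSession spark = SparkSession.builder()
        .appName("Generic file ingestion")
        .master("local[*]")
        .getOrCreate();

    // Every file parser follows this pattern: format + options + load
    Dataset<Row> df = spark.read()
        .format("csv")                // or "json", "text", ...
        .option("header", true)       // option keys are format-specific
        .load("data/some-file.csv");  // placeholder path

    df.show(5);        // display the first five rows
    df.printSchema();  // display the resulting schema
    spark.stop();
  }
}
```

Only the format name and the option keys change from one parser to the next; the sections that follow vary exactly those two things.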

7.2 Complex ingestion from CSV

7.2.1 Desired output

7.2.2 Code
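
As a hedged sketch of a "complex" CSV ingestion, assume a hypothetical semicolon-separated file named data/books.csv whose values are quoted with asterisks, whose records can span several lines, and whose dates follow a month/day/year pattern; all of those specifics are illustrative assumptions, not requirements.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ComplexCsvIngestionApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Complex CSV ingestion")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", true)             // first line names the columns
        .option("sep", ";")                 // semicolon-separated fields
        .option("quote", "*")               // values quoted with asterisks
        .option("multiline", true)          // a record can span several lines
        .option("dateFormat", "MM/dd/yyyy") // hypothetical date pattern
        .option("inferSchema", true)        // let Spark guess column types
        .load("data/books.csv");            // hypothetical file

    df.show(5);
    df.printSchema();
    spark.stop();
  }
}
```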

7.3 Ingesting a CSV with a known schema

7.3.1 Desired output

7.3.2 Code
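
When you already know the structure of the file, you can skip inference and hand Spark an explicit schema. The sketch below assumes the same hypothetical books.csv with five columns (id, authorId, title, releaseDate, and link); the column names and types are assumptions for illustration.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class CsvWithSchemaApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("CSV with a known schema")
        .master("local[*]")
        .getOrCreate();

    // Hypothetical five-column schema; adjust names and types to your file
    StructType schema = DataTypes.createStructType(new StructField[] {
        DataTypes.createStructField("id", DataTypes.IntegerType, false),
        DataTypes.createStructField("authorId", DataTypes.IntegerType, true),
        DataTypes.createStructField("title", DataTypes.StringType, false),
        DataTypes.createStructField("releaseDate", DataTypes.DateType, true),
        DataTypes.createStructField("link", DataTypes.StringType, false) });

    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", true)
        .option("dateFormat", "MM/dd/yyyy") // needed to parse releaseDate
        .schema(schema)                     // no inference pass over the data
        .load("data/books.csv");            // hypothetical file

    df.show(5);
    df.printSchema();
    spark.stop();
  }
}
```

Providing the schema spares Spark the extra pass over the data that inference requires and removes any ambiguity about column types.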

7.4 Ingesting a JSON file

7.4.1 Desired output

7.4.2 Code
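
By default, Spark's JSON parser expects JSON Lines: one complete JSON document per line of the file. A minimal sketch, assuming a hypothetical JSON Lines file named data/records.json:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class JsonIngestionApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("JSON Lines ingestion")
        .master("local[*]")
        .getOrCreate();

    // Each line of the file must be a self-contained JSON document
    Dataset<Row> df = spark.read()
        .format("json")
        .load("data/records.json"); // hypothetical JSON Lines file

    df.show(5);
    df.printSchema(); // nested objects surface as struct columns
    spark.stop();
  }
}
```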

7.5 Ingesting a multiline JSON file

7.5.1 Desired output

7.5.2 Code
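
A multiline JSON file is one where a single record, often a pretty-printed document or a top-level array, spans several lines, so the default line-oriented parser cannot make sense of it. The only change from the previous sketch is one option, shown here with a hypothetical file name:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class MultilineJsonIngestionApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Multiline JSON ingestion")
        .master("local[*]")
        .getOrCreate();

    Dataset<Row> df = spark.read()
        .format("json")
        .option("multiline", true)    // records may span several lines
        .load("data/countries.json"); // hypothetical pretty-printed file

    df.show(5);
    df.printSchema();
    spark.stop();
  }
}
```

Without the option, the default parser typically surfaces such a file as a single _corrupt_record column rather than failing outright.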

7.6 Ingesting an XML file

7.6.1 Desired output

7.6.2 Code
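
Spark 3.0 ships no built-in XML data source, so XML ingestion typically relies on a third-party connector such as Databricks' spark-xml package (for example, a com.databricks:spark-xml_2.12 artifact on your classpath). Assuming that library and a hypothetical file whose repeating element is <row>, the sketch looks like this:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class XmlIngestionApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("XML ingestion")
        .master("local[*]")
        .getOrCreate();

    // Requires the spark-xml connector on the classpath
    Dataset<Row> df = spark.read()
        .format("xml")
        .option("rowTag", "row")   // element that delimits one record
        .load("data/records.xml"); // hypothetical file

    df.show(5);
    df.printSchema();
    spark.stop();
  }
}
```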

7.7 Ingesting a text file

7.7.1 Desired output

7.7.2 Code
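
The text source is the simplest of all: each line of the file becomes one row, in a single string column named value. A minimal sketch with a hypothetical file:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class TextIngestionApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Text ingestion")
        .master("local[*]")
        .getOrCreate();

    // One row per line, in a single column named "value"
    Dataset<Row> df = spark.read()
        .format("text")
        .load("data/shakespeare.txt"); // hypothetical file

    df.show(5);
    df.printSchema(); // root |-- value: string
    spark.stop();
  }
}
```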

7.8 File formats for big data
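
CSV, JSON, XML, and text were designed for people, not for distributed processing: they carry no embedded schema, compress poorly, and can be hard to split across executors. Formats built for big data, such as Parquet, ORC, and Avro, embed their schema and compress well, and the columnar ones let Spark read only the columns a query needs. As a hedged illustration of the round trip, the sketch below writes a dataframe to Parquet and reads it back; the paths are placeholders.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ParquetRoundTripApp {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("Parquet round trip")
        .master("local[*]")
        .getOrCreate();

    // Ingest a hypothetical CSV, then persist it as Parquet
    Dataset<Row> df = spark.read()
        .format("csv")
        .option("header", true)
        .option("inferSchema", true)
        .load("data/books.csv");     // hypothetical file

    df.write()
        .format("parquet")
        .mode("overwrite")           // replace any previous output
        .save("data/books.parquet"); // a directory of part files

    // Reading it back needs no options: the schema travels with the data
    Dataset<Row> parquetDf = spark.read()
        .format("parquet")
        .load("data/books.parquet");

    parquetDf.printSchema();
    spark.stop();
  }
}
```

Note that save() produces a directory of part files rather than a single file; that layout is what allows executors to write in parallel.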