This chapter covers
- Common behaviors among the various parsers
- Ingesting data from CSV, JSON, XML, and text files
- The transient nature of data in Spark
Ingestion is the first step of your big data pipeline. You will have to onboard the data into your instance of Spark, whether it is running in local mode or cluster mode. As you know by now, data in Spark is transient, meaning that when you shut it down, it's all gone. You will learn how to import data from standard file formats, including CSV, JSON, XML, and text.
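To make the starting point concrete, here is a minimal sketch of ingesting a CSV file, assuming a Java application with the Spark SQL dependency on the classpath; the class name and the file data/books.csv are hypothetical. The other formats covered in this chapter follow the same read() pattern with a different format value (such as "json" or "text").

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class CsvIngestionApp {
    public static void main(String[] args) {
        // Start a local Spark session; in cluster mode the master
        // would be supplied by the submission environment instead.
        SparkSession spark = SparkSession.builder()
                .appName("CSV ingestion sketch")
                .master("local[*]")
                .getOrCreate();

        // Ingest a CSV file into a dataframe. The path is a
        // hypothetical example; "header" tells the parser that the
        // first line of the file holds the column names.
        Dataset<Row> df = spark.read()
                .format("csv")
                .option("header", true)
                .load("data/books.csv");

        df.show(5);         // peek at the first five rows
        df.printSchema();   // inspect the inferred schema

        spark.stop();       // the dataframe is gone once Spark stops
    }
}
```

Remember that once spark.stop() runs, the dataframe is gone with the session: ingestion loads data into Spark for the lifetime of the application, not into permanent storage.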
In this chapter, after learning about behaviors common to the various parsers, you'll use made-up datasets to illustrate specific cases, as well as datasets from open data platforms. It will be tempting to start performing analytics with those datasets. As you see the data displayed onscreen, you will start thinking, "What happens if I join this dataset with this other one? What if I start aggregating this field...?" You will learn how to perform those operations in chapters 11 through 15 and chapter 17, but first you need to get all that data into Spark!