17. Exporting data and building full data pipelines


This chapter covers

  • Exporting data from Spark
  • Building a complete data pipeline, from ingestion to export
  • Understanding the impact of partitioning
  • Using Delta Lake as a database
  • Using Spark with cloud storage

As you reach the end of this book, it is time to see how to export data. After all, why learn all of this just to keep data inside Spark? I appreciate learning as a hobby as much as anyone, but it is even better when you can bring some real business value, right?

This chapter is divided into three sections. The first covers exporting data. As usual, you will work with a real dataset: you will ingest it, transform it, and then export it. You will play the role of a NASA scientist working with data coming from satellites; those datasets can be used to help prevent wildfires, a first step toward using code for good! In this section, you will also see the impact of partitioning on the exported data.
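
To make the partitioning point concrete, here is a minimal PySpark sketch; the file paths and the partition count are placeholders for illustration, not the chapter's exact lab code. The idea it shows is that each partition becomes one output file, so controlling partitioning at write time controls how the export is laid out on disk.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ExportSketch").getOrCreate()

# Ingest a CSV dataset (placeholder path; the chapter uses NASA satellite data).
df = spark.read.csv("data/fires.csv", header=True, inferSchema=True)

# Each partition becomes one output file, so repartitioning before the write
# directly controls how many files land in the output directory.
(df.repartition(4)
   .write
   .mode("overwrite")
   .csv("output/fires_csv", header=True))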

In the second part of this chapter, you will experiment with Delta Lake, which acts as a database sitting as close as possible to Spark's core. Delta Lake can radically simplify your data pipeline, and you will see how and why.
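
As a preview of what that looks like in practice, here is a minimal sketch of writing and reading a Delta table with PySpark. It assumes the Delta Lake package is available to your Spark session (for example via spark-submit --packages); the paths, package version, and configuration shown are illustrative, not the chapter's exact setup.

from pyspark.sql import SparkSession

# Illustrative configuration; requires the Delta Lake package on the classpath,
# e.g. spark-submit --packages io.delta:delta-spark_2.12:3.1.0 ...
spark = (SparkSession.builder
         .appName("DeltaSketch")
         .config("spark.sql.extensions",
                 "io.delta.sql.DeltaSparkSessionExtension")
         .config("spark.sql.catalog.spark_catalog",
                 "org.apache.spark.sql.delta.catalog.DeltaCatalog")
         .getOrCreate())

df = spark.read.csv("data/fires.csv", header=True, inferSchema=True)

# Write the dataframe as a Delta table on local storage ...
df.write.format("delta").mode("overwrite").save("output/fires_delta")

# ... and read it back, as any downstream consumer of the pipeline would.
spark.read.format("delta").load("output/fires_delta").show(5)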

Finally, I will share resources for using Apache Spark with cloud storage providers, including AWS, Microsoft Azure, IBM Cloud, OVH, and Google Cloud Platform. These resources are mainly aimed at helping you navigate those ever-changing cloud offerings.
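
As a taste of what those resources cover, here is a minimal sketch of reading a dataset straight from Amazon S3 through Hadoop's s3a:// connector. The bucket, credentials, and package version are placeholders; the other providers expose similar connectors with their own configuration keys.

from pyspark.sql import SparkSession

# Requires the hadoop-aws module on the classpath, e.g.
# spark-submit --packages org.apache.hadoop:hadoop-aws:3.3.4 ...
# The credentials and bucket below are placeholders.
spark = (SparkSession.builder
         .appName("CloudStorageSketch")
         .config("spark.hadoop.fs.s3a.access.key", "<YOUR_ACCESS_KEY>")
         .config("spark.hadoop.fs.s3a.secret.key", "<YOUR_SECRET_KEY>")
         .getOrCreate())

# Reading from cloud storage looks exactly like reading from a local path.
df = spark.read.csv("s3a://your-bucket/fires/fires.csv",
                    header=True, inferSchema=True)
df.show(5)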

17.1 Exporting data

17.1.1 Building a pipeline with NASA datasets

17.1.2 Transforming columns to datetime

17.1.3 Transforming the confidence percentage to confidence level

17.1.4 Exporting the data

17.1.5 Exporting the data: What really happened?

17.2 Delta Lake: Enjoying a database close to your system

17.2.1 Understanding why a database is needed

17.2.2 Using Delta Lake in your data pipeline

17.2.3 Consuming data from Delta Lake

17.3 Accessing cloud storage services from Spark

Summary