4 Getting data into the platform

This chapter covers

Understanding databases, files, APIs, and streams
Ingesting data from RDBMSs using SQL versus change data capture
Parsing and ingesting data from various file formats
Developing strategies to deal with source schema changes
Designing an ingestion pipeline to handle the challenges of data streams
Building an ingestion pipeline for SaaS data
Implementing quality control and monitoring in your ingestion pipeline
Discussing network and security considerations for cloud data ingestion

If you’ve read the chapters up to this point, you’re able to architect a good, layered data lake. Now it’s time to start diving into a few of these layers in much greater detail.

In this chapter, we’ll focus on the ingestion layer. Before you can use your cloud data platform to produce outcomes through traditional analytics, advanced analytics, or reports, you need to populate it with data. One of the key characteristics of a data platform is its ability to ingest and store data of all types in its native format. This variety does present challenges, so we’ll walk through the most popular types of data sources (RDBMSs, files, APIs, and streams) and help you understand how they differ from the perspective of ingestion. We’ll also touch on the networking and security considerations that apply regardless of the data source being ingested.

4.1 Databases, files, APIs, and streams

4.1.1 Relational databases

4.1.2 Files

4.1.3 SaaS data via API

4.1.4 Streams

4.2 Ingesting data from relational databases

4.2.1 Ingesting data from RDBMSs using a SQL interface

4.2.2 Full-table ingestion

4.2.3 Incremental table ingestion

4.2.4 Change data capture (CDC)