4 Getting data into the platform

 

This chapter covers

  • Understanding databases, files, APIs, and streams
  • Ingesting data from RDBMSs using SQL versus change data capture
  • Parsing and ingesting data from various file formats
  • Developing strategies to deal with source schema changes
  • Designing an ingestion pipeline to handle the challenges of data streams
  • Building an ingestion pipeline for SaaS data
  • Implementing quality control and monitoring in your ingestion pipeline
  • Discussing network and security considerations for cloud data ingestion

If you’ve read the chapters up to this point, you’re able to architect a good, layered data platform. Now it’s time to dive into a few of these layers in much greater detail.

In this chapter, we’ll focus on the ingestion layer. Before you can use your cloud data platform to produce outcomes through traditional or advanced analytics or reports, you need to populate it with data. One of the key characteristics of a data platform is its ability to ingest and store data of all types in its native format. This variety presents challenges, so we’ll walk through the most common source types (RDBMSs, files, APIs, and streams) and help you understand how they differ from the perspective of ingestion. We’ll also touch on the networking and security considerations that apply regardless of the data source being ingested.

4.1 Databases, files, APIs, and streams

4.1.1 Relational databases

4.1.2 Files

4.1.3 SaaS data via API

4.1.4 Streams

4.2 Ingesting data from relational databases

4.2.1 Ingesting data from RDBMSs using a SQL interface

4.2.2 Full-table ingestion

4.2.3 Incremental table ingestion

4.2.4 Change data capture (CDC)
