4 Getting data into the platform

This chapter covers

The differences between databases (RDBMS and NoSQL), files, APIs and streams as data sources
How to work around the most challenging attributes of each data source type
The differences between using an SQL interface and Change Data Capture techniques for ingestion from RDBMS
The statistics you need to capture in your ingestion pipeline to implement quality control and monitoring later
Network and security considerations for data ingestion into the cloud

If you’ve read the chapters up to this point, you’re able to architect a good layered data lake. Now it’s time to start diving into a few of these layers in much greater detail.

In this chapter, we’ll focus on the ingestion layer. Before you can use your cloud data platform to produce outcomes through traditional or advanced analytics or reports, you need to populate it with data. One of the key characteristics of a data platform is its ability to ingest and store data of all types in its native format. This variety presents challenges, so we’ll walk through the most popular data source types (RDBMS, files, APIs and streams) and help you understand how they differ from the perspective of ingestion. We’ll also touch on the networking and security considerations that apply regardless of which data source is being ingested.

By the end of the chapter you’ll be able to:

4.1 Databases, files, APIs and streams

4.1.1 Relational databases

4.1.2 Files

4.1.3 SaaS data via API

4.1.4 Streams

4.2 Ingesting data from relational databases

4.2.1 Ingesting data from RDBMS using an SQL interface

4.2.2 Full table ingestion

4.2.3 Incremental table ingestion

4.2.4 Change Data Capture (CDC)

4.2.5 CDC vendors overview

4.2.6 Data type conversion

4.2.7 Ingesting data from NoSQL databases

4.2.8 Capturing important metadata for RDBMS or NoSQL ingestion pipeline

4.3 Ingesting data from files

4.3.1 Tracking ingested files

4.3.2 Capturing file ingestion metadata

4.4 Ingesting data from streams

4.4.1 Differences between batch and streaming ingestion