This chapter covers
- The differences between databases (RDBMS and NoSQL), files, APIs and streams as data sources
- How to work around the most challenging attributes of each data source type
- The differences between using a SQL interface and Change Data Capture techniques for ingesting data from an RDBMS
- The statistics you need to capture in your ingestion pipeline to implement quality control and monitoring later
- Network and security considerations for data ingestion into the cloud
If you’ve read the chapters up to this point, you’re able to architect a good layered data lake. Now it’s time to start diving into a few of these layers in much greater detail.
In this chapter, we’ll focus on the ingestion layer. Before you can use your cloud data platform to produce outcomes through traditional or advanced analytics or reports, you need to populate it with data. One of the key characteristics of a data platform is its ability to ingest and store data of all types in its native format. This variety does present challenges, so we’ll walk through the most popular data source types (RDBMSs, files, APIs and streams) and help you understand how they differ from the perspective of ingestion. We’ll also touch on the networking and security considerations that apply regardless of the data source being ingested.
By the end of the chapter you’ll be able to: