12 Ingesting data incrementally


This chapter covers

  • Comparing data ingestion approaches
  • Preserving history with slowly changing dimensions
  • Detecting changes with Snowflake streams
  • Maintaining data with dynamic tables
  • Querying historical data

In previous chapters, we built data pipelines that handled small amounts of data, without worrying about performance or scheduled execution. In real-world scenarios, however, data engineers usually deal with large data volumes, which call for additional pipeline design considerations, such as avoiding reprocessing all the data on every run. One way to limit the volume of data processed during each pipeline execution is to ingest data incrementally.

Incremental data ingestion is faster than full ingestion because it moves less data, which lowers both storage and compute costs: virtual warehouses need less time to process the data and therefore consume fewer credits, and the intermediate layers of the pipeline store less data.
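
For example, instead of truncating and reloading an entire target table, an incremental pipeline can select only the rows that changed since the previous run and merge them into the target. The following sketch illustrates one common high-watermark pattern in Snowflake SQL; the src_orders and tgt_orders tables and the updated_at column are hypothetical, not code from this chapter:

-- Load only rows added or changed since the last run, using the
-- highest updated_at value already in the target as a watermark.
-- (Table and column names are illustrative.)
MERGE INTO tgt_orders AS t
USING (
    SELECT order_id, amount, updated_at
    FROM src_orders
    WHERE updated_at > (
        SELECT COALESCE(MAX(updated_at), '1970-01-01'::TIMESTAMP)
        FROM tgt_orders
    )
) AS s
ON t.order_id = s.order_id
WHEN MATCHED THEN
    UPDATE SET t.amount = s.amount, t.updated_at = s.updated_at
WHEN NOT MATCHED THEN
    INSERT (order_id, amount, updated_at)
    VALUES (s.order_id, s.amount, s.updated_at);

Because each run reads only the slice of data newer than the watermark, the warehouse scans and writes far less than a full reload would. Later in the chapter, Snowflake streams and dynamic tables offer built-in alternatives to tracking a watermark by hand.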

12.1 Comparing data ingestion approaches

12.1.1 Full ingestion

12.1.2 Incremental ingestion

12.2 Preserving history with slowly changing dimensions

12.2.1 SCD type 2

12.2.2 Append-only strategy

12.2.3 Designing idempotent data pipelines

12.3 Detecting changes with Snowflake streams

12.3.1 Ingesting files from cloud storage incrementally

12.3.2 Preserving history when ingesting data incrementally

12.4 Maintaining data with dynamic tables

12.4.1 Deciding when to use dynamic tables

12.4.2 Querying historical data

Summary