5 Architecting the ingestion layer

This chapter covers

  • Requirements for ingestion performance, reliability, and latency
  • Comparing batch, micro-batch, and streaming ingestion strategies
  • How Iceberg handles data writes, commits, and conflict resolution
  • Ingestion technologies such as Spark, Flink, and others
  • Ingestion patterns for schema evolution, data quality, and auditability

In practice, the ingestion layer is the starting point of your Apache Iceberg lakehouse. It is where raw data enters the system, whether from operational databases, message queues, cloud services, or external vendors. While the storage layer determines how data is preserved, the ingestion layer determines how data arrives: how fast, how clean, and how reliably.

Designing this layer requires more than choosing an ETL tool. You must consider latency tolerance, throughput capacity, schema evolution, and fault recovery, and these requirements vary widely across use cases. Some pipelines deliver high-frequency transactions that must be processed within seconds; others batch nightly logs or slowly changing dimensions. Your ingestion layer must support both extremes without compromising performance or consistency.
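To make these tradeoffs concrete before we examine each one, the following sketch shows what a minimal batch ingestion into an Iceberg table can look like with PySpark. Treat it as a sketch under stated assumptions rather than a reference implementation: it assumes Spark 3 with the iceberg-spark-runtime package on the classpath, and the catalog name (demo), warehouse path, table name, and sample rows are hypothetical placeholders.

from datetime import datetime
from pyspark.sql import SparkSession

# Register a filesystem-backed Iceberg catalog named "demo" (the
# name and warehouse path are placeholders; adapt them to your setup).
spark = (
    SparkSession.builder
    .appName("iceberg-ingestion-sketch")
    .config("spark.sql.catalog.demo", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.demo.type", "hadoop")
    .config("spark.sql.catalog.demo.warehouse", "file:///tmp/warehouse")
    .getOrCreate()
)

# Create the target table once. Iceberg's hidden partitioning
# derives the partition value from event_ts at write time.
spark.sql("""
    CREATE TABLE IF NOT EXISTS demo.db.events (
        event_id STRING,
        event_ts TIMESTAMP,
        payload  STRING)
    USING iceberg
    PARTITIONED BY (days(event_ts))
""")

# Stand-in for one batch of raw source data.
batch = spark.createDataFrame(
    [("e-1", datetime(2024, 1, 1, 12, 0), '{"clicks": 3}')],
    "event_id STRING, event_ts TIMESTAMP, payload STRING",
)

# Each append is a single atomic Iceberg commit: concurrent readers
# see all of the new rows or none of them.
batch.writeTo("demo.db.events").append()

Whether a batch arrives nightly or every few seconds, the write path ends the same way: an atomic commit to the table's metadata. That shared commit model is what lets one table format serve both ends of the latency spectrum, and it underpins everything this chapter covers.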

5.1 Ingestion requirements

5.1.1 Ingestion throughput and latency

5.1.2 Reliability and fault tolerance

5.1.3 Schema management and evolution

5.1.4 Operational complexity and maintainability

5.2 Ingestion models and architectures

5.2.1 Batch ingestion

5.2.2 Micro-batch and incremental ingestion

5.2.3 Streaming ingestion

5.3 How Iceberg manages writes

5.3.1 Write semantics in Iceberg

5.3.2 Commit protocols and conflict handling

5.4 Tools and frameworks for ingestion

5.4.1 Apache Spark

5.4.2 Apache Flink

5.4.3 Apache NiFi

5.4.4 Fivetran

5.4.5 Qlik

5.4.6 Airbyte

5.4.7 Confluent

5.4.8 Redpanda

5.4.9 Cloud-native ingestion services

5.4.10 Tool selection considerations

5.5 Applying ingestion requirements in context

5.5.1 Prioritizing low latency

5.5.2 Managing high throughput

5.5.3 Supporting complex transformations

5.5.4 Handling schema evolution

5.5.5 Balancing operational overhead

5.5.6 Considering existing cloud environments