5 Architecting the ingestion layer
This chapter covers
- Requirements for ingestion performance, reliability, and latency
- Comparing batch, micro-batch, and streaming ingestion strategies
- How Iceberg handles data writes, commits, and conflict resolution
- Ingestion technologies such as Spark, Flink, and others
- Ingestion patterns for schema evolution, data quality, and auditability
The ingestion layer is the starting point of your Apache Iceberg lakehouse in practice. It is where raw data enters the system, whether from operational databases, message queues, cloud services, or external vendors. While the storage layer determines how data is preserved, the ingestion layer determines how data arrives—how fast, how clean, and how reliably.
Designing this layer requires more than just choosing an ETL tool. You must consider latency tolerance, throughput capacity, schema evolution, and fault recovery. These requirements can vary widely across use cases. Some pipelines deliver high-frequency transactions that must be processed in seconds. Others may batch up nightly logs or slowly changing dimensions. Your ingestion layer must support both without compromising performance or consistency.
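To make those two ends of the spectrum concrete, here is a minimal PySpark sketch contrasting a nightly batch append with a micro-batch streaming append into Iceberg tables. It assumes a Spark session already configured with an Iceberg catalog (called `lake` here); the table names, paths, Kafka broker, and topic are hypothetical placeholders, not part of any specific pipeline discussed in this book.

```python
from pyspark.sql import SparkSession

# Sketch only: the catalog name "lake", the table names, the S3 paths, and the
# Kafka topic are hypothetical. Assumes Spark is configured with an Iceberg catalog.
spark = SparkSession.builder.appName("ingestion-sketch").getOrCreate()

# Batch ingestion: read yesterday's raw files and append them to an existing
# Iceberg table in a single atomic commit.
nightly_logs = spark.read.parquet("s3://raw-zone/logs/dt=2024-06-01/")
nightly_logs.writeTo("lake.db.web_logs").append()

# Streaming ingestion: consume a Kafka topic and commit a micro-batch to an
# Iceberg table every 60 seconds.
events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", "transactions")
    .load()
)

query = (
    events.selectExpr("CAST(value AS STRING) AS payload", "timestamp")
    .writeStream.format("iceberg")
    .outputMode("append")
    .option("checkpointLocation", "s3://raw-zone/checkpoints/transactions/")
    .trigger(processingTime="60 seconds")
    .toTable("lake.db.transactions")
)
query.awaitTermination()
```

Both writers end in an Iceberg commit; what differs is the cadence. The batch job produces one large snapshot per run, while the streaming job produces a new snapshot at every trigger interval, which is one reason the commit and conflict-resolution behavior covered later in this chapter matters.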