9 Maintaining an Iceberg lakehouse


This chapter covers

  • Identifying and resolving performance issues caused by suboptimal data and metadata files
  • Running compaction jobs to optimize file layout and improve query speed
  • Managing snapshot retention to reduce storage footprint and meet compliance needs
  • Using Iceberg metadata tables to monitor table health and guide maintenance

Designing and deploying a lakehouse is only the beginning. Long-term value comes from keeping the platform performant, governed, and resilient over time. Apache Iceberg provides powerful capabilities for data organization, schema evolution, and transaction isolation, but without proactive maintenance, those strengths erode: small files accumulate, snapshots pile up, and query performance degrades. As datasets grow, write patterns evolve, and business needs change, the lakehouse must adapt. This chapter covers the core maintenance tasks that keep an Iceberg lakehouse healthy: compacting suboptimal files, expiring old snapshots, monitoring table health through metadata tables, and enforcing access controls.
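Before digging into each problem in turn, here is a minimal preview of the maintenance operations this chapter walks through, expressed with Iceberg's built-in Spark procedures and metadata tables. The sketch assumes a Spark session configured with the Iceberg runtime; the catalog name (demo) and table name (db.events) are placeholders, not prescriptions.

from pyspark.sql import SparkSession

# Assumes spark.sql.catalog.demo is configured as an Iceberg catalog
# and that db.events is an existing Iceberg table (both are placeholders).
spark = SparkSession.builder.appName("iceberg-maintenance").getOrCreate()

# Compaction (section 9.2): rewrite small files toward a 512 MB target.
spark.sql("""
    CALL demo.system.rewrite_data_files(
        table => 'db.events',
        options => map('target-file-size-bytes', '536870912'))
""")

# Snapshot expiration (section 9.3): drop snapshots older than a cutoff,
# while always retaining the five most recent.
spark.sql("""
    CALL demo.system.expire_snapshots(
        table => 'db.events',
        older_than => TIMESTAMP '2024-01-01 00:00:00',
        retain_last => 5)
""")

# Metadata tables (section 9.4): inspect recent commits for table health.
spark.sql("""
    SELECT snapshot_id, committed_at, operation
    FROM demo.db.events.snapshots
    ORDER BY committed_at DESC
""").show()

Each of these calls gets a full treatment in the sections that follow, including how to scope compaction with filters and how retention interacts with copy-on-write and merge-on-read tables.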

9.1 Problem: Suboptimal data files

9.1.1 Small files

9.1.2 Poorly colocated data

9.1.3 Metadata sprawl

9.1.4 Merge-on-read (MOR) performance hits

9.2 Solution: Compaction

9.2.1 What is compaction?

9.2.2 Target file size

9.2.3 Files to be included

9.2.4 Using filters to scope compaction

9.3 Storage footprint management and data retention

9.3.1 Running snapshot expiration

9.3.2 Copy-on-write (COW) vs. merge-on-read (MOR): Implications for data retention

9.3.3 Regulatory considerations for data deletion

9.4 Exploring Apache Iceberg's metadata tables

9.5 Access controls in an Iceberg lakehouse

9.5.1 Storage-level controls

9.5.2 Catalog-level controls

9.5.3 Engine-level controls

9.6 Summary