Appendix B Python for Apache Iceberg
Apache Iceberg has become a central standard for modern data lakehouses, and Python provides one of the most adaptable ecosystems for working with it. This appendix introduces practical ways to work with Iceberg through leading Python libraries and frameworks, some of which access Iceberg tables directly and others indirectly. Each section focuses on a single library, explains its connection to Iceberg, and includes step-by-step examples for both ETL and analytical workloads.
The goal is to show how to build, manage, and analyze Iceberg data entirely in Python, without depending on JVM-based systems such as Spark. You’ll learn how to define schemas, create tables, append and overwrite data, and run queries using tools such as PyIceberg, Polars, DuckDB, Daft, PyDremio, Bauplan, and SpiceAI.
Each tool plays a different role in the Python-Iceberg ecosystem:
- PyIceberg provides direct, low-level access to Iceberg tables and catalogs.
- Polars and DuckDB deliver high-performance, in-process analytics on Iceberg data.
- Daft adds distributed computation built on Apache Arrow.
- Dremio provides comprehensive SQL support for Iceberg with scalable query execution.
- Bauplan extends Iceberg’s data model with Git-style branching and version control.
- SpiceAI enables federated queries and intelligent analytics over Iceberg datasets.