6 Integrating with the Python ecosystem

 

This chapter covers

  • The differences between DuckDB’s implementation of Python DB-API 2.0 and the DuckDB relational API
  • Ingesting data from pandas DataFrames, Apache Arrow tables, and more via the Python API
  • Querying pandas DataFrames with DuckDB methods
  • Exporting data to various DataFrame formats and Apache Arrow tables
  • Using DuckDB’s relational API to compose queries

Up until now, we’ve used the DuckDB CLI to manage and execute all of our queries. The CLI is highly effective for ad hoc analysis and for shell-based pipelines. Many data workflows, however, rely heavily on Python and its ecosystem; pandas DataFrames, for example, are hard to avoid. In this chapter, we will learn that DuckDB’s Python API goes well beyond implementing the Python DB-API. It lets you not only use the embedded database inside your Python process but also query Python objects such as DataFrames as if they were tables, and it makes it easy to convert query results back into DataFrames. We focus on the integrations that are bundled directly with the DuckDB Python package.

6.1 Getting started

6.1.1 Installing the Python package

6.1.2 Opening up a database connection

6.2 Using the relational API

6.2.1 Ingesting CSV data with the Python API

6.2.2 Composing queries

6.2.3 SQL querying

6.3 Querying pandas DataFrames

6.4 User-defined functions

6.5 Interoperability with Apache Arrow and Polars

Summary