
8 Building data pipelines with DuckDB


This chapter covers

  • The meaning and relevance of data pipelines
  • What roles DuckDB can have as part of a pipeline
  • How DuckDB integrates with tools like the Python-based data load tool (dlt) for ingestion and the data build tool (dbt) from dbt Labs for transformation
  • Orchestrating pipelines with Apache Dagster

Having explored DuckDB’s seamless integration with prominent data-processing languages like Python and libraries such as pandas, Apache Arrow, and Polars in Chapter 6, we know that DuckDB and its ecosystem are capable of tackling many of the tasks that make up data pipelines and can therefore be used within them. The combination of a powerful SQL engine, well-integrated tooling, and the potential of a cloud offering makes it more than just another database system.
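As a quick reminder of that integration, here is a minimal sketch of DuckDB querying an in-memory pandas DataFrame and handing the result over to Arrow and Polars. The sales_df DataFrame, its columns, and the aggregation are made up for illustration, and the snippet assumes that pandas, pyarrow, and polars are installed alongside duckdb.

import duckdb
import pandas as pd

# Illustrative stand-in for data that a pipeline might have ingested
sales_df = pd.DataFrame({
    "region": ["EMEA", "EMEA", "APAC"],
    "amount": [120.0, 80.0, 200.0],
})

# DuckDB can scan the local DataFrame directly by name
totals = duckdb.sql("""
    SELECT region, sum(amount) AS total_amount
    FROM sales_df
    GROUP BY region
    ORDER BY region
""")

print(totals.df())      # result as a pandas DataFrame
print(totals.arrow())   # result as an Apache Arrow table
print(totals.pl())      # result as a Polars DataFrame

Because DuckDB runs in-process, no data has to leave the Python program for this query, which is exactly what makes it such a convenient building block inside a pipeline.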

In this chapter, we’ll delve deeper into DuckDB’s role within the broader data ecosystem, emphasizing its significance in building robust data pipelines and enhancing workflows. To do so, we’ll first take a step back and discuss the meaning and relevance of data pipelines. Then we’ll evaluate a handful of tools that we find helpful when building robust pipelines, covering ingestion, transformation, and orchestration.

Let’s start with the basics and have a look at the problems we want to solve.

8.1 Data pipelines and the role of DuckDB

8.2 Data ingestion with dlt

8.2.1 Installing a supported source

8.2.2 Building a pipeline

8.2.3 Exploring pipeline metadata

8.3 Data transformation and modeling with dbt

8.3.1 Setting up a dbt project

8.3.2 Defining sources

8.3.3 Describing transformations with models

8.3.4 Testing transformations and pipelines

8.4 Orchestrating data pipelines with Dagster

8.4.1 Defining assets

8.4.2 Running pipelines

8.4.3 Managing dependencies in a pipeline

8.4.4 Uploading to MotherDuck

8.5 Summary