8 Building data pipelines with DuckDB


This chapter covers

  • The meaning and relevance of data pipelines
  • What roles DuckDB can play as part of a pipeline
  • How DuckDB integrates with tools like the Python-based data load tool (dlt) for ingestion and the data build tool (dbt) from dbt Labs for transformation
  • Orchestrating pipelines with Dagster

In chapter 6, we explored DuckDB’s seamless integration with prominent data processing languages, such as Python, and libraries, such as pandas, Apache Arrow, and Polars. DuckDB and its ecosystem are therefore capable of tackling many of the tasks that make up data pipelines and can be used within them. The combination of a powerful SQL engine, well-integrated tooling, and the potential of a cloud offering makes DuckDB more than just another database system.

In this chapter, we’ll delve deeper into DuckDB’s role within the broader data ecosystem, emphasizing its significance in building robust data pipelines and enhancing workflows. To do so, we first take a step back and discuss the meaning and relevance of data pipelines. We then evaluate several tools that we find helpful for building robust pipelines, covering ingestion, transformation, and orchestration. Let’s start with the basics and look at the problems we want to solve.

8.1 Data pipelines and the role of DuckDB

8.2 Data ingestion with dlt

8.2.1 Installing a supported source

8.2.2 Building a pipeline
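
To give a feel for the shape of such a pipeline before we walk through the real example, here is a minimal sketch using dlt’s Python API; the source data, pipeline name, and dataset name are placeholders, not this chapter’s actual example:

import dlt

# Placeholder data; a real pipeline would pull from a supported dlt
# source such as a REST API or a database.
rows = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]

# Declare a pipeline that loads into a local DuckDB database file.
# Requires the DuckDB extra: pip install "dlt[duckdb]"
pipeline = dlt.pipeline(
    pipeline_name="example_pipeline",
    destination="duckdb",
    dataset_name="raw_data",
)

# Run the load; dlt infers the schema and creates the target table.
load_info = pipeline.run(rows, table_name="people")
print(load_info)

dlt takes care of schema inference and normalization for us, which is exactly the kind of boilerplate we don’t want to hand-roll for every source.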

8.2.3 Exploring pipeline metadata
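
dlt records bookkeeping about every run, both on the pipeline object and inside the destination database itself. A minimal sketch, reusing the hypothetical pipeline from above and assuming dlt’s default database file name of <pipeline_name>.duckdb in the working directory:

import dlt
import duckdb

pipeline = dlt.pipeline(
    pipeline_name="example_pipeline",
    destination="duckdb",
    dataset_name="raw_data",
)

# The trace of the most recent run: timings, load packages, row counts.
print(pipeline.last_trace)

# dlt also writes metadata tables into the destination; _dlt_loads
# lists every completed load package.
con = duckdb.connect("example_pipeline.duckdb")
print(con.sql("SELECT load_id, status FROM raw_data._dlt_loads").fetchall())

# The bundled Streamlit app shows the same information interactively:
#   dlt pipeline example_pipeline show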

8.3 Data transformation and modeling with dbt

8.3.1 Setting up a dbt project

8.3.2 Defining sources

8.3.3 Describing transformations with models
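
As a rough illustration of what a model looks like (the model, source, and column names here are placeholders), a dbt model is just a SELECT statement in a .sql file; dbt resolves the Jinja references and materializes the result as a view or table in DuckDB:

-- models/staging/stg_people.sql
-- {{ source(...) }} points at a table declared in the sources file;
-- downstream models would refer to this one with {{ ref('stg_people') }}.
select
    id,
    lower(name) as name
from {{ source('raw_data', 'people') }}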

8.3.4 Testing transformations and pipelines
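
Besides the generic tests you declare in YAML, dbt supports singular tests: SQL files that select the rows violating an expectation, with any returned row counting as a failure when you run dbt test. A hypothetical example against the model sketched above:

-- tests/assert_people_have_ids.sql
-- Fails if any row in stg_people is missing an id.
select *
from {{ ref('stg_people') }}
where id is null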

8.3.5 Transforming all CSV files
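
Because dbt models running against DuckDB have the full engine at their disposal, a single model can scan many files at once with a glob pattern. A sketch, with a placeholder path:

-- models/staging/stg_all_files.sql
-- filename = true adds a column recording which file each row came from.
select *
from read_csv_auto('data/*.csv', filename = true)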

8.4 Orchestrating data pipelines with Dagster

8.4.1 Defining assets
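
In Dagster, an asset is a Python function whose return value (or side effect) represents a piece of data the pipeline produces. A minimal sketch with a hypothetical input file:

import duckdb
import pandas as pd
from dagster import asset

@asset
def raw_people() -> pd.DataFrame:
    # DuckDB reads the CSV and hands the result back as a DataFrame.
    return duckdb.sql("SELECT * FROM read_csv_auto('data/people.csv')").df()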

8.4.2 Running pipelines
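
There are two common ways to execute assets; the module name below is a placeholder for wherever the asset definitions live:

from dagster import Definitions, materialize

from my_assets import raw_people  # hypothetical module with the asset above

# Option 1: materialize assets directly from Python, handy in scripts and tests.
result = materialize([raw_people])
assert result.success

# Option 2: expose the assets as Definitions and launch runs from the
# Dagster UI, started locally with: dagster dev -f my_assets.py
defs = Definitions(assets=[raw_people])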

8.4.3 Managing dependencies in a pipeline
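
Dependencies between assets are declared by naming an upstream asset as a function parameter; Dagster then materializes the upstream asset first and passes its value in. A sketch building on the hypothetical assets from before:

import pandas as pd
from dagster import asset

@asset
def raw_people() -> pd.DataFrame:
    # Placeholder data; the real asset would read a file or call an API.
    return pd.DataFrame({"id": [1, 2], "name": ["Ada", "Grace"]})

@asset
def cleaned_people(raw_people: pd.DataFrame) -> pd.DataFrame:
    # Runs only after raw_people has been materialized.
    df = raw_people.copy()
    df["name"] = df["name"].str.lower()
    return df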

8.4.4 Advanced computation in assets
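
Since DuckDB can query an in-memory DataFrame directly by its Python variable name, heavier aggregations inside an asset can be written in SQL rather than pandas. A sketch that consumes the hypothetical cleaned_people asset:

import duckdb
import pandas as pd
from dagster import asset

@asset
def name_statistics(cleaned_people: pd.DataFrame) -> pd.DataFrame:
    # DuckDB scans the cleaned_people DataFrame via a replacement scan.
    return duckdb.sql(
        """
        SELECT name, count(*) AS cnt
        FROM cleaned_people
        GROUP BY name
        ORDER BY cnt DESC
        """
    ).df()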

8.4.5 Uploading to MotherDuck
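
Connecting with the md: prefix targets MotherDuck instead of a local file; authentication typically comes from the MOTHERDUCK_TOKEN environment variable. A minimal sketch with a hypothetical database name and input file:

import duckdb

# Connects to the MotherDuck database my_database (a placeholder name).
con = duckdb.connect("md:my_database")

# Anything DuckDB can read locally can be written straight to the cloud.
con.execute(
    "CREATE OR REPLACE TABLE people AS "
    "SELECT * FROM read_csv_auto('data/people.csv')"
)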

Summary