11 Designing data pipelines

 

This chapter covers

  • Designing data pipelines
  • Comparing data pipeline patterns
  • Choosing data transformation layers
  • Defining role-based access control
  • Building a sample data pipeline

A data pipeline is an automated sequence of steps designed to support the extraction, movement, ingestion, transformation, storage, and presentation of data from a source to a target platform. Data engineers must design data pipelines before writing code to implement them. To create a sound design, they must understand the purpose of the data pipeline, identify its data sources and targets, decide on a data pipeline pattern, choose the appropriate data transformation layers, and consider other user requirements, such as data governance and security.

In this chapter, we will design a data pipeline that ingests data from multiple sources. We will compare data pipeline patterns, including ETL (extract-transform-load), ELT (extract-load-transform), and ETLT (extract-transform-load-transform), and choose the data transformation layers: extract, staging, data warehouse, and presentation. Finally, we will set up role-based access control so that only authorized users can access the data in each layer of the pipeline.
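Before we dive in, the following minimal sketch shows how the patterns differ. It is plain Python with placeholder data and hypothetical extract, transform, and load helpers (none of them come from the sample pipeline built later in this chapter); the only point it illustrates is that ETL and ELT perform the same steps but differ in where the transformation runs relative to the load.

# A minimal sketch contrasting the ETL and ELT patterns.
# The source records, the in-memory "warehouse" list, and the cleaning
# rule are placeholders; only the ordering of the steps matters here.

def extract():
    """Extract raw records from a stand-in source system."""
    return [{"id": 1, "amount": " 42 "}, {"id": 2, "amount": "17"}]

def transform(records):
    """Clean and type the raw records (the T step)."""
    return [{"id": r["id"], "amount": int(r["amount"].strip())} for r in records]

def load(records, target):
    """Load records into a stand-in target table."""
    target.extend(records)
    return target

# ETL: transform before loading, so only cleaned data reaches the target.
etl_target = load(transform(extract()), target=[])

# ELT: load the raw data first, then transform inside the target platform.
elt_raw = load(extract(), target=[])
elt_target = transform(elt_raw)

assert etl_target == elt_target  # same result, different place for the T step

An ETLT pipeline simply combines the two: a light transformation (for example, masking sensitive fields) before the load, followed by the heavier modeling transformations inside the target platform.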

11.1 Designing data pipelines

11.1.1 Extracting data

11.1.2 Comparing data pipeline patterns

11.1.3 Choosing data transformation layers

11.1.4 Organizing data warehouse layers

11.1.5 Creating schemas with access control

11.2 Building a sample data pipeline

11.2.1 Implementing the extraction layer

11.2.2 Implementing the staging layer

11.2.3 Implementing the data warehouse layer

11.2.4 Implementing the reporting layer

Summary