11 Designing data pipelines
This chapter covers
- Designing data pipelines
- Comparing data pipeline patterns
- Choosing data transformation layers
- Defining role-based access control
- Building a sample data pipeline
A data pipeline is an automated sequence of steps that extracts, moves, ingests, transforms, stores, and presents data from a source to a target platform. Data engineers must design a data pipeline before writing the code to implement it. To create a sound design, they must understand the purpose of the pipeline, identify its data sources and targets, decide on a data pipeline pattern, choose the appropriate data transformation layers, and account for other user requirements, such as data governance and security.
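To make those steps concrete before we get to the design work, here is a minimal sketch of a single pipeline run using only the Python standard library. The source file orders.csv, the warehouse.db SQLite database, and the helper functions (extract_orders, transform_orders, load_orders) are hypothetical placeholders for illustration, not the pipeline we build later in this chapter.

```python
import csv
import sqlite3

def extract_orders(path):
    """Extract: read raw order records from a source CSV file."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform_orders(rows):
    """Transform: keep completed orders and cast amounts to numbers."""
    return [
        {"order_id": r["order_id"], "amount": float(r["amount"])}
        for r in rows
        if r["status"] == "completed"
    ]

def load_orders(rows, conn):
    """Load: write the cleaned rows into a target table."""
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders (order_id, amount) VALUES (:order_id, :amount)", rows
    )
    conn.commit()

def run_pipeline(source_path, conn):
    """One automated run: extract from the source, transform, load to the target."""
    load_orders(transform_orders(extract_orders(source_path)), conn)

if __name__ == "__main__":
    run_pipeline("orders.csv", sqlite3.connect("warehouse.db"))
```

Real pipelines add scheduling, error handling, and many more steps, but the shape is the same: an ordered chain of extract, transform, and load stages that runs without manual intervention.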
In this chapter, we will design a data pipeline that ingests data from multiple sources. We will compare data pipeline patterns, including ETL (extract-transform-load), ELT (extract-load-transform), and ETLT (extract-transform-load-transform). We will choose the data transformation layers, such as extract, staging, data warehouse, and presentation. Finally, we will set up role-based access control so that only authorized users can access the data in each layer of the pipeline.
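As a rough preview of the pattern comparison, the sketch below contrasts the ETL and ELT orderings of the same steps: ETL transforms the data before loading it into the target, while ELT loads the raw data first and transforms it inside the target platform. The functions and the in-memory "warehouse" are hypothetical stand-ins, not the implementation we develop later in the chapter.

```python
def extract(source_rows):
    """Extract raw rows from the source (here, an in-memory list)."""
    return list(source_rows)

def transform(rows):
    """Clean the data: keep only rows with a positive amount."""
    return [r for r in rows if r["amount"] > 0]

def load(rows, warehouse):
    """Load rows into the target (a list standing in for a warehouse table)."""
    warehouse.extend(rows)

def run_etl(source_rows, warehouse):
    """ETL: transform before loading, so only cleaned data reaches the target."""
    load(transform(extract(source_rows)), warehouse)

def run_elt(source_rows, warehouse):
    """ELT: load raw data first, then transform inside the target platform."""
    load(extract(source_rows), warehouse)
    warehouse[:] = transform(warehouse)  # stands in for SQL run by the warehouse

if __name__ == "__main__":
    source = [{"order_id": "1", "amount": 25.0}, {"order_id": "2", "amount": -1.0}]
    etl_target, elt_target = [], []
    run_etl(source, etl_target)
    run_elt(source, elt_target)
    print(etl_target == elt_target)  # same end state, different ordering of steps
```

Both runs end with the same cleaned data in the target; what differs is where the transformation happens, which is the trade-off we examine when comparing the patterns.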