Chapter Nine

9 Data Cleaning and Transformation Pipelines in Practice


In the previous chapters, you used AI to clean data, perform advanced transformations, and walk through a full ETL (Extract, Transform, Load) lifecycle from raw API responses to a relational database. You saw AI assist at each stage of the pipeline, from spotting data quality issues to generating SQL for loading structured results, so the core building blocks of modern data engineering should now feel familiar.

In this chapter, we pull those pieces together into a capstone scenario that looks more like a real production workflow. Instead of small, neatly scoped datasets, you will handle the kind of noisy, semi-structured, and unstructured event data that real systems produce. The goal is to show how data quality checks, transformation logic, and lifecycle thinking scale when data arrives continuously rather than in simple batches.

9.1 Data Orchestration

9.1.1 Apache Airflow

9.1.2 Beyond Scheduling

9.1.3 Task Framework

9.2 Event-Driven Architecture

9.2.1 What Are Events?

9.2.2 Pub/Sub and Beyond

9.3 Pipelines in Practice

9.3.1 Inspecting the Data & Inferring the Schema

9.3.2 Extracting the Basics

9.3.3 Data Quality Transformations

9.3.4 Advanced Transformations

9.3.5 Analysis

9.4 Lab

9.5 Lab Answers