8 AI and the Data Lifecycle


In Chapter 5, we showed how AI can play a small but important role in a real-world pipeline: classifying the sentiment of articles from the News API with just a few lines of Python and a single API call. It was fast, clever, and useful, but it was only the beginning.
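As a reminder of how little code that task required, here is a minimal sketch of the idea: fetch one headline from the News API and ask a model to classify its sentiment. The environment variable names, the model name, and the one-word prompt are illustrative assumptions, not the exact code from Chapter 5.

```python
# Sketch: classify the sentiment of one News API headline with an LLM.
# NEWS_API_KEY, the model name, and the prompt wording are assumptions.
import os
import requests
from openai import OpenAI

NEWS_API_KEY = os.environ["NEWS_API_KEY"]   # assumed environment variable
client = OpenAI()                           # reads OPENAI_API_KEY from the environment

# Extract one article headline from the News API
resp = requests.get(
    "https://newsapi.org/v2/everything",
    params={"q": "technology", "pageSize": 1, "apiKey": NEWS_API_KEY},
    timeout=10,
)
headline = resp.json()["articles"][0]["title"]

# Ask the model for a one-word sentiment label
completion = client.chat.completions.create(
    model="gpt-4o-mini",                    # placeholder model name
    messages=[{
        "role": "user",
        "content": "Classify the sentiment of this headline as positive, "
                   "negative, or neutral. Reply with one word.\n\n" + headline,
    }],
)
print(headline, "->", completion.choices[0].message.content.strip())
```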

This chapter expands on that journey. In Chapter 5, we used AI for a single task: classifying article sentiment in a Python workflow. In Chapter 6, we explored how AI can validate and clean messy data, and in Chapter 7 we applied AI-driven transformations to reshape and enrich complex datasets. In this chapter, AI is no longer assisting with just one part of the process; it is integrated at every stage of the data lifecycle.

We will extract articles from the News API, as before. Then we will use AI to extract structured fields, normalize dates and sources, and generate SQL to insert the results into a relational database. This process follows the ETL model (Extract, Transform, Load), a foundational paradigm in data engineering.
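To make the shape of that pipeline concrete before we build each stage in detail, here is a sketch of the three ETL functions, assuming a News API key, an OpenAI-style model call, and a simple SQLite table. The function names, prompt wording, and schema are illustrative assumptions; the chapter develops each step properly in the sections that follow.

```python
# Sketch of the chapter's ETL shape: extract raw articles, transform them into
# structured rows with an AI model, and load those rows into SQLite.
import json
import sqlite3
import requests
from openai import OpenAI

client = OpenAI()   # reads OPENAI_API_KEY from the environment

def extract(api_key: str, query: str) -> list[dict]:
    """Extract: pull raw article payloads from the News API."""
    resp = requests.get(
        "https://newsapi.org/v2/everything",
        params={"q": query, "pageSize": 5, "apiKey": api_key},
        timeout=10,
    )
    return resp.json()["articles"]

def transform(articles: list[dict]) -> list[dict]:
    """Transform: ask the model for structured fields with normalized dates."""
    prompt = (
        "For each article, return a JSON object with keys title, source, "
        "published_date (YYYY-MM-DD), and sentiment. Return a JSON array only.\n\n"
        + json.dumps(articles)
    )
    completion = client.chat.completions.create(
        model="gpt-4o-mini",                # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
    return json.loads(completion.choices[0].message.content)

def load(rows: list[dict], db_path: str = "news.db") -> None:
    """Load: insert the structured rows into a relational table."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles "
        "(title TEXT, source TEXT, published_date TEXT, sentiment TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles VALUES (:title, :source, :published_date, :sentiment)",
        rows,
    )
    conn.commit()
    conn.close()
```

Each function maps to one stage of the lifecycle, which is also how the rest of the chapter is organized: extraction in section 8.2, transformation in 8.3, and loading in 8.4.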

8.1 From AI Insights to Data Pipelines

8.1.1 Evolving AI Integration

8.1.2 Understanding ETL and ELT

8.2 Extracting News Data with AI

8.2.1 Extracting the Raw API Payload

8.2.2 Extracting Data with AI

8.3 Transforming News Data with AI

8.3.1 The Transformation Prompt

8.3.2 The AI Data Engineering Code Harness

8.3.3 The Transformation Pipeline

8.4 Loading News Data with AI

8.4.1 The Contract and Prompt

8.4.2 Response Handling

8.5 Lab

8.6 Lab Answers