
2 Anatomy of an Airflow DAG

 

This chapter covers

  • Running Airflow on your own machine
  • Writing and running your first workflow
  • Getting a first look at the Airflow interface
  • Handling failed tasks in Airflow

By now, you have a good high-level understanding of what data pipelines are and how Airflow can help you manage them. To get a feel for how this works in practice, let’s get our hands dirty with a small example pipeline that demonstrates the basic building blocks found in many workflows.

2.1 Collecting data from numerous sources

Rockets are one of humanity’s engineering marvels, and every rocket launch attracts attention around the world. Our friend John is a rocket enthusiast who tracks and follows every single launch. News about rocket launches is spread across the many sources John keeps track of, and ideally he would like all of his rocket news aggregated in a single location. John recently picked up programming and wants an automated way to collect information about all rocket launches, which he can eventually use to gain personal insight into the latest rocket news. To start small, John decided to first collect images of rockets.

2.1.1 Exploring the data

For the data, we make use of the Launch Library 2 (https://thespacedevs.com/llapi), an online repository of data about both historical and future rocket launches from various sources. It is a free and open API for anybody on the planet (subject to rate limits).
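
Before wiring anything into Airflow, it helps to peek at what the API actually returns. The following is a minimal exploration sketch in Python, assuming the upcoming-launches endpoint of API version 2.2.0 and an image URL per launch; check the API documentation for the exact version and response shape.

import requests

# Query the upcoming-launches endpoint of the Launch Library 2 API.
# (Endpoint version and field names are assumptions; see the API docs
# at https://thespacedevs.com/llapi for the current schema.)
response = requests.get("https://ll.thespacedevs.com/2.2.0/launch/upcoming/")
response.raise_for_status()

# The payload contains a "results" list with one entry per launch.
for launch in response.json()["results"]:
    print(launch["name"], launch.get("image"))

Running this a few times also gives a feel for the rate limits mentioned above, which matter once the pipeline starts polling the API on a schedule.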

2.2 Writing your first Airflow DAG

2.2.1 Tasks vs. operators

2.2.2 Running arbitrary Python code

2.3 Running a DAG in Airflow

2.3.1 Running Airflow in a Python environment

2.3.2 Running Airflow with Docker

2.3.3 Inspecting the DAG in Airflow

2.4 Running at regular intervals

2.5 Handling failing tasks

2.6 DAG versioning

2.7 Summary