
1 Meet Apache Airflow

 

This chapter covers

  • Representing data pipelines as graphs of tasks
  • How Airflow fits into the ecosystem of workflow managers
  • Determining if Airflow is a good fit for you

Enterprises increasingly rely on high-quality data to make data-driven decisions and optimize their business processes. The data volumes involved in these processes have grown substantially over the years, from megabytes per day to gigabytes per minute. Handling this data deluge may seem like a considerable challenge, but these growing volumes can be managed with the appropriate tooling.

Apache Airflow helps you tackle this challenge by letting you build data pipelines that coordinate data operations in an efficient, structured manner. In this process, Airflow is best thought of as an orchestra conductor: it connects to your different systems and coordinates the work between them to ensure a harmonious end result: high-quality data. This work can span a wide variety of operations, from loading data out of a source system and transforming it with queries to training a machine learning model, and more.
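
To make this concrete, the listing below is a minimal sketch of what such a pipeline could look like as an Airflow DAG (assuming Airflow 2.x). The DAG id, the task names, and the daily schedule are purely illustrative and not taken from this chapter; the task callables are left as stubs.

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def _fetch_sales():
    """Placeholder: pull raw data from a source system."""


def _clean_sales():
    """Placeholder: transform the raw data, e.g., with queries."""


def _train_model():
    """Placeholder: train a model on the cleaned data."""


with DAG(
    dag_id="example_sales_pipeline",  # hypothetical name, for illustration only
    start_date=datetime(2024, 1, 1),
    schedule="@daily",  # 'schedule_interval' on Airflow releases before 2.4
    catchup=False,
) as dag:
    fetch_sales = PythonOperator(task_id="fetch_sales", python_callable=_fetch_sales)
    clean_sales = PythonOperator(task_id="clean_sales", python_callable=_clean_sales)
    train_model = PythonOperator(task_id="train_model", python_callable=_train_model)

    # Airflow runs the tasks in this order, handling scheduling,
    # retries, and failure reporting for us.
    fetch_sales >> clean_sales >> train_model

Section 1.1 looks at this graph-of-tasks representation in more detail; the point here is only that the orchestrator, not the individual scripts, decides when and in what order each step runs.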

1.1 Introducing data pipelines

1.1.1 Data pipelines as graphs

1.1.2 Executing a pipeline graph

1.1.3 Pipeline graphs vs. sequential scripts

1.1.4 Running pipelines using workflow managers

1.2 Introducing Airflow

1.2.1 Defining pipelines flexibly in (Python) code

1.2.2 Integration with external systems

1.2.3 Scheduling and executing pipelines

1.2.4 Monitoring and handling failures

1.2.5 Incremental loading and backfilling

1.3 When to use Airflow

1.3.1 Reasons to choose Airflow

1.3.2 Reasons not to choose Airflow

1.4 The rest of this book

1.5 Summary