1 Meet Apache Airflow


This chapter covers:

  • What Apache Airflow is
  • What workflow managers are
  • How Airflow works
  • What problems Airflow solves
  • Whether Airflow is right for your company

1.1       What is Apache Airflow

A data pipeline is a series of steps that together accomplish a larger process. Think of a machine learning pipeline that once a week loads new data, transforms it, trains a model, and finally deploys that model. Or an ETL job that once a day merges multiple data sources and computes aggregate statistics for reporting purposes.

Some pipelines process real-time data; others process data in batches. Each approach has its own benefits. Apache Airflow is a platform for programmatically developing, scheduling, and monitoring batch data pipelines.

Airflow provides a Python framework for developing data pipelines that span many different technologies. The pipelines themselves are also defined in Python scripts, and the framework supplies a set of building blocks for communicating with a wide array of external systems.
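
To make this concrete, here is a minimal sketch of the weekly machine learning pipeline described above, written as an Airflow DAG. This assumes Airflow 2.x; the DAG id and the echo placeholder commands are illustrative stand-ins, not taken from this chapter:

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical weekly machine learning pipeline: four tasks that run
# in sequence, each echo command standing in for a real workload.
with DAG(
    dag_id="weekly_ml_pipeline",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@weekly",
) as dag:
    load = BashOperator(task_id="load_data", bash_command="echo 'load new data'")
    transform = BashOperator(task_id="transform_data", bash_command="echo 'transform data'")
    train = BashOperator(task_id="train_model", bash_command="echo 'train model'")
    deploy = BashOperator(task_id="deploy_model", bash_command="echo 'deploy model'")

    # The >> operator declares dependencies: each task starts only after
    # the previous one has succeeded.
    load >> transform >> train >> deploy

Because the pipeline is plain Python code, it can be versioned, tested, and generated dynamically like any other program; section 1.2.4 returns to this idea of configuration as code.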

Think of Airflow as a spider in a web: it sits in the middle and starts and stops tasks that run on different technologies in different systems. We will isolate and examine each component of Airflow and, along the way, teach the corresponding parts of developing data pipelines with Airflow.

1.2       Introducing workflow managers

1.2.1   Workflow as a series of tasks

1.2.2   Expressing task dependencies

1.2.3   Workflow management systems

1.2.4   Configuration as code

1.2.5   Task execution model of workflow management systems

1.3       An overview of the Airflow architecture

1.3.1   Directed Acyclic Graphs

1.3.2   Batch processing

1.3.3   Defined in Python code

1.3.4   Scheduling and backfilling

1.3.5   Handling failures

1.4       How to know if Airflow is right for you

1.4.1   When can Airflow go wrong?

1.4.2   Who will find Airflow useful?

1.5       Summary
