1 Meet Apache Airflow


This chapter covers

  • Representing data pipelines as graphs of tasks
  • How Airflow fits into the ecosystem of workflow managers
  • Determining if Airflow is a good fit for you

Enterprises are continuously becoming more data-driven and are developing data pipelines as part of their daily business. Data volumes involved in these business processes have increased substantially over the years, from megabytes per day to gigabytes per minute. Though handling this data deluge may seem like a considerable challenge, these increasing data volumes can be managed with the appropriate tooling.

Apache Airflow is one of those tools. Airflow tackles large-scale data processing by splitting it into smaller, time-defined, and more manageable chunks of data known as batches. This book focuses on building data pipelines with Apache Airflow, but that doesn’t mean Airflow can’t be used to orchestrate and schedule other workloads: if you can communicate with a system or tool using Python, you can potentially manage it with Airflow. In fact, one of Airflow’s key features is that it enables you to easily build scheduled data pipelines using Python, while also providing building blocks for stitching together the many different technologies encountered in modern technology landscapes.
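To give you a first taste of what such a Python-defined pipeline looks like, the following listing shows a minimal sketch of an Airflow DAG with two tasks. It is illustrative rather than an example from this chapter: the DAG name (hello_pipeline), the task names, and the echo commands are placeholders, and the listing assumes Airflow 2.4 or later, where the schedule argument is available.

Listing 1.1 A minimal Airflow DAG (illustrative sketch)

from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Two placeholder tasks wired into a daily pipeline. The commands here
# only echo text; in a real pipeline they would fetch and process data.
with DAG(
    dag_id="hello_pipeline",      # hypothetical name, for illustration only
    start_date=datetime(2024, 1, 1),
    schedule="@daily",            # run once per day (Airflow 2.4+ argument)
    catchup=False,                # don't backfill runs before today
) as dag:
    fetch = BashOperator(task_id="fetch_data", bash_command="echo fetching")
    process = BashOperator(task_id="process_data", bash_command="echo processing")

    fetch >> process  # >> sets the dependency: fetch_data runs before process_data

Placing a file like this in Airflow’s DAGs folder is enough for the scheduler to pick it up and run it once per day. The rest of this chapter unpacks the ideas behind this listing: representing pipelines as graphs of tasks, scheduling and executing them, and monitoring the results.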

1.1 Introducing data pipelines

1.1.1 Data pipelines as graphs

1.1.2 Executing a pipeline graph

1.1.3 Pipeline graphs vs. sequential scripts

1.1.4 Running pipelines using workflow managers

1.2 Introducing Airflow

1.2.1 Defining pipelines flexibly in (Python) code

1.2.2 Integration with external systems

1.2.3 Scheduling and executing pipelines

1.2.4 Monitoring and handling failures

1.2.5 Incremental loading and backfilling

1.3 When to use Airflow

1.3.1 Reasons to choose Airflow

1.3.2 Reasons not to choose Airflow

1.4 The rest of this book

1.5 Summary