1 Meet Apache Airflow


This chapter covers:

  • Representing data pipelines as graphs of tasks and task dependencies, which can be executed using workflow managers such as Airflow.
  • Providing a high-level overview of Airflow and how it fits into the broader ecosystem of workflow managers.
  • Examining the strengths and weaknesses of Airflow to determine whether it is a good fit for your specific use cases.

People and companies are increasingly data-driven and are developing data pipelines as part of their daily business. The data volumes involved in these business processes have grown substantially over the years, from megabytes per day to gigabytes per minute. Though handling this data deluge may seem like a considerable challenge, these volumes can be managed with the appropriate tooling.

This book focuses on Apache Airflow, a batch-oriented framework for building data pipelines. Airflow’s key feature is that it enables you to easily build scheduled data pipelines using a flexible Python framework, while also providing many building blocks that allow you to stitch together the many different technologies encountered in modern data landscapes.
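To give a first impression, the sketch below shows what a small scheduled pipeline might look like in Airflow. It is a minimal, hypothetical example (the DAG name, task names, and commands are made up for illustration), assuming Airflow 2.x with the bundled BashOperator; later chapters cover writing pipelines in detail.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Define the pipeline as a graph (a DAG) that Airflow runs once per day.
    with DAG(
        dag_id="example_pipeline",        # hypothetical pipeline name
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
    ) as dag:
        fetch = BashOperator(task_id="fetch_data", bash_command="echo fetching")
        clean = BashOperator(task_id="clean_data", bash_command="echo cleaning")

        # Task dependency: clean_data runs only after fetch_data succeeds.
        fetch >> clean

The >> operator expresses a dependency between two tasks; Airflow uses these dependencies to determine the order in which the tasks in the graph are executed.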

1.1      Introducing data pipelines

1.1.1   Data pipelines as graphs

1.1.2   Executing a pipeline graph

1.1.3   Pipeline graphs vs. sequential scripts

1.1.4   Running pipelines using workflow managers

1.2      Introducing Airflow

1.2.1   Defining pipelines flexibly in (Python) code

1.2.2   Scheduling and executing pipelines

1.2.3   Monitoring and handling failures

1.2.4   Incremental loading and backfilling

1.3      When to use Airflow

1.3.1   Reasons to choose Airflow

1.3.2   Reasons NOT to choose Airflow

1.4      The rest of this book

1.5      Summary
