1 Meet Apache Airflow
This chapter covers
- Representing data pipelines as graphs of tasks
- How Airflow fits into the ecosystem of workflow managers
- Determining if Airflow is a good fit for you
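The first of these ideas, a pipeline as a graph of tasks, can be sketched in plain Python before any Airflow code appears. The snippet below is only an illustration of the dependency-graph concept, not Airflow's API; the task names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# A hypothetical pipeline: each key is a task, each value is the set of
# upstream tasks it depends on.
dependencies = {
    "fetch": set(),
    "clean": {"fetch"},
    "load": {"clean"},
    "report": {"clean"},
}

# A topological sort yields an execution order that respects every dependency:
# 'fetch' runs first, and 'load'/'report' only run after 'clean'.
order = list(TopologicalSorter(dependencies).static_order())
print(order)
```

Scheduling tasks in dependency order like this is, at its core, what a workflow manager automates for you, together with retries, scheduling, and monitoring.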
Enterprises are becoming increasingly data-driven and are developing data pipelines as part of their daily business. The data volumes involved in these business processes have grown substantially over the years, from megabytes per day to gigabytes per minute. Though handling this data deluge may seem like a considerable challenge, these growing volumes can be managed with the appropriate tooling.
This book focuses on Apache Airflow, a batch-oriented[1] framework for building data pipelines[2]. Airflow’s key feature is that it enables you to easily build scheduled data pipelines using Python, while also providing building blocks that let you stitch together the many different technologies encountered in modern technological landscapes.
Airflow is best thought of as the spider in a web: it sits at the center of your data processes and coordinates work happening across the different (distributed) systems. As such, Airflow is not a data processing tool in itself; rather, it orchestrates the different components responsible for processing your data in data pipelines.