1 Meet Apache Airflow
This chapter covers:
- What Apache Airflow is
- What problems Airflow solves
- Whether Airflow is the right tool for you
People and companies are continuously becoming more data-savvy and are developing data pipelines as part of their daily business. Data volumes have grown substantially over the years, rising from mere megabytes per day in early applications to the petabyte-per-day datasets encountered today. Managing this data deluge is possible, but it requires the right tooling.
Some pipelines process real-time data, while others process data in batches. Each approach has its own benefits. Apache Airflow is a platform for developing and monitoring batch data pipelines.
Airflow provides a framework to integrate data pipelines of different technologies. Airflow workflows are defined in Python scripts, which provide a set of building blocks to communicate with a wide array of technologies.
Think of Airflow as the spider in a web: it sits at the center of your data processes, coordinating work happening across many distributed systems. Airflow is not a data processing tool in itself; it orchestrates the tools and systems that do the processing. This book isolates and teaches the individual parts of that sprawling web, and we will examine all of Airflow's components in detail. But first, before we break down a bird's-eye view of Airflow's architecture and determine whether this tool is right for you, let's do a quick overview of workflow managers to align our working mindset.