This chapter covers
- Running DAGs regularly using schedule intervals
- Constructing efficient DAGs that load and process data incrementally
- Designing your DAGs for re-processing past datasets using backfilling
In the previous chapter, we explored Airflow's UI and showed you how to define a basic Airflow DAG and run this DAG every day by defining a schedule interval. In this chapter, we will dive a bit deeper into the concept of scheduling in Airflow and explore how it allows you to process data incrementally at regular intervals. First, we'll introduce a small use case focused on analyzing user events from our website and explore how we can build a DAG to analyze these events at regular intervals. Next, we'll look at ways to make this process more efficient by taking an incremental approach to analyzing our data, and see how this ties into Airflow's concept of execution dates. Finally, we'll show how we can fill in past gaps in our dataset using backfilling, and discuss some important properties of well-designed Airflow tasks.
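As a quick refresher before we dive in, the sketch below shows what a minimal daily-scheduled DAG might look like. The DAG id, start date, and bash command here are placeholders for illustration, not part of the use case we'll build in this chapter:

```python
import datetime as dt

from airflow import DAG
from airflow.operators.bash import BashOperator

# A minimal sketch of a DAG scheduled to run once per day.
dag = DAG(
    dag_id="example_daily_dag",          # placeholder name
    start_date=dt.datetime(2024, 1, 1),  # first date to schedule from
    schedule_interval="@daily",          # run once every day
)

fetch_events = BashOperator(
    task_id="fetch_events",
    bash_command="echo 'fetching events...'",  # stand-in for real work
    dag=dag,
)
```

In the rest of this chapter, we'll see how the schedule interval and start date together determine exactly when (and for which data intervals) Airflow runs a DAG like this one.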