3 Scheduling in Airflow


This chapter covers

  • Running DAGs at regular intervals or when the data is updated
  • Constructing dynamic DAGs to process data incrementally
  • Loading and reprocessing historical data using backfilling
  • Applying best practices to enhance task reliability

In the previous chapter, we explored Airflow’s UI and showed you how to define a basic Airflow DAG and run it every day by defining a schedule interval. In this chapter, we will dive a bit deeper into the concept of scheduling in Airflow and explore how it allows you to process data incrementally at regular intervals. First, we’ll introduce a small use case focused on analyzing user events from our website and explore how we can build a DAG to analyze these events at regular intervals. Next, we’ll explore ways to make this process more efficient by taking an incremental approach to analyzing our data, and we’ll see how this ties into Airflow’s concept of schedule intervals. We’ll also look at scheduling options for runs that follow irregular, event-based times rather than a fixed frequency. In addition, we’ll dive into how we can fill in past gaps in our data set using backfilling and discuss some important properties of well-behaved Airflow tasks. Finally, we will explore how to trigger a DAG when its input data is updated, using Datasets.
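To make the idea of a schedule interval concrete before we dive in, the sketch below shows a minimal DAG that runs once a day. The DAG and task names (user_events, fetch_events) and the bash command are hypothetical placeholders for the use case developed in this chapter; the DAG arguments themselves follow Airflow’s API (Airflow 2.4+, where the schedule argument superseded schedule_interval).

    import datetime as dt

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # A minimal sketch: a DAG scheduled to run once every day,
    # starting from the given start date.
    with DAG(
        dag_id="user_events",                  # hypothetical DAG name
        start_date=dt.datetime(2024, 1, 1),    # first date to schedule for
        schedule="@daily",                     # run once a day at midnight
    ) as dag:
        fetch_events = BashOperator(
            task_id="fetch_events",
            # {{ ds }} is templated by Airflow to the logical date of
            # each run, so every run sees its own day's slice of data.
            bash_command="echo 'fetching events for {{ ds }}'",
        )

Because Airflow passes each run’s logical date into the task’s templates, the same DAG definition can process each day’s slice of data separately; this is the basis for the incremental processing and backfilling we discuss in this chapter.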

3.1 An example: processing user events

3.2 Running at regular intervals

3.2.1 Defining scheduling intervals

3.2.2 Processing data incrementally

3.2.3 Understanding Airflow’s intervals

3.2.4 Using backfilling to fill in past gaps

3.3 Handling irregular intervals

3.4 Reacting to dataset updates

3.4.1 Splitting producers and consumers

3.4.2 Meet the Dataset

3.4.3 The Dataset in action

3.4.4 Using multiple Datasets

3.5 Designing well-behaved tasks

3.5.1 Atomicity

3.5.2 Idempotency

3.6 Summary