3 Time-based scheduling in Airflow


This chapter covers

  • Running DAGs at regular or irregular points in time
  • Processing data incrementally using data intervals
  • Loading and reprocessing previously processed data using backfilling
  • Applying best practices to enhance task reliability
  • Triggering DAGs based on data updates with Data Assets

Previously, we explored Airflow’s UI and showed you how to define a basic Airflow DAG and run it every day by defining a schedule interval. In this chapter, we will dive a bit deeper into the concept of scheduling in Airflow and explore how it allows you to process data incrementally at regular intervals. First, we’ll introduce a small use case focused on analyzing user events from our website and explore how we can build a DAG that analyzes these events at regular points in time. Next, we’ll explore ways to make this process more efficient by taking an incremental approach to analyzing our data and see how this ties into Airflow’s concept of data intervals. We’ll also look at scheduling options based on specific event times. Finally, we’ll dive into how we can fill in past gaps in our dataset using backfilling and discuss some important properties of well-behaved Airflow tasks.
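
As a quick refresher, a minimal sketch of such a daily DAG might look as follows. The DAG and task names here are illustrative, and we assume Airflow 2.4 or later, where the schedule argument accepts presets such as @daily:

    import pendulum
    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Illustrative DAG that runs once per day at midnight (UTC).
    with DAG(
        dag_id="user_events",
        start_date=pendulum.datetime(2024, 1, 1, tz="UTC"),
        schedule="@daily",  # shorthand preset for a daily schedule
    ) as dag:
        fetch_events = BashOperator(
            task_id="fetch_events",
            bash_command="echo 'fetching user events'",  # placeholder for real work
        )

We’ll unpack each part of this schedule definition, and the alternatives to it, over the course of this chapter.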

3.1 An example: processing user events

3.2 The basic components of an Airflow schedule

3.3 Running regularly using trigger-based schedules

3.3.1 Defining a daily schedule

3.3.2 Using Cron expressions

3.3.3 Using shorthand expressions

3.3.4 An alternative: using frequency-based timetables

3.3.5 Summarizing trigger timetables

3.4 Incremental processing with data intervals

3.4.1 Processing data incrementally

3.4.2 Defining incremental schedules with data intervals

3.4.3 Defining intervals using time deltas

3.4.4 Summarizing interval-based schedules

3.5 Handling irregular intervals

3.6 Managing backfilling of historical data

3.7 Designing well-behaved tasks

3.7.1 Atomicity

3.7.2 Idempotency

3.8 Summary