
3 Time-based scheduling

This chapter covers

  • Running DAGs at regular or irregular points in time
  • Processing data incrementally using data intervals
  • Loading and reprocessing previously processed data using backfilling
  • Applying best practices to enhance task reliability

In the first two chapters, we explored Airflow’s UI and learned how to define a basic Airflow directed acyclic graph (DAG) and run it every day by defining a schedule interval. In this chapter, we’ll dive a bit deeper into scheduling in Airflow and explore how it allows us to process data incrementally at regular intervals. First, we’ll introduce a small use case focused on analyzing user events from our website and explore how to build a DAG that analyzes these events at regular points in time. Next, we’ll look at ways to make this process more efficient by taking an incremental approach to analyzing our data and see how this ties into Airflow’s concept of data intervals. We’ll also look at scheduling options based on specific event times. Finally, we’ll show how to fill gaps in our data set using backfilling and discuss important properties of well-behaved Airflow tasks.
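To make the idea of incremental processing concrete before we dive in, the sketch below shows how a schedule interval partitions a time range into consecutive, non-overlapping data intervals, each of which would be processed by one DAG run. This is plain Python for illustration only, not Airflow’s actual API; the function name `data_intervals` is ours.

```python
from datetime import datetime, timedelta

def data_intervals(start, end, interval):
    """Partition [start, end) into consecutive, non-overlapping intervals.

    Illustrative only: Airflow computes these intervals for you from the
    DAG's schedule; this just shows the underlying idea.
    """
    current = start
    while current + interval <= end:
        # Each tuple is one data interval; the run for an interval is
        # triggered only after that interval has fully passed.
        yield (current, current + interval)
        current += interval

# A daily schedule between Jan 1 and Jan 4 yields three data intervals.
intervals = list(
    data_intervals(datetime(2024, 1, 1), datetime(2024, 1, 4), timedelta(days=1))
)
for start, end in intervals:
    print(start, "→", end)  # e.g. first interval: 2024-01-01 00:00 → 2024-01-02 00:00
```

Note that the run covering an interval starts only after the interval ends: the data for January 1 is complete at midnight on January 2, which is when that run can safely execute. We’ll return to this point in detail in section 3.4.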

3.1 Processing user events

3.2 The basic components of an Airflow schedule

3.3 Running regularly using trigger-based schedules

3.3.1 Defining a daily schedule

3.3.2 Using cron expressions

3.3.3 Using shorthand expressions

3.3.4 Using frequency-based timetables

3.3.5 Summarizing trigger timetables

3.4 Incremental processing with data intervals

3.4.1 Processing data incrementally

3.4.2 Defining incremental schedules with data intervals

3.4.3 Defining intervals using frequencies

3.4.4 Summarizing interval-based schedules

3.5 Handling irregular intervals

3.6 Managing backfilling of historical data

3.7 Designing well-behaved tasks

3.7.1 Atomicity

3.7.2 Idempotency

Summary