3 Scheduling in Airflow

 

This chapter covers

  • Running DAGs at regular intervals
  • Constructing dynamic DAGs to process data incrementally
  • Loading and reprocessing past data sets using backfilling
  • Applying best practices for reliable tasks

In the previous chapter, we explored Airflow's UI and showed you how to define a basic Airflow DAG and run it every day by defining a schedule interval. In this chapter, we will dive a bit deeper into the concept of scheduling in Airflow and explore how it allows you to process data incrementally at regular intervals. First, we'll introduce a small use case focused on analyzing user events from our website and explore how we can build a DAG to analyze these events at regular intervals. Next, we'll explore ways to make this process more efficient by taking an incremental approach to analyzing our data, and see how this ties into Airflow's concept of execution dates. Finally, we'll show how we can fill in past gaps in our data set using backfilling, and discuss some important properties of well-designed Airflow tasks.
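Before diving in, it helps to have a mental model of what a schedule interval does: it partitions time into fixed-length windows, and each DAG run processes the data for one completed window. The sketch below is plain Python, not Airflow code; the `schedule_windows` helper is a hypothetical illustration of how a daily schedule carves a date range into intervals.

```python
from datetime import datetime, timedelta

def schedule_windows(start, interval, until):
    """Yield (window_start, window_end) pairs for a fixed-length schedule.

    Conceptual sketch only: Airflow's scheduler behaves similarly,
    triggering a DAG run for each interval once that interval has ended.
    """
    window_start = start
    while window_start + interval <= until:
        yield (window_start, window_start + interval)
        window_start += interval

# A daily schedule starting 2019-01-01, observed on 2019-01-04, has three
# completed windows; the run for each window executes at the window's end.
windows = list(
    schedule_windows(datetime(2019, 1, 1), timedelta(days=1), datetime(2019, 1, 4))
)
```

Note that the run for the 2019-01-01 window does not execute until 2019-01-02, since the window's data is only complete once the window has passed; this detail is the key to understanding Airflow's execution dates, which we return to in section 3.4.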

3.1 An example: Processing user events

 
 

3.2 Running at regular intervals

 
 
 

3.2.1 Defining scheduling intervals

 
 

3.2.2 Cron-based intervals

 
 
 

3.2.3 Frequency-based intervals

 
 

3.3 Processing data incrementally

 
 
 

3.3.1 Fetching events incrementally

 
 
 

3.3.2 Dynamic time references using execution dates

 
 
 

3.3.3 Partitioning your data

 
 

3.4 Understanding Airflow’s execution dates

 
 

3.4.1 Executing work in fixed-length intervals

 
 
 
 

3.5 Using backfilling to fill in past gaps

 
 

3.5.1 Executing work back in time

 
 

3.6 Best practices for designing tasks

 
 
 
 

3.6.1 Atomicity

 
 
 

3.6.2 Idempotency

 

Summary

 
 
 