12 Best practices

This chapter covers

  • Writing clean, understandable DAGs using style conventions
  • Using consistent approaches for managing credentials and configuration options
  • Generating repeated DAGs and tasks using factory functions
  • Designing reproducible tasks by enforcing idempotency and determinism constraints
  • Handling data efficiently by limiting the amount of data processed in your DAG
  • Using efficient approaches for handling/storing (intermediate) data sets
  • Managing concurrency using resource pools

In the previous chapters, we described most of the basic elements that go into building and designing data processes using Airflow DAGs. In this chapter, we dive deeper into best practices that can help you write well-architected DAGs that are both easy to understand and efficient in how they handle your data and resources.

12.1 Writing clean DAGs

Writing DAGs can easily become a messy business. DAG code can quickly become overly complicated or difficult to read, especially if DAGs are written by team members with very different programming styles. In this section, we share some tips to help you structure and style your DAG code, which should bring some (often needed) clarity to your intricate data processes.

12.1.1 Use style conventions

12.1.2 Manage credentials centrally

12.1.3 Specify configuration details consistently

12.1.4 Avoid doing any computation in your DAG definition

12.1.5 Use factories to generate common patterns

12.1.6 Group related tasks using task groups

12.1.7 Be explicit when specifying your DAG schedule

12.1.8 Use Dynamic Task Mapping to dynamically generate tasks

12.2 Designing reproducible tasks

12.2.1 Always require tasks to be idempotent

12.2.2 Ensure task results are deterministic

12.2.3 Design tasks using functional paradigms

12.3 Handling data efficiently

12.3.1 Limit the amount of data being processed

12.3.2 Load/process data incrementally

12.3.3 Cache intermediate data

12.3.4 Don’t store data on local file systems

12.3.5 Offload work to external/source systems

12.4 Managing concurrency using pools

12.5 Summary