11 Best practices

 

This chapter covers

  • Writing clean, understandable DAGs using style conventions
  • Using consistent approaches for managing credentials and configuration options
  • Generating repeated DAGs and tasks using factory functions
  • Designing reproducible tasks by enforcing idempotency and determinism constraints
  • Handling data efficiently by limiting the amount of data processed in your DAG
  • Using efficient approaches for handling/storing (intermediate) data sets
  • Managing concurrency using resource pools

In previous chapters, we have described most of the basic elements that go into building and designing data processes using Airflow DAGs. In this chapter, we dive a bit deeper into some best practices that can help you write well-architected DAGs that are both easy to understand and efficient in terms of how they handle your data and resources.

11.1 Writing clean DAGs

Writing DAGs can easily become a messy business. For example, DAG code can quickly become overly complicated or difficult to read—especially if DAGs are written by team members with very different styles of programming. In this section, we touch on some tips to help you structure and style your DAG code, hopefully providing some (often needed) clarity for your intricate data processes.

11.1.1 Use style conventions

11.1.2 Manage credentials centrally

11.1.3 Specify configuration details consistently

11.1.4 Avoid doing any computation in your DAG definition

11.1.5 Use factories to generate common patterns

11.1.6 Group related tasks using task groups

11.1.7 Create new DAGs for big changes

11.2 Designing reproducible tasks

11.2.1 Always require tasks to be idempotent

11.2.2 Task results should be deterministic

11.2.3 Design tasks using functional paradigms

11.3 Handling data efficiently

11.3.1 Limit the amount of data being processed

11.3.2 Incremental loading/processing

11.3.3 Cache intermediate data
