
11 Best practices

 

This chapter covers:

  • Writing clean, understandable DAGs using style conventions.
  • Using consistent approaches for managing credentials and configuration options.
  • Efficiently generating repeated DAGs and task structures using factory functions and DAG/task configurations.
  • Designing reproducible tasks by enforcing idempotency and determinism constraints, optionally using approaches inspired by functional programming.
  • Handling data efficiently by limiting the amount of data processed in your DAG, as well as using efficient approaches for handling/storing (intermediate) datasets.
  • Effectively managing the resources of your (big) data processes by processing data in the most appropriate systems, while managing concurrency using resource pools.

In previous chapters, we described most of the basic elements involved in building and designing data processes with Airflow DAGs. In this chapter, we dive deeper into best practices that can help you write well-architected DAGs that are both easy to understand and efficient in how they handle your data and resources.

11.1  Writing clean DAGs

11.1.1    Use style conventions

11.1.2    Manage credentials centrally
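
In practice, Airflow centralizes credentials in its connections and secrets backends rather than hard-coding them in DAG files. As a minimal language-level sketch of the same idea (environment variables stand in for such a backend, and `get_credential` is a hypothetical helper, not an Airflow API):

```python
import os

def get_credential(name, default=None):
    """Resolve a credential from one central place; here, environment
    variables stand in for Airflow's connection/secrets backend."""
    value = os.environ.get("PIPELINE_SECRET_" + name.upper(), default)
    if value is None:
        raise KeyError("credential %r is not configured" % name)
    return value

# Every task resolves secrets through the same helper, so rotating a
# secret means updating exactly one backing store, not many DAG files.
os.environ["PIPELINE_SECRET_DB_PASSWORD"] = "s3cret"  # demo value only
assert get_credential("db_password") == "s3cret"
```

The point of the single helper is that no task ever embeds a secret directly; the actual store behind it can change without touching the DAGs.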

11.1.3    Specify configuration details consistently

11.1.4    Avoid doing any computation in your DAG definition
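
The scheduler re-parses DAG files continuously, so any top-level computation in the file is repeated on every parse, not just on task runs. A small sketch (with a hypothetical `load_big_config` standing in for an expensive call) shows how deferring the work into the task callable avoids this:

```python
PARSE_COUNT = {"expensive_calls": 0}

def load_big_config():
    """Stands in for an expensive call (API request, database query, ...)."""
    PARSE_COUNT["expensive_calls"] += 1
    return {"tables": ["a", "b"]}

# Bad: `config = load_big_config()` at module level would run on every
# scheduler parse of the DAG file, not only when a task executes.

# Good: the expensive work lives inside the task callable, so it runs
# only when the task itself is executed.
def extract_task(**context):
    config = load_big_config()
    return config["tables"]

assert PARSE_COUNT["expensive_calls"] == 0  # parsing the module is free
extract_task()
assert PARSE_COUNT["expensive_calls"] == 1  # work happens at run time
```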

11.1.5    Use factories to generate common patterns
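
A factory function captures a repeated task pattern once and stamps out variations from a configuration list. The sketch below uses hypothetical names and plain dictionaries in place of real operator instances:

```python
def make_ingest_tasks(sources):
    """Factory producing one (task_id, command) spec per source system.
    In a real DAG, each spec would become e.g. a BashOperator instance."""
    tasks = []
    for source in sources:
        tasks.append({
            "task_id": "ingest_" + source,
            "command": "ingest.py --source %s --dest /data/%s" % (source, source),
        })
    return tasks

# Adding a new source is a one-line config change, not copy-pasted tasks.
tasks = make_ingest_tasks(["sales", "customers", "products"])
assert [t["task_id"] for t in tasks] == [
    "ingest_sales", "ingest_customers", "ingest_products",
]
```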

11.1.6    Group related tasks using task groups

11.1.7    Create new DAGs for big changes

11.2  Designing reproducible tasks

11.2.1    Always require tasks to be idempotent
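
An idempotent task leaves the system in the same end state no matter how often it is re-run, typically by overwriting its output partition rather than appending to it. A minimal sketch, using a hypothetical `write_partition` helper:

```python
import json
import tempfile
from pathlib import Path

def write_partition(base_dir, partition_date, records):
    """Idempotent write: each run (re)writes the full partition for its
    date, so re-running the task yields the same end state instead of
    appending duplicate rows."""
    path = Path(base_dir) / ("date=%s.json" % partition_date)
    path.write_text(json.dumps(records))  # overwrite, never append
    return path

base = tempfile.mkdtemp()
write_partition(base, "2019-01-01", [{"id": 1}])
path = write_partition(base, "2019-01-01", [{"id": 1}])  # rerun: same state
assert json.loads(path.read_text()) == [{"id": 1}]
```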

11.2.2    Task results should be deterministic
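
A deterministic task's output depends only on its inputs, so values such as the schedule interval must be passed in as explicit arguments rather than read from `datetime.now()` or other hidden state. A sketch with a hypothetical `select_window` function:

```python
from datetime import datetime

def select_window(records, interval_start, interval_end):
    """Deterministic: the result depends only on the arguments, so a
    rerun for the same interval produces the identical output."""
    return [r for r in records if interval_start <= r["ts"] < interval_end]

records = [
    {"ts": datetime(2019, 1, 1, 5)},
    {"ts": datetime(2019, 1, 2, 5)},
]
# Airflow can pass the schedule interval into a task (e.g. via templated
# arguments); here we simply pass it explicitly.
window = select_window(records, datetime(2019, 1, 1), datetime(2019, 1, 2))
assert len(window) == 1
```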

11.2.3    Design tasks using functional paradigms

11.3  Handling data efficiently

11.3.1    Limit the amount of data being processed

11.3.2    Incremental loading/processing
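
Incremental processing means each run touches only the data belonging to its own schedule interval instead of reloading everything. One common approach is to build the interval bounds into the query, sketched here with a hypothetical helper (table and column names are illustrative):

```python
def incremental_query(table, interval_start, interval_end):
    """Build a query that fetches only the rows for the current schedule
    interval, rather than rescanning the entire table on every run."""
    return (
        "SELECT * FROM %s "
        "WHERE event_time >= '%s' "
        "AND event_time < '%s'" % (table, interval_start, interval_end)
    )

# Each DAG run gets a different, non-overlapping slice of the data.
print(incremental_query("events", "2019-01-01", "2019-01-02"))
```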

11.3.3    Cache intermediate data

11.3.4    Don’t store data on local file systems

11.3.5    Offload work to external/source systems
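
Source systems such as databases are often far better at filtering and aggregating than a Python worker that pulls raw rows over the network. A sketch of pushing an aggregation down to the source as a single statement (table and column names are hypothetical):

```python
def build_pushdown_statement(source_table, dest_table):
    """Offload the aggregation to the database: one statement executed
    in the source system, instead of pulling raw rows into the worker
    and aggregating them in Python."""
    return (
        "CREATE TABLE %s AS "
        "SELECT customer_id, SUM(amount) AS total "
        "FROM %s GROUP BY customer_id" % (dest_table, source_table)
    )

# The worker only submits the statement; the heavy lifting happens
# inside the database itself.
print(build_pushdown_statement("raw.orders", "reporting.order_totals"))
```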

11.4  Managing your resources

11.5  Summary