10 Best Practices


This chapter covers:

  • Writing clean, understandable DAGs using style conventions
  • Creating consistent approaches for managing credentials and configuration options
  • Generating repeated DAGs and task structures using factory functions and DAG/task configurations
  • Designing reproducible tasks by enforcing idempotency and determinism constraints, optionally using approaches inspired by functional programming
  • Handling data efficiently by limiting the amount of data processed in your DAG, as well as using efficient approaches for handling/storing (intermediate) datasets
  • Managing the resources of your (big) data processes by processing data in the most appropriate systems, whilst managing concurrency using resource pools

In previous chapters, we have described most of the basic elements that go into building and designing data processes using Airflow DAGs. In this chapter, we dive a bit deeper into best practices that can help you write well-architected DAGs that are both easy to understand and efficient in how they handle your data and resources.
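As a small taste of the idempotency constraint covered in section 10.2, the sketch below shows a task that writes one output file per execution date, overwriting any previous output for that date rather than appending to it. The function name and file layout are illustrative assumptions, not code from this chapter:

```python
import json
import pathlib

def export_partition(records, output_dir, execution_date):
    """Write one output file per execution date (overwrite, don't append).

    Re-running the task for the same execution date produces exactly the
    same file, which makes the task idempotent: retries and backfills can
    safely run it any number of times without duplicating data.
    """
    path = pathlib.Path(output_dir) / f"data_{execution_date}.json"
    path.write_text(json.dumps(records, sort_keys=True))
    return path
```

Contrast this with a task that appends to a single shared file: rerunning it for the same date would duplicate records, so a retry or backfill would silently corrupt the output.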

10.1  Writing clean DAGs

10.1.1    Use style conventions

10.1.2    Manage credentials centrally

10.1.3    Specify configuration details consistently

10.1.4    Avoid doing any computation in your DAG definition

10.1.5    Use factories to generate common patterns

10.1.6    Create new DAGs for big changes

10.2  Designing reproducible tasks

10.2.1    Always require tasks to be idempotent

10.2.2    Task results should be deterministic

10.2.3    Design tasks using functional paradigms

10.3  Handling data efficiently

10.3.1    Limit the amount of data being processed

10.3.2    Incremental loading/processing

10.3.3    Cache intermediate data

10.3.4    Don’t store data on local file systems

10.3.5    Offload work to external/source systems

10.4  Managing your resources

10.4.1    Manage concurrency using pools

10.4.2    Detect long-running tasks using SLAs and alerts

10.5  Summary
