10 Best Practices

 

This chapter covers:

  • Writing clean, understandable DAGs using style conventions
  • Creating consistent approaches for managing credentials and configuration options
  • Generating repeated DAGs and task structures using factory functions and DAG/task configurations
  • Designing reproducible tasks by enforcing idempotency and determinism constraints, optionally using approaches inspired by functional programming
  • Handling data efficiently by limiting the amount of data processed in your DAG, as well as using efficient approaches for handling/storing (intermediate) datasets
  • Managing the resources of your (big) data processes by processing data in the most appropriate systems, whilst managing concurrency using resource pools

In previous chapters, we described most of the basic elements that go into building and designing data processes using Airflow DAGs. In this chapter, we dive deeper into best practices that can help you write well-architected DAGs that are both easy to understand and efficient in how they handle your data and resources.

10.1  Writing clean DAGs

 

10.1.1    Use style conventions

 
 
 

10.1.2    Manage credentials centrally

 
 

10.1.3    Specify configuration details consistently

 

10.1.4    Avoid doing any computation in your DAG definition

 
 
 

10.1.5    Use factories to generate common patterns

 
 
 
 

10.1.6    Create new DAGs for big changes

 
 
 

10.2  Designing reproducible tasks

 
 

10.2.1    Always require tasks to be idempotent

 
 

10.2.2    Task results should be deterministic

 
 
 

10.2.3    Design tasks using functional paradigms

 

10.3  Handling data efficiently

 
 
 

10.3.1    Limit the amount of data being processed

 
 
 
 

10.3.2    Incremental loading/processing

 
 

10.3.3    Cache intermediate data

 
 

10.3.4    Don’t store data on local file systems

 
 

10.3.5    Offload work to external/source systems

 

10.4  Managing your resources

 
 
 

10.4.1    Manage concurrency using pools

 
 
 

10.4.2    Detect long-running tasks using SLAs and alerts

 
 
 

10.5  Summary

 
 