13 Orchestrating data pipelines

 

This chapter covers

  • Orchestrating data pipelines with Snowflake tasks
  • Sending notifications from tasks
  • Orchestrating with task graphs
  • Monitoring data pipeline execution
  • Troubleshooting data pipeline failures

A data pipeline is a series of steps that ingest and transform data. Pipelines are usually scheduled to run at predefined times, often at night, to ensure that business users have fresh data every morning. If users need more recent data, data engineers can schedule the pipelines to run more frequently, such as every hour or every few minutes.

Because data pipelines involve many steps, data engineers must ensure that the steps execute in the correct order. They also need visibility into pipeline execution: how long it took, how much data it ingested, and whether it finished successfully. The process of scheduling, defining dependencies, handling errors, and sending notifications to ensure that pipeline steps execute efficiently is called data pipeline orchestration.
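To give a first taste of what orchestration looks like in Snowflake before we build the full pipeline in this chapter, the following sketch creates a scheduled root task and a dependent task that runs after it. The warehouse, task, table, and stage names are hypothetical placeholders, not objects defined elsewhere in this book.

-- A minimal orchestration sketch with hypothetical names.
-- The root task runs every night at 02:00 UTC and loads raw data.
CREATE OR REPLACE TASK load_raw_orders
  WAREHOUSE = transform_wh
  SCHEDULE = 'USING CRON 0 2 * * * UTC'
AS
  COPY INTO raw.orders FROM @orders_stage;

-- The dependent task runs only after the root task completes,
-- transforming the freshly loaded rows.
CREATE OR REPLACE TASK transform_orders
  WAREHOUSE = transform_wh
  AFTER load_raw_orders
AS
  INSERT INTO analytics.orders_clean
  SELECT * FROM raw.orders WHERE order_date = CURRENT_DATE();

-- Tasks are created in a suspended state; resume the dependent task
-- first, then the root task, so no run is skipped.
ALTER TASK transform_orders RESUME;
ALTER TASK load_raw_orders RESUME;

The SCHEDULE clause handles the scheduling aspect of orchestration, while the AFTER clause expresses the dependency between the two steps. The remaining aspects, error handling, notifications, and monitoring, are covered in the later sections of this chapter.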

13.1 Orchestrating with Snowflake tasks

13.1.1 Creating a schema to store the orchestration objects

13.1.2 Designing the orchestration tasks

13.1.3 Creating tasks with dependencies

13.2 Sending email notifications

13.3 Orchestrating with task graphs

13.3.1 Designing the task graph

13.3.2 Creating the root task

13.3.3 Creating the finalizer task

13.3.4 Viewing the task graph

13.4 Monitoring data pipeline execution

13.4.1 Adding logging functionality to tasks

13.4.2 Summarizing logging information in an email notification

13.5 Troubleshooting data pipeline failures

Summary