4 Orchestration


In this chapter:

  • Building a data ingestion pipeline
  • Introducing Azure Data Factory
  • DevOps for Azure Data Factory
  • Monitoring with Azure Monitor

We’ll look at the final pieces of core infrastructure for our data platform: orchestration and monitoring. DevOps is where we store all our code and configuration and from where we deploy our services. The storage layer is where we ingest data and on top of which we run our workloads. Orchestration is the layer that handles data movement and all other automated processing. Figure 4.1 highlights the platform layer we’ll focus on in this chapter.

Figure 4.1 The orchestration layer handles scheduling for all tasks and data movement into and out of the data platform.

We’ll start with a real-world scenario: ingesting the Bing COVID-19 open dataset into our data platform. Microsoft provides several open datasets for anyone to use; one of them tracks COVID-19 cases. We’ll create a pipeline that brings this dataset into our Azure Data Explorer cluster.
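Before building the pipeline, it helps to take a quick look at the data itself. The dataset is also published as a Parquet file in the Azure Open Datasets pandemic data lake; the following is a minimal sketch that loads it with pandas, assuming the dataset’s public URL at the time of writing (it may move, so check the Azure Open Datasets catalog):

import pandas as pd  # Parquet support also requires pyarrow

# Public location of the Bing COVID-19 dataset in the Azure Open Datasets
# pandemic data lake; this URL is current at the time of writing.
URL = (
    "https://pandemicdatalake.blob.core.windows.net/public/curated/"
    "covid-19/bing_covid-19_data/latest/bing_covid-19_data.parquet"
)

df = pd.read_parquet(URL)
print(df.shape)             # number of rows and columns
print(df.columns.tolist())  # inspect the schema before setting up ingestion

Knowing the shape and column names of the data up front makes it easier to define the matching table in Azure Data Explorer later.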

We’ll use Azure Data Factory (ADF) for this. As a reminder, Azure Data Factory is Azure’s cloud ETL service for scale-out serverless data integration and data transformation. We’ll spin up an Azure Data Factory instance, set up the pipeline, and get an overview of the main Azure Data Factory components. Once the pipeline runs, we’ll have the data in our Azure Data Explorer cluster.
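We’ll walk through the setup step by step, but the provisioning step can also be scripted. The following is a minimal sketch using the Python management SDK (azure-mgmt-datafactory); the subscription ID, resource group, factory name, and region are placeholders, and the resource group is assumed to already exist:

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import Factory

# Placeholders: substitute your own subscription, resource group, and names.
SUBSCRIPTION_ID = "00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP = "adf-rg"    # assumed to exist already
FACTORY_NAME = "covid-adf"   # Data Factory names must be globally unique
LOCATION = "westus2"

# DefaultAzureCredential picks up Azure CLI, environment, or managed identity.
client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

factory = client.factories.create_or_update(
    resource_group_name=RESOURCE_GROUP,
    factory_name=FACTORY_NAME,
    factory=Factory(location=LOCATION),
)
print(f"Provisioned {factory.name}: {factory.provisioning_state}")

Scripting the deployment like this pays off later in the chapter, when we look at DevOps for Data Factory and want environments we can recreate on demand.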

4.1      Ingesting the Bing COVID-19 open dataset

4.2      Introducing Azure Data Factory

4.2.1   Setting up the data source

4.2.2   Setting up the data sink

4.2.3   Setting up the pipeline

4.2.4   Setting up a trigger

4.2.5   Orchestrating with Azure Data Factory

4.3      DevOps for Azure Data Factory

4.3.1   Deploying Azure Data Factory from Git

4.3.2   Setting up access control

4.3.3   Deploying the production Data Factory

4.3.4   DevOps for Data Factory recap

4.4      Monitoring with Azure Monitor

4.5      Summary
