5 Orchestrating ML pipelines
This chapter covers
- Building a batch pipeline for model inference using Kubeflow Pipelines
In Chapter 4, we established reliable tracking of ML experiments with MLflow and feature management with Feast. However, these tools still require manual intervention to coordinate model training, feature updates, and inference. This is where pipeline orchestration becomes crucial (Figure 5.1).
Figure 5.1 The mental map, where we are now focusing on Kubeflow Pipelines orchestration (A).
In this chapter, we'll use Kubeflow Pipelines to automate these workflows, making our ML systems more scalable and reproducible. Through a practical income classification example, we'll see how to turn manual steps into automated, reusable pipeline components. All the code for this chapter is available on GitHub: https://github.com/practical-mlops/chapter-5
5.1 Kubeflow Pipelines, the task orchestrator
Most machine learning inference pipelines share a common structure: we retrieve data from somewhere (an object store, a data warehouse, or a file system), preprocess that data, retrieve or load a model, and then perform inference. The predictions are then written to a database or uploaded to cloud storage. Such a pipeline needs to run periodically, and we may need to pass it runtime parameters such as a date/time or a feature table name. All of this is possible with Kubeflow Pipelines.
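To make this structure concrete before we dive in, here is a minimal sketch of such a batch inference pipeline using the Kubeflow Pipelines (KFP) v2 Python SDK. The component bodies are stubs, and the names (`load_data`, `preprocess`, `run_inference`, the `feature_table` and `model_uri` parameters) are hypothetical placeholders, not the code we build later in this chapter.

```python
# A skeletal batch inference pipeline: load data -> preprocess -> infer.
from kfp import dsl, compiler


@dsl.component(base_image="python:3.11")
def load_data(feature_table: str, run_date: str,
              raw_data: dsl.Output[dsl.Dataset]):
    # Hypothetical: pull rows for `run_date` from `feature_table`
    # (object store, warehouse, ...) and write them to the output artifact.
    with open(raw_data.path, "w") as f:
        f.write(f"rows from {feature_table} for {run_date}\n")


@dsl.component(base_image="python:3.11")
def preprocess(raw_data: dsl.Input[dsl.Dataset],
               features: dsl.Output[dsl.Dataset]):
    # Hypothetical feature engineering step; here it just copies the data.
    with open(raw_data.path) as src, open(features.path, "w") as dst:
        dst.write(src.read())


@dsl.component(base_image="python:3.11")
def run_inference(features: dsl.Input[dsl.Dataset], model_uri: str,
                  predictions: dsl.Output[dsl.Dataset]):
    # Hypothetical: load the model from `model_uri`, score the features,
    # and write the predictions to the output artifact.
    with open(predictions.path, "w") as f:
        f.write(f"predictions from {model_uri}\n")


@dsl.pipeline(name="batch-inference")
def batch_inference_pipeline(feature_table: str, run_date: str,
                             model_uri: str):
    # Runtime parameters (date, feature table, model location) flow in here
    # and can be set per run, e.g., by a recurring schedule.
    data = load_data(feature_table=feature_table, run_date=run_date)
    feats = preprocess(raw_data=data.outputs["raw_data"])
    run_inference(features=feats.outputs["features"], model_uri=model_uri)


if __name__ == "__main__":
    # Compile to a YAML package that can be uploaded to a KFP cluster.
    compiler.Compiler().compile(batch_inference_pipeline,
                                "batch_inference.yaml")
```

Each `@dsl.component` function runs in its own container, and artifacts are passed between steps through the `Input`/`Output` annotations; the compiled YAML can then be scheduled as a recurring run so the pipeline executes periodically without manual intervention.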