
9 Model training and validation: Part 2

 

This chapter covers

  • Storing and retrieving datasets with Kubernetes PersistentVolumes
  • Using MLflow and TensorBoard to track and visualize training
  • Understanding the importance of lineage and experiment tracking

In production ML systems, effective model training extends beyond algorithms and datasets: it requires robust infrastructure for data management, experiment tracking, and model versioning (figure 9.1). While chapter 8 focused on building basic training pipelines, this chapter tackles the challenges of scaling those pipelines for production use. Through hands-on examples using both our ID card detection and movie recommendation systems, we'll explore how to manage large datasets efficiently with Kubernetes PersistentVolumes and PersistentVolumeClaims (PVCs), track experiments systematically with MLflow and TensorBoard, and maintain clear model lineage for production deployments.
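
To give a sense of the kind of pipeline code this chapter builds toward, the short sketch below shows the general pattern of requesting a PersistentVolumeClaim with a Kubeflow Pipelines VolumeOp and mounting it into a container step. It is a minimal illustration rather than the chapter's actual pipeline: the pipeline name, claim name, 10 Gi size, busybox image, and /data mount path are placeholder assumptions.

import kfp
from kfp import dsl


@dsl.pipeline(name="pvc-preview", description="Minimal PVC-plus-mount example")
def pvc_preview_pipeline():
    # Request a PersistentVolumeClaim that downstream steps can share.
    # The resource name and size here are illustrative placeholders.
    vop = dsl.VolumeOp(
        name="create-dataset-volume",
        resource_name="dataset-pvc",
        size="10Gi",
        modes=dsl.VOLUME_MODE_RWO,
    )

    # A placeholder step that mounts the claim at /data and writes into it.
    dsl.ContainerOp(
        name="download-dataset",
        image="busybox",
        command=["sh", "-c", "echo 'dataset goes here' > /data/marker.txt"],
        pvolumes={"/data": vop.volume},
    )


if __name__ == "__main__":
    # Compile the pipeline to YAML so it can be uploaded to Kubeflow Pipelines.
    kfp.compiler.Compiler().compile(pvc_preview_pipeline, "pvc_preview.yaml")

Section 9.1 applies this pattern to the ID card detection pipeline so that the download, dataset splitting, training, and validation steps can all share the same claim instead of copying large artifacts between components.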

Figure 9.1 The mental map, where we continue focusing on two components of our pipeline: model training (4) and evaluation (5)
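
Experiment tracking follows the same infrastructure-first theme. Section 9.3.4 tracks the movie recommender's training runs with MLflow; as a rough preview of what that involves, the sketch below logs hyperparameters, a metric, and a trained model to an MLflow tracking server. The tracking URI, experiment name, toy dataset, and logistic regression model are placeholder assumptions, not the recommender code from this chapter.

import mlflow
import mlflow.sklearn
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Placeholder tracking server and experiment name; adjust to your setup.
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("movie-recommender-preview")

# A toy dataset standing in for the real training data.
X, y = make_classification(n_samples=200, n_features=5, random_state=42)

with mlflow.start_run(run_name="preview-run"):
    params = {"C": 1.0, "max_iter": 200}
    model = LogisticRegression(**params).fit(X, y)

    # Log hyperparameters, a metric, and the trained model itself.
    mlflow.log_params(params)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, artifact_path="model")

Logging every run this way is what makes lineage questions ("which data and parameters produced this model?") answerable later, which is why we treat experiment tracking as part of the training pipeline rather than an afterthought.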

9.1 Storing data with PersistentVolumeClaim

9.1.1 Refactoring the pipeline with a PVC

9.1.2 Efficient dataset management

9.1.3 Creating a VolumeOp

9.1.4 Download Op using PVC

9.1.5 Splitting the dataset directly

9.1.6 Simplifying model training

9.1.7 Simplifying model validation

9.2 Tracking training with TensorBoard

9.2.1 Launching a new TensorBoard

9.2.2 Exploring YOLOv8’s default graphs

9.3 Movie recommender project

9.3.1 Reading data from MinIO and quality assurance

9.3.2 Model training component

9.3.3 Metrics for evaluation

9.3.4 Experiment tracking with MLflow

9.3.5 Model registry with MLflow

9.3.6 Creating a pipeline from components

9.3.7 Local inference in a notebook