
9 Model training and validation: Part 2

 

This chapter covers

  • Storing and retrieving datasets with Kubernetes Persistent Volumes
  • Using MLflow and TensorBoard to track and visualize training
  • The importance of lineage and experiment tracking

In production ML systems, effective model training extends beyond algorithms and datasets: it requires robust infrastructure for data management, experiment tracking, and model versioning (Figure 9.1).

Figure 9.1 The mental map, where we continue focusing on the second and third steps of our pipeline: model training (4) and evaluation (5)

While Chapter 8 focused on building basic training pipelines, this chapter tackles the challenges of scaling those pipelines for production use. Through hands-on examples using both our ID card detection and movie recommendation systems, we'll explore how to manage large datasets efficiently with Kubernetes Persistent Volumes, track experiments systematically with MLflow and TensorBoard, and maintain clear model lineage for production deployments.
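Before we dive into section 9.1, the following minimal sketch previews the data-handling pattern we will build there: a VolumeOp creates a PersistentVolumeClaim, a download step fills it, and a later step mounts the same claim. It assumes the Kubeflow Pipelines v1 SDK (kfp.dsl.VolumeOp and dsl.ContainerOp); the pipeline name, container images, dataset URL, and mount paths are placeholders rather than this chapter's actual components.

# Minimal sketch, assuming the Kubeflow Pipelines v1 SDK.
# Image names, URL, and paths are placeholders, not the book's real pipeline.
import kfp
from kfp import dsl


@dsl.pipeline(
    name="dataset-on-pvc",
    description="Store a dataset on a PVC once and reuse it downstream.",
)
def dataset_on_pvc_pipeline(dataset_url: str = "https://example.com/dataset.tar.gz"):
    # Request a PersistentVolumeClaim that outlives any single step.
    vop = dsl.VolumeOp(
        name="create-dataset-pvc",
        resource_name="dataset-pvc",    # becomes part of the PVC name
        size="10Gi",
        modes=dsl.VOLUME_MODE_RWO,
    )

    # Download the dataset onto the mounted volume.
    download = dsl.ContainerOp(
        name="download-dataset",
        image="curlimages/curl:8.5.0",  # placeholder image
        command=["sh", "-c"],
        arguments=[f"curl -L {dataset_url} -o /data/dataset.tar.gz"],
        pvolumes={"/data": vop.volume},
    )

    # A later training step mounts the same volume; using download.pvolume
    # also makes this step run after the download has finished.
    train = dsl.ContainerOp(
        name="train",
        image="python:3.10-slim",       # placeholder image
        command=["sh", "-c"],
        arguments=["ls -lh /data && echo 'training would start here'"],
        pvolumes={"/data": download.pvolume},
    )


if __name__ == "__main__":
    # Compile to a YAML spec that can be uploaded to a Kubeflow Pipelines instance.
    kfp.compiler.Compiler().compile(dataset_on_pvc_pipeline, "dataset_on_pvc.yaml")

Because the claim outlives any single step, the dataset is downloaded once and shared by every component that mounts it, which is what makes this approach attractive for large training datasets.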

9.1 Storing data with PersistentVolumeClaim

9.1.1 Creating a VolumeOp

9.1.2 Download Op using PVC

9.2 Tracking training with TensorBoard

9.3 Movie recommender project

9.3.1 Reading data from MinIO and quality assurance

9.3.2 Model training component

9.3.3 Metrics for evaluation

9.3.4 Experiment tracking with MLflow

9.3.5 Model registry with MLflow

9.3.6 Creating a pipeline from components

9.3.7 Local inference in a notebook

9.4 Summary