5 Training machine learning models on Kubernetes


This chapter covers

  • Understanding distributed machine learning training
  • Distributed training with TensorFlow and Kubeflow
  • Distributed training with PyTorch and Kubeflow
  • Using alternate schedulers with Kubernetes

In Chapter 4, you learned how to use distributed architectures to speed up data analytics. It’s now time to explore how Kubernetes helps you scale machine learning, specifically model training.

But first, we must answer why you should consider Kubernetes for training models. Of all the steps in the ML development lifecycle, model training is the most resource-intensive. Classical ML models, such as simple classification and regression models, the kind that were all the rage before Transformers came along, needed far fewer resources than today’s multimodal foundation models. Because these classical models were trained on much smaller volumes of data, training them didn’t take long.

5.1 Distributed training in a nutshell

5.2 Distributed training with TensorFlow

5.2.1 TensorFlow cluster

5.2.2 MultiWorkerMirroredStrategy

5.2.3 ParameterServerStrategy

5.3 Training models with Kubeflow

5.3.1 Installing Kubeflow Trainer

5.3.2 Training TensorFlow models with Kubeflow

5.3.3 Recapping the process

5.4 Distributed training with PyTorch

5.4.1 PyTorch training with Kubeflow

5.5 Running MPI jobs with Kubeflow

5.5.1 Fault-tolerant training with Elastic Horovod

5.6 Improving efficiency with alternate Kubernetes schedulers

5.7 Optimizing distributed training

5.8 Cleanup

5.9 Summary