5 Training machine learning models on Kubernetes
This chapter covers
- Understanding distributed machine learning training
- Distributed training with TensorFlow and Kubeflow
- Distributed training with PyTorch and Kubeflow
- Using alternate schedulers with Kubernetes
In Chapter 4, you learned how to use distributed architectures to speed up data analytics. It’s now time to explore how Kubernetes helps you scale machine learning, specifically model training.
But first, we must answer why you should consider Kubernetes when training models. Of all the steps in the ML development lifecycle, model training is the most resource-intensive. Classical ML models such as simple classification and regression models, the kind that were all the rage before Transformers came into being, needed far fewer resources than today’s multimodal foundation models. Because these classical models were trained on much smaller volumes of data, training them didn’t take long.