6 Distributed computing with Ray and Kubernetes
This chapter covers:
- Introduction to Ray
- Setting up Ray clusters on Kubernetes
- Distributing model training with Ray Train
- Optimizing hyperparameters with Ray Tune
- Tracking experiments with MLflow
- Serving models using Ray Serve
In Chapter 5, we explored how to use Kubernetes for single-node and multi-node model training. We saw how Kubeflow automates the setup of a cluster of model trainers in Kubernetes, streamlining the management of complex machine learning workflows. Even so, we've barely scratched the surface of the Kubeflow project; its capabilities extend far beyond model training. We will return to Kubeflow later, but for now we take a detour to discuss Ray, a versatile framework for distributing data analytics and machine learning workloads that can be deployed on Kubernetes.
6.1 Introduction to Ray
Ray is a distributed computing framework that has gained significant traction in the machine learning community thanks to its simplicity and scalability. It offers a unified Python API for turning ordinary functions and classes into distributed tasks and actors, which makes it especially useful for running data analytics and machine learning workloads at scale.
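To make that concrete, here is a minimal illustration of Ray's core task API (the `square` function is our own example, not part of this chapter's later code). Decorating a plain Python function with `@ray.remote` lets Ray schedule it across the workers in a cluster:

```python
import ray

# Start Ray. Locally this launches a single-node cluster in the
# background; on a real cluster you would pass the cluster address.
ray.init()

# The @ray.remote decorator turns an ordinary function into a Ray task
# that can run on any worker in the cluster.
@ray.remote
def square(x):
    return x * x

# Calling .remote() submits the task asynchronously and immediately
# returns a future (an object reference), not the result itself.
futures = [square.remote(i) for i in range(4)]

# ray.get() blocks until the results are ready and fetches them.
print(ray.get(futures))  # [0, 1, 4, 9]
```

The same code runs unchanged whether Ray is started on a laptop or connected to a multi-node cluster, which is the sense in which the API is "unified." We will see how to stand up such a cluster on Kubernetes in the next section.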