6 Distributed computing with Ray and Kubernetes
This chapter covers:

  • Introduction to Ray
  • Setting up Ray clusters on Kubernetes
  • Distributing model training with Ray Train
  • Optimizing hyperparameters with Ray Tune
  • Tracking experiments with MLflow
  • Serving models using Ray Serve

In Chapter 5, we explored how to use Kubernetes for single- and multi-node model training. We saw how Kubeflow automates setting up a cluster of model trainers in Kubernetes, streamlining the management of complex machine learning workflows. We've barely scratched the surface of the Kubeflow project; its capabilities extend far beyond model training. We shall return to Kubeflow, but for now we take a detour to discuss Ray, a versatile framework that uses Kubernetes to distribute data analytics and machine learning workloads.

6.1 Introduction to Ray

Ray is a distributed computing framework that has gained significant traction in the machine learning community due to its simplicity and scalability. It offers a unified API for distributed computing that is especially useful for running data analytics and machine learning workloads at scale.

6.1.1 Anatomy of a Ray cluster

6.1.2 Setting up a Ray cluster with KubeRay

6.2 Running workloads on a Ray cluster

6.2.1 Using Ray interactively from notebooks

6.3 Customizing a cluster using RayCluster

6.4 Running jobs in a Ray cluster

6.4.1 Submitting Ray Jobs using the API

6.5 Adding fault tolerance to a Ray cluster

6.6 Running batch workloads with RayJob

6.7 Hyperparameter tuning with Ray

6.7.1 Tracking experiments with MLflow

6.8 Inference with Ray Serve

6.9 Scaling Ray Serve

6.10 Comparing Ray with Kubeflow

6.11 Summary