chapter three

3 Building an ML platform in Kubernetes

This chapter covers

Setting up an Amazon EKS Kubernetes cluster using Terraform
Creating an Ingress using NGINX Ingress Controller
Deploying an identity provider using Keycloak
Creating a scalable data science development environment using JupyterHub
Enabling GPU workloads in Kubernetes

After learning the fundamentals of Kubernetes in the previous chapter, you are now prepared to deploy the toolchains that will power a scalable, secure, and resilient system for your ML projects. In the previous chapter, you may have run a Kubernetes cluster on your own machine; in this chapter, you will deploy a production-grade Kubernetes cluster using Amazon EKS.

When operating in the cloud, it is best to use managed services (like Amazon EKS) whenever they suit your needs. Managed services allow organizations to outsource the management of infrastructure resources to the cloud provider. For example, creating a Kubernetes cluster from scratch is a complicated process, but in the cloud you can create a production-grade Kubernetes cluster in minutes. By leveraging managed services, you reduce your operational burden and benefit from the expertise and resources of the cloud provider. Because these services are built and maintained by teams of experts who specialize in them, they are typically more reliable, scalable, and secure than self-managed setups.

3.1 Creating a Kubernetes cluster using Amazon EKS

3.2 Enabling integration with cloud services

3.2.1 Enabling load balancing

3.2.2 Enabling cluster autoscaling

3.2.3 Providing persistent storage

3.3 Setting up an identity system

3.3.1 Configuring DNS

3.3.2 Getting a TLS certificate

3.3.3 Installing the NGINX Ingress Controller

3.3.4 Deploying Keycloak

3.3.5 Preparing Keycloak for client authentication

3.3.6 Creating a user in Keycloak

3.3.7 Understanding the authentication workflow

3.4 Creating a self-service development environment

3.4.1 Deploying JupyterHub

3.4.2 Role-based access control with Keycloak

3.4.3 Providing persistence to notebook servers

3.4.4 Customizing the user environment

3.4.5 Enabling GPU workloads in Kubernetes

3.4.6 Reducing wastage by shutting down idle notebooks

3.4.7 Optimizing GPU node utilization

3.5 Summary