chapter six

6 Large Language Models in Production: A practical guide

This chapter covers

How to structure an LLM service and which tools to use to deploy it
How to create and prepare a Kubernetes cluster for LLM deployment
Common production challenges and some methods to handle them
Deploying models to the edge

We did it. We arrived. This is the chapter we wanted to write when we first thought about writing this book. I remember the first model I ever deployed. Words can’t describe how much more satisfaction it gave me than the dozens of projects left to rot on my laptop. In my mind it sits on a pedestal, not because it was good (in fact, it was quite terrible) but because it was useful and actually used by those who needed it the most. It made an impact on the lives of those around me.

So what actually is production? "Production" refers to the phase in which a model is integrated into a live, operational environment where it performs its intended tasks or provides services to end users. It's the crucial phase that makes the model available for real-world applications and services. To that end, we will show you how to package an LLM into a service or API so that it can take on-demand requests. We will then show you how to set up a cluster in the cloud where you can deploy this service, and share some challenges you may face in production along with tips for handling them. Lastly, we will talk about a different kind of production: deploying models on edge devices.

6.1 Creating an LLM Service

6.1.1 Model Compilation

6.1.2 LLM Storage Strategies

6.1.3 Adaptive Request Batching

6.1.4 Flow Control

6.1.5 Streaming Responses

6.1.6 Feature Store

6.1.7 Retrieval-Augmented Generation

6.1.8 LLM Service Libraries

6.2 Setting up Infrastructure

6.2.1 Provisioning Clusters

6.2.2 Autoscaling

6.2.3 Rolling Updates

6.2.4 Inference Graphs

6.2.5 Monitoring

6.3 Production Challenges

6.3.1 Model Updates and Retraining

6.3.2 Load Testing

6.3.3 Troubleshoot Poor Latency

6.3.4 Resource Management

6.3.5 Cost Engineering

6.3.6 Security

6.4 Deploying to the Edge

6.5 Summary