6 Large language model services: A practical guide

This chapter covers

  • How to structure an LLM service and the tools to deploy it
  • How to create and prepare a Kubernetes cluster for LLM deployment
  • Common production challenges and some methods to handle them
  • Deploying models to the edge

The production of too many useful things results in too many useless people.
—Karl Marx

We did it. We arrived. This is the chapter we wanted to write when we first thought about writing this book. One author remembers the first model he ever deployed. Words can’t describe how much more satisfaction this gave him than the dozens of projects left to rot on his laptop. In his mind, it sits on a pedestal, not because it was good—in fact, it was quite terrible—but because it was useful and actually used by those who needed it the most. It affected the lives of those around him.

6.1 Creating an LLM service with RAG and more

6.1.1 Model compilation

6.1.2 LLM storage strategies

6.1.3 Adaptive request batching

6.1.4 Flow control

6.1.5 Streaming responses

6.1.6 Feature store

6.1.7 Retrieval-augmented generation

6.1.8 LLM service libraries

6.2 Setting up infrastructure

6.2.1 Provisioning clusters

6.2.2 Autoscaling

6.2.3 Rolling updates

6.2.4 Inference graphs

6.2.5 Monitoring

6.3 Cost engineering, security, latency, and other production challenges

6.3.1 Model updates and retraining

6.3.2 Load testing

6.3.3 Troubleshooting poor latency

6.3.4 Resource management

6.3.5 Cost engineering

6.3.6 Security

Summary