5 Hosting, scaling, and load testing
This chapter covers
- Choosing a way to deploy your app
- Containerizing your app
- Wiring your Azure Web App to GitHub for automatic builds and releases
- Scaling up to handle many queries
- Using load testing tools like Locust to ensure your RAG app doesn’t break under pressure
Up to now our RAG chatbot has lived a sheltered life: a single process on one laptop, an in-memory SQLite file, zero real users. That’s perfect for experimentation, but production traffic is a far less forgiving audience. Conversations arrive in bursts, browser tabs multiply, and sooner or later someone in finance asks why the chatbot takes five minutes to answer. This chapter is the bridge between “it works on my machine” and “it survives a stampede.”
We’ll start by talking about statelessness, the north-star principle behind modern deployment. If any copy of our service should be able to handle any request, then local disks are off-limits for durable storage, configuration must travel through environment variables, and startup needs to be fast enough that a cluster can kill and replace instances at will. That philosophy naturally points us toward containers, because a container captures every library, build step, and port exposure in a single artifact that can boot identically on a developer laptop, an Azure App Service, or a Kubernetes node running on a computer in your garage.
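To make the "configuration travels through environment variables" idea concrete, here is a minimal sketch of a stateless settings loader. The variable names (`DATABASE_URL`, `PORT`, `LOG_LEVEL`) and the `load_settings` function are illustrative assumptions, not the chapter's actual app code; the point is that every replica reads the same variables and holds nothing on local disk.

```python
import os

def load_settings() -> dict:
    """Build all runtime configuration from environment variables.

    Any replica of the service can boot from the same container image;
    the environment, not the filesystem, decides how it behaves.
    Names below are hypothetical examples.
    """
    return {
        # Durable storage lives outside the container, behind a URL.
        "database_url": os.environ.get("DATABASE_URL", "sqlite:///:memory:"),
        # The platform (e.g. a Web App) often injects the port to bind.
        "port": int(os.environ.get("PORT", "8000")),
        "log_level": os.environ.get("LOG_LEVEL", "INFO"),
    }

if __name__ == "__main__":
    settings = load_settings()
    print(settings["port"])
```

Because the defaults are safe and every value comes from the environment, the cluster can kill an instance and start a fresh one anywhere without any migration of local state.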