11 Scaling up: Best practices for production deployment
This chapter covers
- Challenges and deployment options to consider for an application ready for production
- Production best practices covering scalability, latency, caching, and managed identities
- Observability of LLM applications, with some practical examples
- LLMOps and how it complements MLOps
When organizations are ready to take their generative AI models from proof of concept (PoC) to production, they embark on a journey that demands attention to several key aspects. This chapter discusses deployment and scaling options and shares best practices for making generative AI solutions operational, reliable, performant, and secure.
Deploying and scaling generative AI models in a production setting is a complex undertaking. While building a PoC can be a thrilling way to test an idea's feasibility, taking it to production introduces a whole new realm of operational, technical, and business considerations.

This chapter focuses on the key aspects developers must consider when deploying and scaling generative AI models in a production environment: the operational criteria critical to monitoring a system's health, the available deployment options, and best practices for ensuring reliability, performance, and security.