11 Scaling Up: Best Practices for Production Deployment
This chapter covers
- Challenges and deployment options to consider when an application is ready for production
- Production best practices covering scalability, latency, caching, and managed identities
- An overview and practical examples of observability for LLM applications
- An overview of LLMOps and how it complements MLOps
When organizations are ready to take their generative AI models from proof of concept (PoC) to the real world of production, they embark on a journey that requires careful attention to several key aspects. This chapter is your guide: it discusses deployment and scaling options and shares best practices for making generative AI solutions operational, reliable, performant, and secure.
Deploying and scaling generative AI models in a production setting is a complex task that demands attention to many factors. While building a PoC can be a thrilling way to test an idea's feasibility, taking it to production introduces a whole new realm of operational, technical, and business considerations.
This chapter focuses on the key aspects developers must consider when deploying and scaling generative AI models in a production environment. We will discuss the operational criteria critical to monitoring system health, the available deployment options, and best practices for ensuring reliability, performance, and security.