11 Scaling up: Best practices for production deployment

 

This chapter covers

  • Challenges and deployment options to consider when taking an application to production
  • Production best practices covering scalability, latency, caching, and managed identities
  • Observability of LLM applications, with some practical examples
  • LLMOps and how it complements MLOps

When organizations are ready to take their generative AI models from proof of concept (PoC) to production, they embark on a journey that requires careful consideration of several key aspects. This chapter discusses deployment and scaling options and shares best practices for making generative AI solutions operational, reliable, performant, and secure.

Deploying and scaling generative AI models in production is a complex undertaking. While building a PoC can be a thrilling way to test an idea’s feasibility, taking it to production introduces a whole new set of operational, technical, and business considerations.

We will focus on the key aspects developers must consider when deploying and scaling generative AI models in a production environment: the operational criteria critical to monitoring a system’s health, the available deployment options, and best practices for ensuring reliability, performance, and security.

11.1 Challenges for production deployments

11.2 Deployment options

11.3 Managed LLMs via API

11.4 Best practices for production deployment

11.4.1 Metrics for LLM inference

11.4.2 Latency

11.4.3 Scalability

11.4.4 PAYGO

11.4.5 Quotas and rate limits

11.4.6 Managing quota

11.4.7 Observability

11.4.8 Security and compliance considerations

11.5 GenAI operational considerations

11.5.1 Reliability and performance considerations

11.5.2 Managed identities

11.5.3 Caching

11.6 LLMOps and MLOps

11.7 Checklist for production deployment

Summary