6 Performance optimization techniques for LLMs and agents

 

This chapter covers

  • Implementing core architectural patterns for LLM optimization including token streaming, batching, semantic caching, and multi-model fallback
  • Building a production-ready service that handles high traffic loads efficiently
  • Developing comprehensive evaluation strategies for both LLMs and agents
  • Creating robust testing frameworks using LLM-as-judge approaches and trajectory analysis

LLMs and agents have revolutionized the way we build intelligent applications, from conversational assistants to advanced decision-making systems. They can draft personalized emails, answer complex questions, and take actions on a user’s behalf. But raw capability isn’t enough when these systems are deployed in production. Without optimization, even the best LLMs can stumble under the weight of real-world demands.

Imagine this scenario: you’ve built a cutting-edge customer support agent for an online retail store. Under normal load it works flawlessly, responding to customer queries, handling refunds, and recommending products. But when Black Friday traffic hits, the system struggles:

  • Response times spike, frustrating customers who expect instant answers.
  • Costs skyrocket, as the system relies on large, expensive models for every query.
  • Errors creep in, with the agent providing inconsistent or even incorrect responses.

6.1 Essential architectural patterns for LLM-based systems

6.1.1 Token streaming: presenting answers incrementally or all at once
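
To make the pattern concrete, here is a minimal streaming sketch using the OpenAI Python SDK; the model name is an assumption, and any provider that returns incremental chunks can be swapped in.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def stream_answer(prompt: str) -> str:
    """Print tokens as they arrive instead of waiting for the full reply."""
    stream = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; use whatever your account offers
        messages=[{"role": "user", "content": prompt}],
        stream=True,          # ask the API for incremental chunks
    )
    parts = []
    for chunk in stream:
        delta = chunk.choices[0].delta.content or ""
        print(delta, end="", flush=True)  # show text as soon as it is generated
        parts.append(delta)
    print()
    return "".join(parts)
```

The trade-off this section examines is between this incremental presentation and buffering the same loop's output so the answer is returned all at once.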

6.1.2 Handling surges with batching: system-level vs. OpenAI’s Batch API
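
A sketch of system-level micro-batching (as opposed to OpenAI's offline Batch API): incoming requests are queued for a short window and processed together. The call_llm_batch helper is hypothetical and stands in for whatever batched model call your stack provides.

```python
import asyncio

async def call_llm_batch(prompts: list[str]) -> list[str]:
    """Hypothetical helper that answers several prompts in one model call."""
    return [f"answer to: {p}" for p in prompts]  # stub so the sketch runs

class MicroBatcher:
    """Collects requests for a short window, then processes them as one batch."""

    def __init__(self, max_batch: int = 16, max_wait: float = 0.05):
        self.max_batch = max_batch
        self.max_wait = max_wait
        self.queue: asyncio.Queue = asyncio.Queue()

    async def submit(self, prompt: str) -> str:
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((prompt, fut))
        return await fut  # resolved once the batch containing this prompt completes

    async def run(self) -> None:
        while True:
            batch = [await self.queue.get()]  # wait for the first request
            deadline = asyncio.get_running_loop().time() + self.max_wait
            # Keep pulling requests until the batch is full or the window closes.
            while len(batch) < self.max_batch:
                timeout = deadline - asyncio.get_running_loop().time()
                if timeout <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), timeout))
                except asyncio.TimeoutError:
                    break
            answers = await call_llm_batch([prompt for prompt, _ in batch])
            for (_, fut), answer in zip(batch, answers):
                fut.set_result(answer)
```

In a web service you would start asyncio.create_task(batcher.run()) once at startup, and each request handler would simply await batcher.submit(prompt).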

6.1.3 Caching for efficiency
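
A minimal sketch of a semantic cache, assuming a stand-in embed function in place of a real embedding model: a new query reuses a cached answer when its embedding is close enough to an earlier query's.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in for a real embedding model call; replace with your provider of choice."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    vec = rng.standard_normal(384)
    return vec / np.linalg.norm(vec)  # unit length, so a dot product is cosine similarity

class SemanticCache:
    """Returns a cached answer when a new query is similar enough to a past one."""

    def __init__(self, threshold: float = 0.92):
        self.threshold = threshold
        self.entries: list[tuple[np.ndarray, str]] = []  # (query embedding, answer)

    def get(self, query: str) -> str | None:
        q = embed(query)
        for vec, answer in self.entries:
            if float(np.dot(q, vec)) >= self.threshold:  # cosine similarity on unit vectors
                return answer
        return None

    def put(self, query: str, answer: str) -> None:
        self.entries.append((embed(query), answer))
```

On each request, check cache.get(query) before calling the model; on a miss, generate the answer and store it with cache.put(query, answer). An exact-match cache (a plain dictionary keyed on the normalized query text) is the simpler variant of the same idea.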

6.1.4 Multi-model fallback: matching each query to the right model
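
A sketch of a fallback ladder, assuming the OpenAI Python SDK and illustrative model names in MODEL_LADDER: cheaper models are tried first, and errors or timeouts escalate the query to a stronger model.

```python
import time
from openai import OpenAI

client = OpenAI()

# Illustrative model names, ordered cheapest and fastest first.
MODEL_LADDER = ["gpt-4o-mini", "gpt-4o"]

def answer_with_fallback(prompt: str, attempts_per_model: int = 2) -> str:
    """Try cheaper models first; escalate to stronger models on repeated failures."""
    last_error: Exception | None = None
    for model in MODEL_LADDER:
        for attempt in range(attempts_per_model):
            try:
                response = client.chat.completions.create(
                    model=model,
                    messages=[{"role": "user", "content": prompt}],
                    timeout=10,  # fail fast so the fallback can kick in
                )
                return response.choices[0].message.content
            except Exception as exc:  # rate limits, timeouts, transient API errors
                last_error = exc
                time.sleep(0.5 * (attempt + 1))  # simple backoff before retrying
    raise RuntimeError("all models in the ladder failed") from last_error
```

The same ladder can also be driven by a router that inspects the query (length, topic, required tools) and picks the starting model, rather than always beginning with the cheapest one.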

6.1.5 Project: building an e-commerce LLM service with batching, caching, and model fallback
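
To show how the pieces fit together, here is an illustrative request path that reuses the SemanticCache and MicroBatcher sketches above; the names are hypothetical, and a full service would also need error handling and observability around each step.

```python
# Hypothetical wiring of the earlier sketches into one request path.
cache = SemanticCache()
batcher = MicroBatcher()

async def handle_query(prompt: str) -> str:
    """Cache first, then batch; model fallback happens inside the batched call."""
    cached = cache.get(prompt)
    if cached is not None:
        return cached                      # cheapest path: no model call at all
    answer = await batcher.submit(prompt)  # queued and processed with other requests
    cache.put(prompt, answer)
    return answer
```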

6.1.6 Further improvements

6.2 Measuring and evaluating LLM and agent performance

6.2.1 Core metrics for evaluating LLMs
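
As a sketch of how such metrics can be collected, the harness below assumes a hypothetical generate function and an illustrative per-token price; it reports exact-match accuracy, latency percentiles, and estimated cost over a small labeled dataset.

```python
import statistics
import time

def generate(prompt: str) -> tuple[str, int]:
    """Hypothetical model call returning (answer, tokens_used); replace with your client."""
    return "42", 120  # stub so the sketch runs

def evaluate(dataset: list[tuple[str, str]], cost_per_1k_tokens: float = 0.002) -> dict:
    """Compute accuracy, latency percentiles, and estimated cost over (prompt, reference) pairs."""
    latencies, correct, tokens = [], 0, 0
    for prompt, reference in dataset:
        start = time.perf_counter()
        answer, used = generate(prompt)
        latencies.append(time.perf_counter() - start)
        tokens += used
        correct += int(answer.strip().lower() == reference.strip().lower())  # exact match
    latencies.sort()
    return {
        "accuracy": correct / len(dataset),
        "p50_latency_s": statistics.median(latencies),
        "p95_latency_s": latencies[int(0.95 * (len(latencies) - 1))],
        "estimated_cost_usd": tokens / 1000 * cost_per_1k_tokens,
    }
```

Exact match is only a starting point; quality metrics for open-ended generation (semantic similarity, rubric scoring, LLM-as-judge) slot into the same loop in place of the comparison line.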

6.2.2 Evaluating agents: beyond traditional LLM metrics
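
A sketch of the two techniques named in this chapter's opening bullets, LLM-as-judge scoring and trajectory analysis, assuming the OpenAI Python SDK, an illustrative judge model, and a hypothetical JUDGE_PROMPT: the judge grades the agent's final answer, while the trajectory check verifies that the expected tool calls appear in order among the agent's actual calls.

```python
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading a customer-support answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (unusable) to 5 (excellent) for correctness and helpfulness.
Reply with the number only."""

def judge_answer(question: str, answer: str) -> int:
    """LLM-as-judge: ask a (typically stronger) model to grade the agent's final answer."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed judge model
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,   # keep grading as deterministic as possible
    )
    return int(response.choices[0].message.content.strip())

def trajectory_matches(actual_tools: list[str], expected_tools: list[str]) -> bool:
    """Trajectory analysis: did the agent call the expected tools in the expected order?
    Extra steps are tolerated as long as the expected ones appear as an ordered subsequence."""
    remaining = iter(actual_tools)
    return all(step in remaining for step in expected_tools)
```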

6.3 Summary