7 Observability and experimentation: seeing and improving what AI does
This chapter covers
- The observability data model: sessions, traces, spans, generations, and scores
- Structured logging that captures AI-specific context like token usage, model decisions, and safety evaluations
- Distributed tracing that follows a single user request across every platform service
- Attaching quality scores to production traces through automated, model-based, and human evaluation
- Cost attribution and budget tracking
- Designing the Experimentation Service for experiment lifecycle management, evaluation, and A/B testing
- Building the improvement loop that connects observability to experimentation
Every platform service we've built so far produces valuable signals. The Model Service records token counts and latency on every request. The Session Service tracks conversation lengths and context window utilization. The Data Service measures retrieval relevance scores. The Guardrails Service logs every policy evaluation and its outcome. But these signals exist in isolation. When Sarah's patient intake assistant takes four seconds to respond instead of the usual one second, she can't tell whether the delay came from a slow model call, an expensive vector search, a guardrail evaluation that triggered a secondary classification, or a tool execution that timed out. The data exists somewhere in each service's local logs, but nothing ties it together into a coherent story.
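To make the problem concrete, here is a minimal sketch of the kind of trace structure that would answer Sarah's question. The `Span` and `Trace` classes and the service names are hypothetical placeholders, not the data model this chapter develops; the point is only that once every service's timing lands in one record keyed by a shared trace ID, finding the culprit becomes a one-line query instead of a log-grepping expedition.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    # One timed operation inside a single service (names are illustrative).
    service: str
    operation: str
    duration_ms: float

@dataclass
class Trace:
    # All spans for one user request, tied together by a shared trace ID.
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def slowest_span(self) -> Span:
        # With every service reporting into the same trace,
        # "where did the four seconds go?" is a simple max().
        return max(self.spans, key=lambda s: s.duration_ms)

# A hypothetical four-second request broken down across platform services.
trace = Trace(trace_id="req-123", spans=[
    Span("model-service", "chat_completion", 900.0),
    Span("data-service", "vector_search", 2800.0),
    Span("guardrails-service", "policy_eval", 150.0),
    Span("tool-runtime", "tool_execution", 150.0),
])

culprit = trace.slowest_span()
print(f"{culprit.service}/{culprit.operation}: {culprit.duration_ms} ms")
```

In this made-up breakdown the vector search, not the model call, dominates the latency, which is exactly the kind of conclusion that is impossible to reach when each service keeps its timings in its own local logs.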