7 Observability and experimentation: seeing and improving what AI does
This chapter covers
- The observability data model: sessions, traces, spans, generations, and scores
- Structured logging that captures AI-specific context like token usage, model decisions, and safety evaluations
- Distributed tracing that follows a single user request across every platform service
- Attaching quality scores to production traces through automated, model-based, and human evaluation
- Cost attribution and budget tracking
- Designing the Experimentation Service for experiment lifecycle management, evaluation, and A/B testing
- Building the improvement loop that connects observability to experimentation
Every platform service we've built so far produces valuable signals. The Model Service records token counts and latency on every request. The Session Service tracks conversation lengths and context window utilization. The Data Service measures retrieval relevance scores. The Guardrails Service logs every policy evaluation and its outcome. But these signals exist in isolation. When Sarah's patient intake assistant takes four seconds to respond instead of the usual one second, she can't tell whether the delay came from a slow model call, an expensive vector search, a guardrail evaluation that triggered a secondary classification, or a tool execution that timed out. The data exists somewhere in each service's local logs, but nothing ties it together into a coherent story.
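To make the problem concrete, here is a minimal sketch of the kind of trace structure that would answer Sarah's question. The `Span` and `Trace` classes and the service names are hypothetical placeholders, not the data model this chapter develops; the point is only that once every service's timing lands in one record keyed by a shared trace ID, finding the culprit becomes a one-line query instead of a log-grepping expedition.

```python
from dataclasses import dataclass, field

@dataclass
class Span:
    # One timed operation inside a single service (names are illustrative).
    service: str
    operation: str
    duration_ms: float

@dataclass
class Trace:
    # All spans for one user request, tied together by a shared trace ID.
    trace_id: str
    spans: list[Span] = field(default_factory=list)

    def slowest_span(self) -> Span:
        # With every service reporting into the same trace,
        # "where did the four seconds go?" is a simple max().
        return max(self.spans, key=lambda s: s.duration_ms)

# A hypothetical four-second request broken down across platform services.
trace = Trace(trace_id="req-123", spans=[
    Span("model-service", "chat_completion", 900.0),
    Span("data-service", "vector_search", 2800.0),
    Span("guardrails-service", "policy_eval", 150.0),
    Span("tool-runtime", "tool_execution", 150.0),
])

culprit = trace.slowest_span()
print(f"{culprit.service}/{culprit.operation}: {culprit.duration_ms} ms")
```

In this made-up breakdown the vector search, not the model call, dominates the latency, which is exactly the kind of conclusion that is impossible to reach when each service keeps its timings in its own local logs.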