8 Pitfalls of online metrics


This chapter covers

  • Detailing pitfalls of online metrics unique to AI models
  • Illustrating ideal metric frameworks that align with model and product goals
  • Exploring technical refinements to strengthen evaluations

There’s probably a strategy to avoid getting tricked by your own metrics…right? Well, yes, of course, but it requires you to be intentional and thoughtful about it. If you’re not, it’s easy for your online A/B test evaluations to spin their wheels without giving you insights you can trust.

In Chapter 7, we discussed how to bridge offline signals into A/B test evaluations and how to set up your model for online testing in a way that reflects product strategy and stakeholder alignment. This chapter picks up from there. It may feel like a small manifesto on how not to be fooled by online metrics, but it’s not just that! We’ll also explore principles for a healthy metric framework, along with technical refinements that squeeze more out of the precious time your model spends in an online A/B test.

Hopefully, by the end, you’ll see this chapter as the capstone of Part 2’s focus on online evaluation of AI models, underscoring that online metrics aren’t enough on their own. Balanced evaluations (offline, online, human, and LLM-as-a-judge) are what make AI development truly trustworthy, which is what this book is all about!

8.1 Choosing metrics that actually matter

Almost every AI model has two key characteristics, which the next two subsections unpack: it embodies a substantial offline investment, and it operates under an implicit contract with product strategy.

8.1.1 Weak metrics waste the offline investment

8.1.2 Contract between product strategy and AI model evaluation

8.2 Feedback loops unique to AI models

8.2.1 How to combat feedback loops in evaluations
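One countermeasure worth making concrete: many teams keep a small, persistent holdout of users whose interactions never feed the model’s own training data, preserving an uncontaminated baseline to evaluate against. Below is a minimal sketch in Python; the function name, hash choice, and 5% slice are illustrative assumptions, not a prescription.

```python
import hashlib

def in_feedback_holdout(user_id: str, holdout_pct: float = 5.0) -> bool:
    """Deterministically assign ~holdout_pct% of users to a holdout whose
    interactions are excluded from the model's training data, keeping an
    uncontaminated baseline for evaluation."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    return bucket < int(holdout_pct * 100)

print(in_feedback_holdout("user_123"))  # same user always gets the same answer
```

Hashing the user ID (rather than sampling randomly at request time) matters here: the assignment is stable across sessions, so the holdout stays clean for the lifetime of the model.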

8.3 Metric gaming and model exploitation

8.3.1 Reward hacking in AI models

8.3.2 Reward hacking isn’t limited to models

8.3.3 Why this risk is sharper in AI A/B tests

8.3.4 Combating reward hacking and exploitation

8.4 The blind spots of online metrics

8.4.1 Fairness across segments

8.4.2 Robustness in edge cases

8.4.3 Qualities like trust, coherence, or novelty that resist simple measurement

8.5 The importance of the right metrics framework for AI

8.5.1 Building metric hierarchies
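To make the idea tangible, here is one way a metric hierarchy can be written down as data; the tiers, metric names, and the ship_blocking flag are hypothetical examples, not a recommended set.

```python
from dataclasses import dataclass

@dataclass
class Metric:
    name: str
    tier: str            # "north_star", "driver", or "guardrail"
    good_direction: str  # "up" or "down"
    ship_blocking: bool  # can a regression here block a launch?

# A hypothetical hierarchy for a recommendation model.
HIERARCHY = [
    Metric("30d_retention",      "north_star", "up",   ship_blocking=False),
    Metric("session_engagement", "driver",     "up",   ship_blocking=True),
    Metric("recommendation_ctr", "driver",     "up",   ship_blocking=True),
    Metric("p95_latency_ms",     "guardrail",  "down", ship_blocking=True),
    Metric("complaint_rate",     "guardrail",  "down", ship_blocking=True),
]

guardrails = [m.name for m in HIERARCHY if m.tier == "guardrail"]
print("guardrails checked on every test:", guardrails)
```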

8.5.2 Revisiting metrics as models and products evolve

8.5.3 Aligning evaluation targets with both product and model goals

8.6 Statistical refinements for AI online metrics

8.6.1 Top-line averages may misrepresent specific user groups
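A tiny synthetic example shows how this happens: a segment that is small relative to overall traffic can regress badly while the top-line average still looks healthy. The segment names and effect sizes below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-user metric deltas (treatment minus matched control),
# split by segment; 90% of traffic improves, 10% quietly regresses.
segments = {
    "power_users": rng.normal(+0.40, 1.0, size=9_000),
    "new_users":   rng.normal(-0.80, 1.0, size=1_000),
}

all_deltas = np.concatenate(list(segments.values()))
print(f"top-line average delta: {all_deltas.mean():+.2f}")  # looks healthy

for name, deltas in segments.items():
    print(f"{name:>11} delta: {deltas.mean():+.2f}")  # new_users is hurting
```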

8.6.2 Variance reduction techniques (CUPED, regression adjustment)
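As a concrete sketch of CUPED: each user’s metric is adjusted by a pre-experiment covariate, shrinking variance without shifting the treatment-vs-control means. The synthetic data and variable names below are ours; real experiment platforms wire this into their analysis pipelines.

```python
import numpy as np

def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted per-user metric values.

    metric:    post-experiment metric per user (1-D array)
    covariate: pre-experiment covariate per user, e.g. the same
               metric measured in the weeks before the test started
    """
    # theta is the OLS slope of metric on covariate; subtracting the
    # covariate's centered contribution removes the variance it explains
    # while leaving the overall mean (and the treatment effect) intact.
    theta = np.cov(metric, covariate)[0, 1] / np.var(covariate, ddof=1)
    return metric - theta * (covariate - covariate.mean())

rng = np.random.default_rng(0)
pre = rng.normal(10, 3, size=10_000)           # pre-experiment engagement
post = pre + rng.normal(0.5, 2, size=10_000)   # post metric, correlated with pre

adjusted = cuped_adjust(post, pre)
print(f"raw variance:      {post.var():.2f}")
print(f"adjusted variance: {adjusted.var():.2f}")  # noticeably smaller
```

Note that CUPED with a single pre-experiment covariate is a special case of regression adjustment; the same intuition carries over when you adjust on several covariates at once.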

8.6.3 Pitfalls of covariate choices
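The classic trap is adjusting on a post-treatment covariate: if the treatment itself moves the covariate, the adjustment absorbs part of the real effect. This fully synthetic simulation makes the bias visible.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000
treated = rng.integers(0, 2, size=n).astype(bool)

# The treatment genuinely lifts engagement by 0.5, and engagement in
# turn drives the metric -- so engagement is a POST-treatment covariate.
engagement = rng.normal(0, 1, size=n) + 0.5 * treated
metric = engagement + rng.normal(0, 1, size=n)

naive = metric[treated].mean() - metric[~treated].mean()
print(f"difference in means:        {naive:+.3f}")  # ~ +0.5, unbiased

# Adjusting on the post-treatment covariate absorbs the real effect.
theta = np.cov(metric, engagement)[0, 1] / np.var(engagement, ddof=1)
adjusted = metric - theta * (engagement - engagement.mean())
biased = adjusted[treated].mean() - adjusted[~treated].mean()
print(f"adjusted on post-treatment: {biased:+.3f}")  # ~ 0, effect erased
```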

8.7 Engineering considerations

8.7.1 Logging and attribution in ensemble or blended models
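Here is a minimal sketch of what attribution-friendly logging can look like: record each component model’s raw score and the blend weights alongside the final output, so a later metric movement can be traced to a specific sub-model rather than only to the blended result. All field and model names are hypothetical.

```python
import json
import time
import uuid

def log_prediction(request_id, user_id, component_scores, blend_weights, final_score):
    """Emit one structured record per prediction. Logging each component's
    raw score plus the blend weights lets you attribute a later metric
    movement to a specific sub-model, not just to the blended output."""
    record = {
        "ts": time.time(),
        "request_id": request_id,
        "user_id": user_id,
        "component_scores": component_scores,
        "blend_weights": blend_weights,
        "final_score": final_score,
    }
    print(json.dumps(record))  # stand-in for your real logging sink

log_prediction(
    request_id=str(uuid.uuid4()),
    user_id="user_123",
    component_scores={"ranker_v3": 0.81, "llm_rerank": 0.62},
    blend_weights={"ranker_v3": 0.7, "llm_rerank": 0.3},
    final_score=0.7 * 0.81 + 0.3 * 0.62,
)
```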

8.7.2 Monitoring for metric drift in retraining pipelines
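One simple, widely used drift check is the population stability index (PSI) between a baseline metric distribution and the distribution after a retraining run. A sketch follows; the 0.25 alert threshold is a common rule of thumb, not a universal constant.

```python
import numpy as np

def psi(baseline, current, bins=10):
    """Population Stability Index between two samples of a metric.
    Rule of thumb (illustrative): <0.1 stable, 0.1-0.25 moderate
    drift, >0.25 investigate before trusting the metric."""
    edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf          # catch out-of-range values
    b = np.histogram(baseline, edges)[0] / len(baseline)
    c = np.histogram(current, edges)[0] / len(current)
    b, c = np.clip(b, 1e-6, None), np.clip(c, 1e-6, None)  # avoid log(0)
    return float(np.sum((c - b) * np.log(c / b)))

rng = np.random.default_rng(2)
before = rng.normal(0.0, 1.0, size=50_000)  # metric under the previous model
after = rng.normal(0.3, 1.2, size=50_000)   # metric after a retraining run

score = psi(before, after)
if score > 0.25:
    print(f"ALERT: metric drift, PSI = {score:.3f}")
```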

8.7.3 Guardrails and alert fatigue

8.7.4 Guarding against variance inflation
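One common guard is winsorizing heavy-tailed metrics: a handful of extreme users (bots, whales) can inflate variance enough to drown out a real effect. The sketch below caps a synthetic revenue-like metric at its 99th percentile; the quantile choice is an assumption to tune per metric.

```python
import numpy as np

def winsorize_upper(values, upper_quantile=0.99):
    """Cap a heavy-tailed metric at an upper quantile. A few extreme
    users can inflate variance enough to mask a real treatment effect;
    capping trades a small, known bias for statistical power."""
    cap = np.quantile(values, upper_quantile)
    return np.minimum(values, cap)

rng = np.random.default_rng(3)
revenue = rng.lognormal(mean=1.0, sigma=2.0, size=100_000)  # heavy-tailed

print(f"raw variance:        {revenue.var():,.1f}")
print(f"winsorized variance: {winsorize_upper(revenue).var():,.1f}")
```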

8.8 Summary