8 Pitfalls of online metrics
This chapter covers
- Detailing pitfalls of online metrics unique to AI models
- Illustrating ideal metric frameworks that align with model and product goals
- Exploring technical refinements to strengthen evaluations
There’s probably a strategy to avoid getting tricked by your own metrics…right? Well, yes, of course, but it requires you to be intentional and thoughtful about how you choose and read them.
In Chapter 7, we discussed how to bridge offline signals into A/B test evaluations and prepare a model for online testing in a way that reflects product strategy and stakeholder alignment. This chapter picks up from there by focusing on what can go wrong once the model is online: weak proxies, feedback loops, reward hacking, segment blind spots, metric drift, and noisy treatment effects.
By the end, you should see this chapter as the capstone for Part 2’s focus on online evaluation of AI models, underscoring that online metrics aren’t enough on their own. Balanced evaluations (offline, online, human, and LLM-as-a-judge) are what make AI development truly trustworthy, which is what this book is all about!
8.1 Choosing metrics that actually matter
Almost every AI model has two key characteristics:
- It takes a long time to develop (even with an army of engineers contributing to it, the development cycle is not quick).
- It is, by nature, a dynamic entity living in a production system.