chapter six
6 Evaluating models in an A/B test
This chapter covers
- Defining the fundamentals of an A/B test
- Illustrating quirks and characteristics of evaluating AI models in an A/B test setting
- Interpreting low-signal results from online evaluations
- Defining the right time to A/B test
In practice, your AI model is only as successful as the user metrics that evaluate it. Offline metrics are valuable, and absolutely required before shipping a model to a user-facing production setting, but at the end of the day the goal is to improve the user experience. Offline evaluations can catch model degradations between versions, as well as trust and safety issues, before they reach users, but you can't know the real impact of a model without online data.
Offline evaluations can tell you whether a model looks promising before launch. An A/B test tells you what happens when that model interacts with real users, real product surfaces, real latency constraints, and real behavioral feedback loops.