chapter six

6 Evaluating models in an A/B test

 

This chapter covers

  • Defining the fundamentals of an A/B test
  • Illustrating quirks and characteristics of evaluating AI models in an A/B test setting
  • Interpreting low-signal results from online evaluations
  • Defining the right time to A/B test

In reality, your AI model is only as successful as the user metrics that evaluate it. Offline metrics are valuable and absolutely required before you ship a model to a user-facing production setting, but the end goal is to improve the user experience. Offline evaluations can catch model degradations between versions, as well as trust and safety issues, before they reach users, but you can't know a model's real impact without online data.

Offline evaluations can tell you whether a model looks promising before launch. An A/B test tells you what happens when that model interacts with real users, real product surfaces, real latency constraints, and real behavioral feedback loops.

6.1 A/B testing

6.1.1 Why A/B test

6.2 The ‘right time’ to A/B test a model

6.2.1 Model maturity

6.2.2 Timelines

6.2.3 How offline evaluations influence A/B test readiness

6.3 Designing an A/B test

6.3.1 Operational setup

6.3.2 Experimental design

6.3.3 Hypothesis and purpose

6.3.4 Model purpose & intent

6.3.5 Experiment validity checks before interpreting results

6.3.6 Movie recommendation A/B test example

6.4 Prerequisites for running A/B tests

6.4.1 Latency monitoring

6.4.2 Feature drift monitoring

6.4.3 Logging still matters

6.5 Quirks of A/B testing models

6.5.1 Multiple model variants increase complexity

6.5.2 Novelty effect

6.5.3 Model warm-up and cold start effects

6.5.4 Variance and sensitivity to noise

6.5.5 Offline and online mismatch

6.6 Interpreting ambiguous or low-signal results

6.7 Engineering considerations

6.7.1 Infrastructure-induced latency or errors

6.7.2 If you just don't have an A/B testing platform

6.8 Summary