6 Evaluating models in an A/B test


This chapter covers

  • Defining the fundamentals of an A/B test
  • Illustrating the quirks and characteristics of evaluating AI models in an A/B test setting
  • Interpreting low-signal results from online evaluations
  • Defining the right time to A/B test

What hasn’t already been said about A/B testing? It’s the conduit for innovation, for insights, for truly understanding the effect of a change on a product. A/B testing is one of the most critical steps not just in the model development lifecycle but in shipping any feature built for a user-facing product.

In Part 1 of this book, we detailed offline model evaluations, including diagnostic, performance, and counterfactual evaluations, but that’s only one part of a model evaluation strategy. It’s equally important to measure the effect of a model in an online setting, such as an A/B test. A/B testing, or running online controlled experiments, is common practice at this point.
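At its core, an A/B test readout is a statistical comparison of a metric between a control group and a treatment group. The sketch below shows a minimal two-proportion z-test of the kind an experimentation platform typically runs under the hood; the function name, user counts, and conversion numbers are hypothetical and purely illustrative.

```python
from math import sqrt, erfc

def two_proportion_ztest(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test comparing conversion rates between variants A and B."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = erfc(abs(z) / sqrt(2))  # two-sided tail probability
    return p_b - p_a, z, p_value

# Hypothetical counts: 10,000 users per arm, 1,200 vs. 1,275 conversions
lift, z, p = two_proportion_ztest(1200, 10_000, 1275, 10_000)
print(f"observed lift={lift:.4f}  z={z:.2f}  p={p:.3f}")
```

The rest of the machinery of an A/B test, from traffic splitting to guardrail metrics, exists to make a comparison like this one trustworthy.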

When you bring AI models into the picture, though, this online experimentation methodology takes on quirks and characteristics of its own. Subtle feedback loops, shifting user behavior, and high-variance metrics start to blur the clean lines of statistical testing.
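To make the variance point concrete, the sketch below uses the standard normal-approximation sample-size formula to estimate how many users per arm are needed to detect a fixed absolute lift in a proportion metric. The baseline rates, lift, and function name are assumptions chosen for illustration; the takeaway is that a noisier metric demands far more traffic for the same sensitivity.

```python
from statistics import NormalDist
from math import ceil

def required_n_per_arm(p_base, lift, alpha=0.05, power=0.8):
    """Approximate users per variant needed to detect an absolute lift
    in a proportion metric at the given significance level and power."""
    p_alt = p_base + lift
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of Bernoulli variances under the baseline and treated rates
    variance = p_base * (1 - p_base) + p_alt * (1 - p_alt)
    return ceil(((z_alpha + z_beta) ** 2 * variance) / lift ** 2)

# Detecting the same 0.5-point lift on a low- vs. high-variance metric
print(required_n_per_arm(0.05, 0.005))  # baseline near 5%: ~31k users per arm
print(required_n_per_arm(0.50, 0.005))  # baseline near 50%: ~157k users per arm
```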

This chapter unpacks the quirks, pitfalls, and practical considerations that come with evaluating AI models in a real-world, online setting.

6.1 A/B testing

6.1.1 Why A/B test

6.2 The ‘right time’ to A/B test a model

6.2.1 Model maturity

6.2.2 Timelines

6.2.3 Offline evaluations influences

6.3 Designing an A/B test

6.3.1 Operational setup

6.3.2 Experimental design

6.3.3 Hypothesis and purpose

6.3.4 Model purpose and intent

6.3.5 Movie recommendation A/B test example

6.4 Prerequisites for running A/B tests

6.4.1 Latency monitoring

6.4.2 Feature drift monitoring

6.4.3 Logging still matters

6.5 Quirks of A/B testing models

6.5.1 More variants are typical but increase testing capacity needs

6.5.2 Model warm-up and cold start effects

6.5.3 Variance and sensitivity to noise

6.5.4 Bias from training data

6.6 Interpreting ambiguous or low-signal results

6.7 Engineering considerations

6.7.1 Infrastructure-induced latency or errors

6.7.2 If you just don't have an A/B testing platform

6.8 Summary