7 From offline evaluation to live experiment
This chapter covers
- Creating a model selection playbook
- Evaluating model output in a sandbox environment before testing online
- Managing stakeholder expectations and communications
- Translating offline evaluation metrics into online watchpoints
A/B testing is part of a bigger evaluation system; it’s not an isolated tactic for understanding the effect of an AI model on a product.
In chapter 6, we discussed how to run an A/B test well, including timing, design layers, quirks, and prerequisites. In this chapter, we stay within the A/B testing scope but reframe it slightly, focusing on how you decide which models to put into an online evaluation in the first place and how you monitor during the test in a way that complements your offline learnings. We’ll revisit some concepts from chapter 6, including a model’s maturity and how it relates to the model selection playbook for an A/B test. We’ll also introduce frameworks for graduating models from offline to online tests, monitoring them with targeted metrics, and testing them internally before exposing them to users.
The goal of this chapter is to create a practical model selection pipeline that bridges the offline evaluations from part 1 with chapter 6’s online experimentation practices. Let’s start by introducing the model selection playbook.
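As a preview of where we’re headed, the sketch below shows one way such a graduation step might look in code: a candidate model only moves from offline evaluation to an online A/B test if it clears a few offline bars. The thresholds, metric names, and the `OfflineEvalResult` structure are illustrative assumptions for this sketch, not a prescribed implementation; the rest of the chapter covers how to choose these criteria for your own setting.

```python
from dataclasses import dataclass

@dataclass
class OfflineEvalResult:
    """Illustrative offline evaluation summary for one candidate model."""
    model_name: str
    quality_score: float      # e.g., win rate against the current production model
    latency_p95_ms: float     # 95th-percentile response latency measured in the sandbox
    passed_safety_review: bool

def ready_for_ab_test(result: OfflineEvalResult,
                      min_quality: float = 0.55,
                      max_latency_ms: float = 800.0) -> bool:
    """Simple graduation gate: only models that clear offline quality,
    latency, and safety bars are promoted to an online A/B test."""
    return (result.quality_score >= min_quality
            and result.latency_p95_ms <= max_latency_ms
            and result.passed_safety_review)

# Example: one candidate graduates, the other stays in offline iteration.
candidates = [
    OfflineEvalResult("model-a", quality_score=0.61, latency_p95_ms=620, passed_safety_review=True),
    OfflineEvalResult("model-b", quality_score=0.58, latency_p95_ms=950, passed_safety_review=True),
]
for candidate in candidates:
    decision = "A/B test" if ready_for_ab_test(candidate) else "keep iterating offline"
    print(candidate.model_name, "->", decision)
```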