5 Counterfactual evaluations
This chapter covers
- Introducing causal inference
- Illustrating the anatomy of a counterfactual evaluation
- Understanding the strengths and limitations of counterfactual evaluations
- Logging best practices for counterfactual evaluations
What if you could play out different scenarios without them actually happening in real life? What if you had taken that new job instead of staying at your current one? Or ordered the other dish on the menu? This kind of 'what if' thinking is essentially what counterfactual evaluations allow us to do.
With the right data, counterfactual evaluations can help you estimate what might have happened if an AI system had made a different decision. The key word here is estimate. Counterfactual evaluations are not magic replay buttons that reveal the one true alternate reality. Instead, they use logged production data, decision probabilities, observed outcomes, and a set of assumptions to estimate how a different model, ranking policy, recommendation policy, or decision strategy might have performed.
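To make those ingredients concrete, here is a minimal sketch of one common estimator, inverse propensity scoring (IPS). It is an illustration under stated assumptions rather than the method this chapter prescribes: the `LoggedDecision` record, the `ips_estimate` function, the action names, and the toy numbers are all hypothetical, and the example ignores context (features) to stay short. It also assumes the logging policy gave a nonzero probability to every action the new policy might take.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LoggedDecision:
    """One hypothetical row of logged production data."""
    action: str          # what the production system actually did
    logging_prob: float  # probability the production policy gave that action
    reward: float        # observed outcome (for example, 1.0 for a click)


def ips_estimate(
    logs: Sequence[LoggedDecision],
    target_prob: Callable[[str], float],
) -> float:
    """Estimate the average reward a different policy might have earned.

    Each logged reward is reweighted by how much more (or less) likely
    the candidate policy is to take the logged action than the
    production policy was.
    """
    if not logs:
        raise ValueError("need at least one logged decision")
    total = 0.0
    for row in logs:
        if row.logging_prob <= 0.0:
            raise ValueError("logged decision probabilities must be positive")
        weight = target_prob(row.action) / row.logging_prob
        total += weight * row.reward
    return total / len(logs)


# Toy logs (made up): production chose A or B with equal probability.
logs = [
    LoggedDecision("A", 0.5, 1.0),
    LoggedDecision("A", 0.5, 0.0),
    LoggedDecision("B", 0.5, 1.0),
    LoggedDecision("B", 0.5, 1.0),
]

# Hypothetical candidate policy that prefers B.
candidate = {"A": 0.2, "B": 0.8}

logged_average = sum(row.reward for row in logs) / len(logs)
estimate = ips_estimate(logs, lambda action: candidate[action])
print(f"observed under production policy: {logged_average:.2f}")
print(f"estimated under candidate policy: {estimate:.2f}")
```

In this toy data the candidate policy favors the action that paid off more often in the logs, so its estimated reward comes out higher than the observed average. The result is still only an estimate: it depends on the logged probabilities being accurate and on the logs covering the actions the candidate would choose.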
The last chapter detailed the engineering system performance metrics that every AI practitioner should consider before introducing a model into production, including dark-loading, latency degradation, and other key measures of system performance. Now we’re shifting focus from the systems that surround a model back to the model’s decisions by exploring another powerful methodology: counterfactual evaluations.