chapter five

5 Counterfactual evaluations

 

This chapter covers

  • Introducing causal inference
  • Illustrating the anatomy of a counterfactual evaluation
  • Understanding the strengths and limitations of counterfactual evaluations
  • Logging best practices for counterfactual evaluations

What if you could play out different scenarios without them actually happening in real life? What if you had taken that new job instead of staying at your current one? Or ordered the other dish on the menu? This kind of 'what if' thinking is essentially what counterfactual evaluations allow us to do.

With the right data, counterfactual evaluations can help you estimate what might have happened if an AI system had made a different decision. The key word here is estimate. Counterfactual evaluations are not magic replay buttons that reveal the one true alternate reality. Instead, they use logged production data, decision probabilities, observed outcomes, and a set of assumptions to estimate how a different model, ranking policy, recommendation policy, or decision strategy might have performed.
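To make those ingredients concrete, here is a minimal sketch of one common way to combine them: inverse propensity scoring. It is an illustration under stated assumptions, not the chapter's definitive method. The names `LoggedDecision`, `ips_estimate`, and `new_policy_prob` are hypothetical. The sketch assumes every logged action had a nonzero probability of being chosen and that those probabilities were recorded correctly.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class LoggedDecision:
    """One row of logged production data (hypothetical schema)."""
    context: str       # what the system knew when it decided
    action: str        # the action the production policy actually took
    propensity: float  # probability the production policy gave that action
    reward: float      # the outcome that was observed afterward


def ips_estimate(
    logs: Sequence[LoggedDecision],
    new_policy_prob: Callable[[str, str], float],
) -> float:
    """Estimate the average reward a new policy might have earned.

    Each observed reward is reweighted by how much more (or less) likely
    the new policy is to take the logged action than the production policy was.
    """
    if not logs:
        raise ValueError("need at least one logged decision")
    total = 0.0
    for row in logs:
        if row.propensity <= 0.0:
            # Assumption violated: the logged action must have been possible.
            raise ValueError("logged propensity must be greater than zero")
        weight = new_policy_prob(row.context, row.action) / row.propensity
        total += weight * row.reward
    return total / len(logs)


# Usage: production chose between "a" and "b" uniformly at random;
# the hypothetical new policy always chooses "a".
logs = [
    LoggedDecision(context="user-1", action="a", propensity=0.5, reward=1.0),
    LoggedDecision(context="user-2", action="b", propensity=0.5, reward=0.0),
]
always_a = lambda context, action: 1.0 if action == "a" else 0.0
estimate = ips_estimate(logs, always_a)
```

The result is only an estimate: with few logs, or when the new policy favors actions the production policy rarely took, the weights grow large and the number becomes unreliable.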

The last chapter detailed the engineering metrics for system performance that every AI practitioner should consider before putting a model into production, including dark-loading and latency degradation metrics. Now we’re shifting focus from the systems that surround a model back to the model’s decisions by exploring another powerful methodology: counterfactual evaluations.

5.1 What is causal inference?

5.2 Counterfactual evaluations

5.2.1 Anatomy of a counterfactual evaluation

5.2.2 The assumptions hiding underneath counterfactual evaluations

5.2.3 Why counterfactual evaluations matter

5.3 What is counterfactual logging?

5.3.1 How counterfactual logging relates to causal inference

5.3.2 How much data is “enough”?

5.3.3 Data alone is not enough

5.4 Estimating outcomes

5.4.1 Policy value

5.4.2 Off-policy evaluations

5.4.3 Designing incremental action value for your domain

5.5 Logging best practices

5.5.1 Log all possible actions

5.5.2 Record action probabilities

5.5.3 Include contextual metadata

5.5.4 Consistent and unbiased logs are the best type of logs

5.5.5 Practical example: search engine

5.6 Strengths and realities of counterfactual evaluations

5.7 Common pitfalls in counterfactual logging

5.7.1 Incomplete data logging

5.7.2 Biased propensity scores

5.7.3 Poor overlap between the logging policy and the new policy

5.7.4 Ignoring contextual variables

5.8 Policy value evaluation for movie recommendations

5.8.1 The setup: data, design, and metrics