chapter five

5 Counterfactual evaluations

This chapter covers

Introducing causal inference
Illustrating the anatomy of a counterfactual evaluation
Understanding the strengths and limitations of counterfactual evaluations
Logging best practices for counterfactual evaluations

What if you could play out different scenarios without them actually happening in real life? What if you had taken that new job instead of staying at your current one? Or ordered the other dish on the menu? This kind of 'what if' thinking is essentially what counterfactual evaluations allow us to do.

With the right data, counterfactual evaluations can help you understand what could have happened if an AI model had made a different decision. The last chapter detailed engineering system performance metrics that every AI practitioner should consider before introducing a model into a production setting. We discussed dark-loading, latency degradation metrics, and other key metrics to consider when measuring system performance. In this chapter, we’re shifting focus from the systems that surround a model back to the model itself by exploring another powerful methodology: counterfactual evaluations.

5.1 What is causal inference?

Let’s first talk about causal inference before we explore counterfactual evaluations. Causal inference is all about answering a simple yet profound question: what caused what?

5.2 Counterfactual evaluations

5.2.1 Anatomy of a counterfactual evaluation

5.2.2 Why counterfactual evaluations matter

5.3 What is counterfactual logging?

5.3.1 How counterfactual logging relates to causal inference

5.3.2 How much data is “enough”?

5.3.3 Data alone is not enough

5.4 Estimating outcomes

5.4.1 Policy value

5.4.2 Off-policy evaluations

5.4.3 Incremental action value

5.4.4 Designing incremental action value for your domain

5.5 Logging best practices

5.5.1 Log all possible actions

5.5.2 Record action probabilities

5.5.3 Include contextual metadata

5.5.4 Consistent and unbiased logs are the best types of logs

5.5.5 Practical example: search engine

5.6 Strengths and realities of counterfactual evaluations

5.7 Common pitfalls in counterfactual logging

5.7.1 Incomplete data logging

5.7.2 Biased propensity scores

5.7.3 Ignoring contextual variables

5.8 Policy value evaluation for movie recommendations

5.8.1 The setup: data, design, and metrics

5.8.2 Challenges along the way

5.8.3 Outcomes and insights

5.8.4 Making it actionable

5.9 Engineering considerations

5.9.1 Balancing system performance and logging

5.9.2 Massive amounts of data

5.10 Summary