5 Contextual bandits: Making targeted decisions


This chapter covers

  • Predicting the business-metric outcome of a decision
  • Exploring actions to reduce model bias
  • Exploring parameters to reduce model bias
  • Validating with an A/B test

Thus far we’ve conducted experiments that compared two or more versions of a system: A/B testing and multi-armed bandits evaluated arbitrary changes, and RSM optimized a small number of continuous parameters. Contextual bandits, in contrast, use experimentation to optimize many (potentially millions of) system parameters, but they can do so only for a narrowly defined type of system. Specifically, the system must consist of (1) a model that predicts the short-term, business-metric outcome of a decision and (2) a component that makes decisions based on the model’s predictions. A contextual bandit is at the heart of every personalized service you use regularly: news, social media, advertisements, music, movies, podcasts, and so on. Tuning these systems’ parameters without experimentation can lead to suboptimal results and “feedback loops” (see section 5.2.1).
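
To make that two-component structure concrete, here is a minimal Python sketch of a greedy recommender of the kind section 5.1 builds. It is an illustration under assumptions, not the chapter’s actual code: the linear model, the three-action setup, and the names weights, predict_outcome, and choose_action are all hypothetical placeholders.

import numpy as np

rng = np.random.default_rng(17)

# Component 1: a model that predicts the short-term, business-metric
# outcome of taking an action in a given context. A simple linear model
# (hypothetical) stands in for whatever model the system actually uses;
# one weight vector per action, five context features.
weights = rng.normal(size=(3, 5))

def predict_outcome(context, action):
    """Predicted business-metric outcome of `action` in `context`."""
    return weights[action] @ context

# Component 2: a decision-maker that acts on the model's predictions.
# This one is purely greedy: it always picks the action with the
# highest predicted outcome.
def choose_action(context):
    predictions = [predict_outcome(context, a) for a in range(len(weights))]
    return int(np.argmax(predictions))

context = rng.normal(size=5)  # e.g., features describing a user
print(choose_action(context))

Because this decision-maker never takes an action it predicts to be inferior, the model never observes the counterfactual outcomes of the unchosen actions; sections 5.2 and 5.3 address exactly this by adding exploration (epsilon-greedy and Thompson sampling) to the decision-making component.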

5.1 Model a business metric offline to make decisions online

5.1.1 Model the business-metric outcome of a decision

5.1.2 Add the decision-making component

5.1.3 Run and evaluate the greedy recommender

5.2 Explore actions with epsilon-greedy

5.2.1 Missing counterfactuals degrade predictions

5.2.2 Explore with epsilon-greedy to collect counterfactuals

5.3 Explore parameters with Thompson sampling

5.3.1 Create an ensemble of prediction models

5.3.2 Randomized probability matching

5.4 Validate the contextual bandit

Summary