
5 Contextual bandits: Making targeted decisions

 

This chapter covers

  • Predicting the business-metric outcome of a decision
  • Exploring decisions to reduce model bias
  • Exploring parameters to reduce model bias
  • Validating with an A/B test

Thus far we’ve conducted experiments that compared two or more different versions of a system: A/B testing and multi-armed bandits evaluated arbitrary changes, and RSM (response surface methodology) optimized a small number of continuous parameters. Contextual bandits, on the other hand, use experimentation to optimize multiple (potentially millions of) system parameters, but they can do so only for a narrowly defined type of system. Specifically, the system should consist of (i) a model that predicts the short-term, business-metric outcome of a decision and (ii) a component that makes decisions based on the model’s predictions. A contextual bandit is at the heart of any personalized service you might regularly use: news, social media, advertisements, music, movies, podcasts, etc. Tuning these systems’ parameters without experimentation can lead to suboptimal results and so-called “feedback loops” (see section 5.2.1).
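
To make the two components concrete before we build them in earnest, here is a minimal sketch, not the chapter’s actual code: it assumes a simple linear model of the business-metric outcome, and the names predict_outcome and choose_action are invented for illustration. The epsilon parameter previews the exploration idea developed in section 5.2.

import numpy as np

rng = np.random.default_rng(17)

def predict_outcome(weights, context, action_features):
    # Component (i): a model that predicts the short-term,
    # business-metric outcome (e.g., expected revenue) of taking
    # an action in a given context. Here, a toy linear model.
    x = np.concatenate([context, action_features])
    return x @ weights

def choose_action(weights, context, actions, epsilon=0.1):
    # Component (ii): a decision-maker that acts on the model's
    # predictions. With probability epsilon it explores (random
    # action); otherwise it greedily takes the best prediction.
    if rng.random() < epsilon:
        return int(rng.integers(len(actions)))
    predictions = [predict_outcome(weights, context, a) for a in actions]
    return int(np.argmax(predictions))

# Toy usage: two context features, three candidate actions with
# two features each, and arbitrary model weights.
context = np.array([0.5, -1.2])
actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([0.7, 0.7])]
weights = rng.normal(size=4)
print(choose_action(weights, context, actions))

The greedy branch alone corresponds to the recommender of section 5.1; the random branch hints at why we’ll need exploration once missing counterfactuals start degrading the model’s predictions (section 5.2).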

5.1 Model a business metric offline to make decisions online

5.1.1 Model the business-metric outcome of a decision

5.1.2 Add the decision-making component

5.1.3 Run and evaluate the greedy recommender

5.2 Explore actions with epsilon-greedy

5.2.1 Missing counterfactuals degrade predictions

5.2.2 Explore with epsilon-greedy to collect counterfactuals

5.3 Explore parameters with Thompson sampling

5.3.1 Create an ensemble of prediction models

5.3.2 Randomized probability matching

5.4 Validate the contextual bandit

5.5 Summary