5 Bandit algorithms for personalized marketing
This chapter covers
- Framing e-commerce ad campaigns as multi-armed and contextual bandit problems.
- The exploration–exploitation trade-off.
- Implementing core bandit algorithms — Epsilon-Greedy, UCB, and Thompson Sampling.
- Extending to contextual bandits for personalized discounts using LinUCB.
The greatest danger in times of turbulence is not the turbulence; it is to act with yesterday’s logic.
Peter Drucker, father of modern management
Up to this point, we’ve been working with the full Markov decision process framework. We’ve seen how to define states, actions, rewards, and transitions — and in chapter 4, we even solved a Markov decision process optimally using dynamic programming. That was the “ideal” case: we assumed we had the transition probabilities, we knew exactly how the system evolved, and we could compute the best policy with mathematical precision.
But here’s the catch: in most real business problems, that assumption collapses immediately. The transition function isn’t written down anywhere. It’s not in a database, not in a spreadsheet. It’s hidden in customer behavior, market trends, operational quirks, and a hundred other unknowns. And if we don’t know it, dynamic programming can’t help us. We need algorithms that can still learn to make good decisions without ever being told exactly how the world changes in response to our actions.
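To make that idea concrete before we dive in, here is a minimal sketch of the simplest setting this chapter builds toward: an Epsilon-Greedy agent choosing among three ad variants. The click-through rates in TRUE_CTR are made-up values used purely for simulation; the agent never sees them or any transition model, and learns only from the clicks it observes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical click-through rates for three ad variants -- hidden from the agent.
TRUE_CTR = [0.04, 0.06, 0.09]

n_arms = len(TRUE_CTR)
counts = np.zeros(n_arms)   # how many times each ad has been shown
values = np.zeros(n_arms)   # running estimate of each ad's click rate
epsilon = 0.1               # fraction of impressions spent exploring

for t in range(10_000):
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))   # explore: show a random ad
    else:
        arm = int(np.argmax(values))      # exploit: show the best ad so far

    # Simulated feedback: 1 if the customer clicks, 0 otherwise.
    reward = float(rng.random() < TRUE_CTR[arm])

    # Incremental mean update -- no model of the environment, only observed rewards.
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print("Estimated click rates:", np.round(values, 3))
print("Impressions per ad:   ", counts.astype(int))
```

The rest of the chapter replaces this simple explore-at-random rule with UCB, Thompson Sampling, and LinUCB, which explore more deliberately and, in the contextual case, tailor the choice to each customer.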