5 Bandit algorithms for personalized marketing
This chapter covers
- Framing e-commerce ad campaigns as multi-armed and contextual bandit problems.
- The exploration–exploitation trade-off.
- Implementing core bandit algorithms — Epsilon-Greedy, UCB, and Thompson Sampling.
- Extending to contextual bandits for personalized discounts using LinUCB.
The greatest danger in times of turbulence is not the turbulence; it is to act with yesterday’s logic.
Peter Drucker, father of modern management
Up to this point, we’ve been working with the full Markov decision process framework. We’ve seen how to define states, actions, rewards, and transitions — and in chapter 4, we even solved a Markov decision process optimally using dynamic programming. That was the “ideal” case: we assumed we had the transition probabilities, we knew exactly how the system evolved, and we could compute the best policy with mathematical precision.
But here’s the catch: in most real business problems, that assumption collapses immediately. The transition function isn’t written down anywhere. It’s not in a database, not in a spreadsheet. It’s hidden in customer behavior, market trends, operational quirks, and a hundred other unknowns. And if we don’t know it, dynamic programming can’t help us. We need algorithms that can still learn to make good decisions without ever being told exactly how the world changes in response to our actions.
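To make that idea concrete before we dive in, here is a minimal sketch of the simplest setting this chapter builds toward: an Epsilon-Greedy agent choosing among three ad variants. The click-through rates in TRUE_CTR are made-up values used purely for simulation; the agent never sees them or any transition model, and learns only from the clicks it observes.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical click-through rates for three ad variants -- hidden from the agent.
TRUE_CTR = [0.04, 0.06, 0.09]

n_arms = len(TRUE_CTR)
counts = np.zeros(n_arms)   # how many times each ad has been shown
values = np.zeros(n_arms)   # running estimate of each ad's click rate
epsilon = 0.1               # fraction of impressions spent exploring

for t in range(10_000):
    if rng.random() < epsilon:
        arm = int(rng.integers(n_arms))   # explore: show a random ad
    else:
        arm = int(np.argmax(values))      # exploit: show the best ad so far

    # Simulated feedback: 1 if the customer clicks, 0 otherwise.
    reward = float(rng.random() < TRUE_CTR[arm])

    # Incremental mean update -- no model of the environment, only observed rewards.
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]

print("Estimated click rates:", np.round(values, 3))
print("Impressions per ad:   ", counts.astype(int))
```

The rest of the chapter replaces this simple explore-at-random rule with UCB, Thompson Sampling, and LinUCB, which explore more deliberately and, in the contextual case, tailor the choice to each customer.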