4 Balancing the gathering and use of information

 

In this chapter

  • You will learn about the challenges of learning from evaluative feedback and how to properly balance the gathering and utilization of information.
  • You will develop exploration strategies that accumulate low levels of regret in problems with unknown transition functions and reward signals.
  • You will write code for trial-and-error learning agents that learn to optimize their behavior through their own experience in many-option, one-choice environments known as multi-armed bandits.

Uncertainty and expectation are the joys of life. Security is an insipid thing.

— William Congreve, English playwright and poet of the Restoration period, and political figure in the British Whig Party

No matter how small and unimportant a decision may seem, every decision you make is a trade-off between information gathering and information exploitation. For example, when you go to your favorite restaurant, should you order your favorite dish, yet again, or should you request that dish you’ve been meaning to try? If a Silicon Valley startup offers you a job, should you make a career move, or should you stay put in your current role?
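To make the trade-off concrete before diving into the chapter, here is a minimal sketch of an epsilon-greedy agent on a two-armed Bernoulli bandit: sometimes it explores to gather information, otherwise it exploits its current value estimates. The function name, payout probabilities, and epsilon value below are illustrative assumptions for this sketch, not values taken from the chapter.

    import numpy as np

    def epsilon_greedy_bandit(payout_probs=(0.3, 0.6), epsilon=0.1,
                              n_episodes=1000, seed=0):
        # Illustrative two-armed Bernoulli bandit; the probabilities and
        # epsilon are assumptions made for this sketch, not book values.
        rng = np.random.default_rng(seed)
        n_arms = len(payout_probs)
        Q = np.zeros(n_arms)   # running estimate of each arm's value
        N = np.zeros(n_arms)   # number of times each arm has been pulled
        total_reward = 0.0
        for _ in range(n_episodes):
            if rng.random() < epsilon:
                action = int(rng.integers(n_arms))  # explore: gather information
            else:
                action = int(np.argmax(Q))          # exploit: use what we know
            reward = float(rng.random() < payout_probs[action])  # Bernoulli payout
            N[action] += 1
            Q[action] += (reward - Q[action]) / N[action]  # incremental sample mean
            total_reward += reward
        return Q, total_reward

    Q, total = epsilon_greedy_bandit()
    print("Estimated arm values:", Q, "Total reward:", total)

With a small epsilon, the agent spends most pulls on the arm it currently believes is best, yet keeps sampling the other arm often enough for its estimates to improve; the strategies covered later in this chapter refine exactly this balance.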

The challenge of interpreting evaluative feedback

Bandits: Single-state decision problems

Regret: The cost of exploration

Approaches to solving MAB environments

Greedy: Always exploit

Random: Always explore

Epsilon-greedy: Almost always greedy and sometimes random

Decaying epsilon-greedy: First maximize exploration, then exploitation

Optimistic initialization: Start off believing it’s a wonderful world

Strategic exploration

Softmax: Select actions randomly in proportion to their estimates

UCB: It’s not about optimism, it’s about realistic optimism

Thompson sampling: Balancing reward and risk

Summary
