4 Balancing the gathering and utilization of information

 

In this chapter:

  • You learn about the challenges of learning from evaluative feedback and how to properly balance the gathering and utilization of information.
  • You develop exploration strategies that accumulate low levels of regret in problems with an unknown transition function and unknown reward signals.
  • You write code for trial-and-error learning agents that learn to optimize their behavior through their own experience in multi-option, single-choice environments known as multi-armed bandits (MABs).

Our ultimate objective is to make programs that learn from their experience as effectively as humans do.

— John McCarthy

Founder of the field of Artificial Intelligence, inventor of the Lisp programming language

No matter how small and unimportant a decision may seem, every decision you make is a tradeoff between information gathering and information exploitation. For example, when you go to your favorite restaurant, should you order your favorite dish, yet again, or should you request that dish you have been meaning to try? If a Silicon Valley startup offers you a job, should you make a career move, or should you stay put in your current role?

4.1   The challenge of interpreting evaluative feedback

4.1.1   Bandits: Single state decision problems
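
A multi-armed bandit is a decision problem with a single state: every episode is one choice among k arms, followed by one reward. As a rough illustration, here is a minimal sketch of a Bernoulli bandit environment; the class and its names are illustrative assumptions, not the chapter's actual code.

```python
import numpy as np

class BernoulliBandit:
    """A one-state environment: k arms, each pays 1 with a hidden probability."""
    def __init__(self, payout_probs, seed=0):
        self.probs = np.asarray(payout_probs)   # hidden per-arm payout chance
        self.rng = np.random.default_rng(seed)

    def step(self, action):
        # One episode is one choice: pull an arm, observe a 0/1 reward.
        return int(self.rng.random() < self.probs[action])

env = BernoulliBandit([0.3, 0.7])
print(env.step(0), env.step(1))
```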

4.1.2   Regret: The cost of exploration
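
Total regret sums, over episodes, the gap between the true value of the optimal arm and the true value of the arm actually pulled. A small hypothetical example, assuming we know the hidden arm values only for the sake of illustration:

```python
import numpy as np

true_values = np.array([0.3, 0.7])  # hidden true arm values, q*(a)
v_star = true_values.max()          # value of the optimal arm, v*
pulls = [0, 1, 1, 0, 1]             # an example sequence of choices
total_regret = sum(v_star - true_values[a] for a in pulls)
print(total_regret)                 # each pull of arm 0 costs 0.4 -> 0.8
```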

4.1.3   Approaches to solving MAB environments
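
Whatever the exploration strategy, the estimation side is typically the same: keep a running average of the rewards observed per arm. A minimal sketch of that shared update, assuming sample-average estimates stored in illustrative arrays Q (value estimates) and N (pull counts):

```python
import numpy as np

def update_estimate(Q, N, action, reward):
    # Incremental sample average: Q[a] converges to the arm's true value.
    N[action] += 1
    Q[action] += (reward - Q[action]) / N[action]

Q, N = np.zeros(2), np.zeros(2)
update_estimate(Q, N, action=1, reward=1.0)
print(Q, N)
```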

4.1.4   Greedy: Always exploit
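
A sketch of the pure-exploitation baseline, assuming Q holds the current estimates as above:

```python
import numpy as np

def greedy(Q):
    # Always exploit: pick the arm with the highest current estimate.
    return int(np.argmax(Q))

print(greedy(np.array([0.3, 0.5])))  # always 1, even if arm 0 is truly better
```

The weakness is baked in: an unlucky early sample can lock the agent onto a suboptimal arm forever, because nothing ever forces it to revisit the others.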

4.1.5   Random: Always explore
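
The opposite baseline never consults the estimates at all. A sketch, with an explicit random generator:

```python
import numpy as np

def pure_exploration(Q, rng):
    # Always explore: ignore the estimates; pick any arm uniformly.
    return int(rng.integers(len(Q)))

rng = np.random.default_rng(0)
print(pure_exploration(np.zeros(2), rng))
```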

4.1.6   Epsilon-Greedy: Almost always greedy and sometimes random
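
A sketch of the hybrid, assuming the illustrative Q and rng from the previous sketches; epsilon is the fraction of steps spent exploring:

```python
import numpy as np

def epsilon_greedy(Q, rng, epsilon=0.1):
    if rng.random() < epsilon:
        return int(rng.integers(len(Q)))  # explore: any arm, uniformly
    return int(np.argmax(Q))              # exploit: best-looking arm
```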

4.1.7   Decaying Epsilon-Greedy: First maximize exploration, then exploitation
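
One common way to decay epsilon over episodes; the schedule here (exponential decay to a floor) and its parameter values are assumptions, and the chapter may use a different schedule:

```python
def decaying_epsilon(episode, init_eps=1.0, min_eps=0.01, decay=0.99):
    # Explore heavily while estimates are poor, then taper off.
    return max(min_eps, init_eps * decay ** episode)

print([round(decaying_epsilon(e), 3) for e in (0, 100, 500)])
```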

4.1.8   Optimistic Initialization: Start off believing it's a wonderful world
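
A sketch of optimistic initialization, assuming rewards bounded in [0, 1]; the initial estimate and the pseudo-count are illustrative knobs:

```python
import numpy as np

k = 3
Q = np.full(k, 1.0)  # start every estimate at the best possible reward
N = np.ones(k)       # count the optimistic guess as one pseudo-pull
# Plain greedy selection now explores on its own: untried arms keep their
# inflated estimates, so argmax keeps visiting them until optimism fades.
action = int(np.argmax(Q))
```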

4.2   Strategic exploration

4.2.1   SoftMax: Select actions randomly in proportion to their estimates
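
A sketch of softmax action selection; the temperature parameter is an assumption (high values approach uniform exploration, low values approach greedy):

```python
import numpy as np

def softmax_action(Q, rng, temperature=1.0):
    prefs = Q / max(temperature, 1e-8)
    prefs -= prefs.max()                         # stabilize the exponentials
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return int(rng.choice(len(Q), p=probs))      # sample in proportion
```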

4.2.2   UCB: It's not just about optimism; it's about realistic optimism
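
A sketch of upper-confidence-bound selection, assuming episodes are counted from 1 and that c is an illustrative exploration weight:

```python
import numpy as np

def ucb_action(Q, N, episode, c=2.0):
    if (N == 0).any():
        return int(np.argmax(N == 0))             # try every arm at least once
    bonus = c * np.sqrt(np.log(episode) / N)      # uncertainty shrinks with N
    return int(np.argmax(Q + bonus))              # realistic optimism: value + bonus
```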

4.2.3   Thompson Sampling: Balancing reward and risk
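
A sketch of Thompson sampling for Bernoulli rewards, keeping a Beta posterior per arm; the alpha/beta bookkeeping is the standard conjugate Beta-Bernoulli update, with illustrative names:

```python
import numpy as np

def thompson_action(alpha, beta, rng):
    # Sample one plausible value per arm from its posterior; pull the best.
    return int(np.argmax(rng.beta(alpha, beta)))

alpha, beta = np.ones(2), np.ones(2)  # uniform priors over each arm's value
rng = np.random.default_rng(0)
a = thompson_action(alpha, beta, rng)
r = 1                                 # suppose the pull paid off
alpha[a] += r; beta[a] += 1 - r       # posterior update from the 0/1 reward
```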

4.3   Summary
