
4 Balancing the gathering and utilization of information

 

This chapter covers:

  • You'll learn about the challenges of learning from evaluative feedback and how to properly balance the gathering and utilization of information.
  • You'll develop exploration strategies that accumulate low levels of regret in problems with unknown transition functions and reward signals.
  • You'll code trial-and-error learning agents that learn to optimize their behavior through their own experiences in many-options, one-choice environments known as multi-armed bandits (MABs).

Our ultimate objective is to make programs that learn from their experience as effectively as humans do.

— John McCarthy
Founder of the field of artificial intelligence
Inventor of the Lisp programming language

No matter how small or unimportant it may seem, every decision you make is a tradeoff between gathering information and exploiting the information you already have. For example, when you go to your favorite restaurant, should you order your favorite dish yet again, or should you finally try that dish you've been meaning to order? If a Silicon Valley startup offers you a job, should you make the career move, or should you stay put in your current role?
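To make this tradeoff concrete before diving in, here is a minimal sketch of the kind of agent you'll build in this chapter: an epsilon-greedy learner facing a two-armed Bernoulli bandit. This is an illustrative sketch, not the chapter's own implementation; the function name epsilon_greedy, the true_probs payoff probabilities, and the hyperparameter values are assumptions made for this example. With probability epsilon the agent gathers information by pulling a random arm; otherwise it exploits the arm with the highest estimated value.

import numpy as np

def epsilon_greedy(true_probs, epsilon=0.1, n_episodes=1000, seed=42):
    """Epsilon-greedy agent on a Bernoulli multi-armed bandit (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_probs)
    Q = np.zeros(n_arms)   # estimated value of each arm
    N = np.zeros(n_arms)   # number of times each arm has been pulled
    total_reward = 0.0
    for _ in range(n_episodes):
        if rng.random() < epsilon:
            action = int(rng.integers(n_arms))   # explore: gather information
        else:
            action = int(np.argmax(Q))           # exploit: utilize information
        reward = float(rng.random() < true_probs[action])   # Bernoulli payoff
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]       # incremental mean update
        total_reward += reward
    return Q, total_reward

# Two "dishes": the familiar favorite versus the one you keep meaning to try.
estimates, total = epsilon_greedy(true_probs=[0.6, 0.8])
print(estimates, total)

Running this, the value estimates should approach the hidden payoff probabilities while most pulls go to the better arm. How to strike that balance well, and how to measure its cost as regret, is what the rest of the chapter develops.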

4.1   The challenge of interpreting evaluative feedback

4.1.1   Single state decision problem

4.1.2   Maximizing reward while minimizing regret

4.1.3   Approaches to solving MAB environments

4.1.4   Be greedy and always exploit

4.1.5   Learn forever and avoid the real world

4.1.6   Almost always pick the action with the highest value

4.1.7   First maximize exploration, then maximize exploitation

4.1.8   Start off believing it's a wonderful world

4.2   Strategic exploration

4.2.1   Select actions randomly in proportion to their estimates

4.2.2   It's not about just optimism; it's about realistic optimism

4.2.3   Balancing reward and risk

4.3   Summary