6 Improving agents' behaviors
In this chapter:
- You learn to improve policies while learning from feedback that is simultaneously sequential and evaluative.
- You develop algorithms for finding optimal policies in reinforcement learning environments when the transition and reward functions are unknown.
- You write code for agents that can go from random to optimal behavior using only their experiences and decision-making, and apply them to a variety of environments.
When it is obvious that the goals cannot be reached, don't adjust the goals, adjust the action steps.
— Confucius
Chinese teacher, editor, politician, and philosopher, of the Spring and Autumn period of Chinese history
Up until this chapter, you have studied, both in isolation and in interplay, two of the three types of feedback a reinforcement learning agent must deal with: sequential, evaluative, and sampled. In chapter 2, you learned to represent sequential decision-making problems using a mathematical framework known as Markov decision processes (MDPs). In chapter 3, you learned to solve these problems with algorithms that extract policies from MDPs. In chapter 4, you learned to solve simple control problems, called multi-armed bandits, that are multi-option, single-choice decision-making problems in which the MDP representation is not available to the agent. Finally, in chapter 5, we mixed these two types of control problems, that is, we dealt with control problems that are sequential and uncertain,