6 Improving agents' behaviors
In this chapter:
- You learn to improve policies while learning from feedback that is simultaneously sequential and evaluative.
- You develop algorithms for finding optimal policies in reinforcement learning environments when the transition and reward functions are unknown.
- You write code for agents that can go from random to optimal behavior using only their experiences and decision-making, and apply them to a variety of environments.
When it is obvious that the goals cannot be reached, don't adjust the goals, adjust the action steps.
— Confucius
Chinese teacher, editor, politician, and philosopher, of the Spring and Autumn period of Chinese history
Up until this chapter, you have studied, both in isolation and in interplay, two of the three types of feedback a reinforcement learning agent must deal with: sequential, evaluative, and sampled. In chapter 2, you learned to represent sequential decision-making problems using a mathematical framework known as Markov decision processes (MDPs). In chapter 3, you learned to solve these problems with algorithms that extract policies from MDPs. In chapter 4, you learned to solve simple control problems, called multi-armed bandits, that are multi-option, single-choice decision-making problems in which the MDP representation is not available to the agent. Finally, in chapter 5, we mixed these two types of control problems, that is, we dealt with control problems that are sequential and uncertain,