6 Improving agents' behaviors


In this chapter:

  • You learn to improve policies while learning from feedback that is simultaneously sequential and evaluative.
  • You develop algorithms for finding optimal policies in reinforcement learning environments when the transition and reward functions are unknown.
  • You write code for agents that can go from random to optimal behavior using only their experiences and decision-making, and apply them to a variety of environments; a minimal sketch of such an agent follows this list.
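To ground these goals, here is a minimal sketch, not the book's code, of the loop such agents share: behave, collect an experience sample, update an estimate, and let the improved estimates shape the next decision. The five-state walk environment and the env_step(state, action) -> (next_state, reward, done) interface are assumptions for illustration only.

    import random

    def env_step(state, action):
        # Hypothetical 5-state walk: action 1 moves right, action 0 moves
        # left; reaching state 4 ends the episode with a reward of 1.
        next_state = max(0, min(4, state + (1 if action == 1 else -1)))
        return next_state, float(next_state == 4), next_state == 4

    def simple_agent(episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
        Q = [[0.0, 0.0] for _ in range(5)]        # action-value estimates
        for _ in range(episodes):
            state, done = 0, False
            while not done:
                # 1. Gather experience: behave epsilon-greedily.
                if random.random() < epsilon:
                    action = random.randrange(2)
                else:
                    action = 0 if Q[state][0] > Q[state][1] else 1
                next_state, reward, done = env_step(state, action)
                # 2. Estimate: nudge Q toward a bootstrapped target.
                target = reward + (0.0 if done else gamma * max(Q[next_state]))
                Q[state][action] += alpha * (target - Q[state][action])
                # 3. Improve: greedy choices next time use the new Q.
                state = next_state
        return Q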

When it is obvious that the goals cannot be reached, don’t adjust the goals, adjust the action steps.

— Confucius, Chinese teacher, editor, politician, and philosopher of the Spring and Autumn period of Chinese history

6.1 The anatomy of reinforcement learning agents

6.1.1 Most agents gather experience samples

6.1.2 Most agents estimate something

6.1.3 Most agents improve a policy

6.1.4 Generalized policy iteration
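Generalized policy iteration (GPI) is the umbrella pattern behind every agent in this chapter: any interleaving of driving value estimates toward the current policy's values (evaluation) and making the policy greedy with respect to those estimates (improvement). As a reference point, here is a hedged sketch of GPI at its model-based extreme, assuming a hypothetical MDP model P[s][a] given as a list of (prob, next_state, reward, done) tuples; the agents in this chapter replace the evaluation sweep with sampled experience.

    def policy_iteration(P, gamma=0.99, theta=1e-8):
        n_states, n_actions = len(P), len(P[0])
        pi, V = [0] * n_states, [0.0] * n_states

        def q(s, a):
            # Expected one-step return of taking a in s under model P.
            return sum(p * (r + gamma * V[ns] * (not d))
                       for p, ns, r, d in P[s][a])

        while True:
            # Evaluation: sweep until V is consistent with the current policy.
            while True:
                delta = 0.0
                for s in range(n_states):
                    v = q(s, pi[s])
                    delta = max(delta, abs(v - V[s]))
                    V[s] = v
                if delta < theta:
                    break
            # Improvement: act greedily with respect to the fresh estimates.
            new_pi = [max(range(n_actions), key=lambda a: q(s, a))
                      for s in range(n_states)]
            if new_pi == pi:       # policy is stable, hence optimal for P
                return pi, V
            pi = new_pi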

6.2 Learning to improve policies of behavior

6.2.1 Monte Carlo control: Improving policies after each episode
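As the heading suggests, on-policy Monte Carlo control gathers a complete episode and only then updates each visited state-action pair toward the return that actually followed it. A minimal sketch under stated assumptions: the same hypothetical five-state walk as above, and the every-visit, constant-alpha variant.

    import random

    def env_step(state, action):                 # same hypothetical walk
        next_state = max(0, min(4, state + (1 if action == 1 else -1)))
        return next_state, float(next_state == 4), next_state == 4

    def mc_control(episodes=3000, gamma=1.0, epsilon=0.2, alpha=0.05):
        Q = [[0.0, 0.0] for _ in range(5)]
        for _ in range(episodes):
            # Gather a complete episode before learning anything from it.
            state, done, trajectory = 0, False, []
            while not done:
                if random.random() < epsilon:
                    action = random.randrange(2)
                else:
                    action = 0 if Q[state][0] > Q[state][1] else 1
                next_state, reward, done = env_step(state, action)
                trajectory.append((state, action, reward))
                state = next_state
            # Improve only at the episode boundary: back up observed returns.
            G = 0.0
            for s, a, r in reversed(trajectory):
                G = r + gamma * G                 # return that followed (s, a)
                Q[s][a] += alpha * (G - Q[s][a])  # every-visit, constant-alpha
        return Q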

6.2.2 SARSA: Improving policies after each step
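SARSA, in contrast, improves after every step: it updates Q(s, a) toward r + gamma * Q(s', a'), where a' is the action the epsilon-greedy policy actually selects next, which is what makes it on-policy. A sketch, again assuming the toy walk above:

    import random

    def env_step(state, action):                 # same hypothetical walk
        next_state = max(0, min(4, state + (1 if action == 1 else -1)))
        return next_state, float(next_state == 4), next_state == 4

    def egreedy(Q, state, epsilon):
        if random.random() < epsilon:
            return random.randrange(2)
        return 0 if Q[state][0] > Q[state][1] else 1

    def sarsa(episodes=2000, gamma=1.0, epsilon=0.1, alpha=0.1):
        Q = [[0.0, 0.0] for _ in range(5)]
        for _ in range(episodes):
            state, done = 0, False
            action = egreedy(Q, state, epsilon)
            while not done:
                next_state, reward, done = env_step(state, action)
                next_action = egreedy(Q, next_state, epsilon)
                # Target uses the action we will actually take: on-policy.
                target = reward + (0.0 if done
                                   else gamma * Q[next_state][next_action])
                Q[state][action] += alpha * (target - Q[state][action])
                state, action = next_state, next_action
        return Q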

6.3 Decoupling behavior from learning

6.3.1 Q-learning: Learning to act optimally, even if we choose not to
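A minimal sketch of the idea in this heading: the agent behaves with an exploratory policy but bootstraps its targets from the greedy action's value, so it learns about acting optimally even while choosing not to. This decoupling of behavior from learning is what makes Q-learning off-policy. Same hypothetical walk assumed:

    import random

    def env_step(state, action):                 # same hypothetical walk
        next_state = max(0, min(4, state + (1 if action == 1 else -1)))
        return next_state, float(next_state == 4), next_state == 4

    def q_learning(episodes=2000, gamma=1.0, epsilon=0.1, alpha=0.1):
        Q = [[0.0, 0.0] for _ in range(5)]
        for _ in range(episodes):
            state, done = 0, False
            while not done:
                # Behavior policy: exploratory epsilon-greedy.
                if random.random() < epsilon:
                    action = random.randrange(2)
                else:
                    action = 0 if Q[state][0] > Q[state][1] else 1
                next_state, reward, done = env_step(state, action)
                # Target policy: greedy (the max), regardless of behavior.
                target = reward + (0.0 if done else gamma * max(Q[next_state]))
                Q[state][action] += alpha * (target - Q[state][action])
                state = next_state
        return Q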

6.3.2 Double Q-learning: A max of estimates for an estimate of a max
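The subtitle names the flaw this variant addresses: using a max of noisy estimates as an estimate of the max biases values upward. A hedged sketch of the fix, with a hypothetical helper name: maintain two tables, pick the argmax with one and evaluate that action with the other, flipping their roles at random on each update.

    import random

    def double_q_update(Qa, Qb, s, a, r, ns, done, alpha=0.1, gamma=1.0):
        # Randomly choose which table to update; the other evaluates.
        if random.random() < 0.5:
            Qa, Qb = Qb, Qa
        if done:
            target = r
        else:
            # Select the argmax with one table, but evaluate it with the other.
            best = max(range(len(Qa[ns])), key=lambda act: Qa[ns][act])
            target = r + gamma * Qb[ns][best]
        Qa[s][a] += alpha * (target - Qa[s][a])

    # For acting, a common choice is to be greedy over Qa[s][a] + Qb[s][a].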

6.4 Summary
