6 Improving agents’ behaviors


In this chapter

  • You will learn about improving policies when learning from feedback that is simultaneously sequential and evaluative.
  • You will develop algorithms for finding optimal policies in reinforcement learning environments when the transition and reward functions are unknown.
  • You will write code for agents that can go from random to optimal behavior using only their experiences and decision making, and train the agents in a variety of environments.
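The agent loop previewed in the sections below — gather experience, estimate values, improve the policy — can be sketched with tabular Q-learning. Everything here is illustrative, not from the book: the toy five-state corridor environment, the `step` function, and the hyperparameter values are all assumptions made for a self-contained example.

```python
import random

# Assumed toy environment: states 0..4, actions 0 (left) and 1 (right);
# reaching state 4 yields reward 1 and ends the episode.
N_STATES, GOAL = 5, 4

def step(state, action):
    next_state = max(0, state - 1) if action == 0 else min(GOAL, state + 1)
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward, next_state == GOAL

# Tabular Q-learning: behave epsilon-greedily, learn about the greedy policy.
Q = [[0.0, 0.0] for _ in range(N_STATES)]
alpha, gamma, epsilon = 0.5, 0.99, 0.1  # illustrative hyperparameters

random.seed(0)
for episode in range(200):
    state, done = 0, False
    while not done:
        if random.random() < epsilon:
            action = random.randrange(2)             # explore
        else:
            action = Q[state].index(max(Q[state]))   # exploit
        next_state, reward, done = step(state, action)
        target = reward + (0.0 if done else gamma * max(Q[next_state]))
        Q[state][action] += alpha * (target - Q[state][action])  # TD update
        state = next_state

# After training, the greedy policy heads right toward the goal.
greedy = [row.index(max(row)) for row in Q]
print(greedy)
```

Starting from a random-looking policy, the agent ends up choosing "right" in every non-terminal state using only its own experience — the random-to-optimal trajectory the bullets above describe, here compressed into a few dozen lines.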

When it is obvious that the goals cannot be reached, don’t adjust the goals, adjust the action steps.

— Confucius, Chinese teacher, editor, politician, and philosopher of the Spring and Autumn period of Chinese history

The anatomy of reinforcement learning agents

Most agents gather experience samples

Most agents estimate something

Most agents improve a policy

Generalized policy iteration

Learning to improve policies of behavior

Monte Carlo control: Improving policies after each episode

SARSA: Improving policies after each step

Decoupling behavior from learning

Q-learning: Learning to act optimally, even if we choose not to

Double Q-learning: A max of estimates for an estimate of a max

Summary
