5 Evaluating agents' behaviors

 

In this chapter:

  • You learn about estimating policies when learning from feedback that is simultaneously sequential and evaluative.
  • You develop algorithms for evaluating policies in reinforcement learning environments when the transition and reward functions are unknown.
  • You write code for estimating the value of policies in environments in which the full reinforcement learning problem is on display; a short preview sketch of one such estimator follows this list.
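
To give a concrete flavor of that code, here is a minimal sketch of first-visit Monte Carlo prediction, one of the estimation methods developed later in this chapter. The tiny random-walk environment (states 0 through 6, terminal at both ends, a reward of 1 for reaching the right end) and the helper names step, generate_episode, and first_visit_mc are hypothetical, invented only for this preview; the chapter builds its own versions step by step.

import numpy as np

# Hypothetical random-walk environment for this preview only:
# states 0..6, start at state 3, states 0 and 6 are terminal,
# every move goes left or right with equal probability,
# and landing on state 6 pays a reward of 1.
def step(state, rng):
    next_state = state + (1 if rng.random() < 0.5 else -1)
    reward = 1.0 if next_state == 6 else 0.0
    done = next_state in (0, 6)
    return next_state, reward, done

def generate_episode(rng, start=3):
    # Roll out one full episode as a list of (state, reward) pairs.
    episode, state, done = [], start, False
    while not done:
        next_state, reward, done = step(state, rng)
        episode.append((state, reward))
        state = next_state
    return episode

def first_visit_mc(n_episodes=5000, gamma=1.0, seed=0):
    # Estimate V by averaging the returns that follow the first visit to each state.
    rng = np.random.default_rng(seed)
    V, counts = np.zeros(7), np.zeros(7)
    for _ in range(n_episodes):
        episode = generate_episode(rng)
        G, first_returns = 0.0, {}
        # Walk backward so G accumulates the return from each step onward;
        # overwriting keeps the return that follows the earliest visit to each state.
        for state, reward in reversed(episode):
            G = reward + gamma * G
            first_returns[state] = G
        for state, G in first_returns.items():
            counts[state] += 1
            V[state] += (G - V[state]) / counts[state]   # incremental mean
    return V

print(first_visit_mc())   # the true values of states 1..5 are 1/6, 2/6, ..., 5/6

Notice that nothing in this sketch queries the transition or reward functions directly; the estimates come only from sampled experience, which is exactly the constraint this chapter works under.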

I conceive that the great part of the miseries of mankind are brought upon them by false estimates they have made of the value of things.

— Benjamin Franklin, Founding Father of the United States; an author, politician, inventor, and civic activist.

You know how challenging it is to balance immediate and long-term goals. You probably experience this multiple times a day: should you watch movies tonight, or keep reading this book? One has an immediate satisfaction to it: you watch the movie, and you go from poverty to riches, from loneliness to love, from overweight to fit, and so on, all in about two hours, while eating popcorn. Reading this book, on the other hand, won’t really give you much tonight, but maybe, and only maybe, much higher satisfaction in the long term.

5.1 Learning to estimate the value of policies

5.1.1 First-visit Monte Carlo: Improving estimates after each episode

5.1.2 Every-visit Monte Carlo: A different way of handling state visits

5.1.3 Temporal-difference learning: Improving estimates after each step

5.2 Learning to estimate from multiple steps

5.2.1 N-step TD learning: Improving estimates after a couple of steps

5.2.2 Forward-view TD(λ): Improving estimates of all visited states

5.2.3 TD(λ): Improving estimates of all visited states after each step

5.3 Summary
