5 Evaluating agents’ behaviors

 

In this chapter

  • You will learn about estimating the value of policies when learning from feedback that is simultaneously sequential and evaluative.
  • You will develop algorithms for evaluating policies in reinforcement learning environments when the transition and reward functions are unknown.
  • You will write code for estimating the value of policies in environments in which the full reinforcement learning problem is on display.

I conceive that the great part of the miseries of mankind are brought upon them by false estimates they have made of the value of things.

— Benjamin Franklin, Founding Father of the United States; author, politician, inventor, and civic activist

You know how challenging it is to balance immediate and long-term goals. You probably experience this trade-off multiple times a day: should you watch a movie tonight or keep reading this book? One offers immediate satisfaction: you watch the movie, and you go from poverty to riches, from loneliness to love, from overweight to fit, and so on, in about two hours, all while eating popcorn. Reading this book, on the other hand, won't give you much tonight, but it may, and only may, provide much higher satisfaction in the long term.

  • Learning to estimate the value of policies
      • First-visit Monte Carlo: Improving estimates after each episode
      • Every-visit Monte Carlo: A different way of handling state visits
      • Temporal-difference learning: Improving estimates after each step
  • Learning to estimate from multiple steps
      • N-step TD learning: Improving estimates after a couple of steps
      • Forward-view TD(λ): Improving estimates of all visited states
      • TD(λ): Improving estimates of all visited states after each step
  • Summary
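
As a concrete preview of the first method in this outline, here is a minimal sketch of first-visit Monte Carlo prediction. It assumes a classic Gym-style environment whose discrete states are indexed 0 to n_states-1 and a policy given as a callable pi(state) -> action; the names generate_trajectory and first_visit_mc_prediction are illustrative, not from the chapter's code.

```python
import numpy as np

def generate_trajectory(pi, env, max_steps=200):
    # Roll out one episode following policy pi.
    # Returns a list of (state, reward) pairs, where reward is the
    # reward received for the transition out of that state.
    state, done, trajectory = env.reset(), False, []
    for _ in range(max_steps):
        action = pi(state)
        next_state, reward, done, _ = env.step(action)
        trajectory.append((state, reward))
        state = next_state
        if done:
            break
    return trajectory

def first_visit_mc_prediction(pi, env, n_states, gamma=0.99, n_episodes=500):
    # Estimate V(s) for policy pi without knowing the transition or
    # reward functions: average the returns observed from the first
    # visit to each state across many sampled episodes.
    V, counts = np.zeros(n_states), np.zeros(n_states)
    for _ in range(n_episodes):
        trajectory = generate_trajectory(pi, env)
        # Compute returns backward: G_t = R_{t+1} + gamma * G_{t+1}.
        G, returns = 0.0, []
        for state, reward in reversed(trajectory):
            G = reward + gamma * G
            returns.append((state, G))
        returns.reverse()  # restore time order for the first-visit check
        visited = set()
        for state, G in returns:
            if state in visited:
                continue  # every-visit MC would update here too
            visited.add(state)
            counts[state] += 1
            V[state] += (G - V[state]) / counts[state]  # incremental mean
    return V
```

Every-visit Monte Carlo differs only in dropping the visited check, and the temporal-difference methods covered later in the chapter replace the full return G with a bootstrapped one-step target so estimates can improve after each step rather than after each episode.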
