5 Evaluating agents’ behaviors

 

In this chapter

  • You will learn to estimate the value of policies when learning from feedback that is simultaneously sequential and evaluative.
  • You will develop algorithms for evaluating policies in reinforcement learning environments when the transition and reward functions are unknown.
  • You will write code for estimating the value of policies in environments in which the full reinforcement learning problem is on display.

I conceive that the great part of the miseries of mankind are brought upon them by false estimates they have made of the value of things.

— Benjamin Franklin, Founding Father of the United States; author, politician, inventor, and civic activist

You know how challenging it is to balance immediate and long-term goals. You probably experience this multiple times a day: should you watch movies tonight or keep reading this book? One option offers immediate satisfaction: you watch the movie, and you go from poverty to riches, from loneliness to love, from overweight to fit, and so on, in about two hours, all while eating popcorn. Reading this book, on the other hand, won't really give you much tonight, but maybe, and only maybe, it will provide much higher satisfaction in the long term.

Learning to estimate the value of policies

 
 
 
 

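Throughout this chapter, the goal is prediction: estimating the state-value function of a fixed policy from sampled experience alone, because the transition and reward functions are not available to the agent. Whatever method we use, the raw material is the same: the agent must act and record what happens. Below is a minimal setup sketch, a tiny hypothetical random-walk environment plus a helper that rolls out one episode under a fixed policy; every name in it (step, pi, generate_trajectory, N_STATES) is an illustrative choice of mine, not code from the chapter.

import numpy as np

# Hypothetical 7-state random walk: states 0..6; 0 and 6 are terminal,
# and only reaching state 6 pays a reward of +1.
N_STATES, START, GOAL = 7, 3, 6

def step(state, action):
    """Environment dynamics: action is -1 (left) or +1 (right)."""
    next_state = state + action
    reward = 1.0 if next_state == GOAL else 0.0
    done = next_state in (0, GOAL)
    return next_state, reward, done

def pi(state, rng):
    """The fixed policy we want to evaluate: left or right, uniformly."""
    return rng.choice((-1, 1))

def generate_trajectory(step, pi, rng, start=START):
    """Roll out one episode; return a list of (state, reward) pairs,
    where reward is the reward received for leaving that state."""
    trajectory, state, done = [], start, False
    while not done:
        action = pi(state, rng)
        next_state, reward, done = step(state, action)
        trajectory.append((state, reward))
        state = next_state
    return trajectory

Every method in this chapter consumes this kind of experience; they differ only in how they turn it into value estimates.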
First-visit Monte Carlo: Improving estimates after each episode

 
 
 

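First-visit Monte Carlo waits until an episode terminates, computes the return that followed the first visit to each state, and averages those returns across episodes; as visits accumulate, the average converges to the state's true value under the policy. A minimal sketch, assuming the random-walk helpers from the earlier listing:

import numpy as np

def first_visit_mc(step, pi, gamma=0.99, n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    returns_sum, returns_cnt = np.zeros(N_STATES), np.zeros(N_STATES)
    for _ in range(n_episodes):
        G, first_return = 0.0, {}
        # Sweep the episode backward: G is the discounted return from t on.
        for state, reward in reversed(generate_trajectory(step, pi, rng)):
            G = reward + gamma * G
            first_return[state] = G  # overwritten at each earlier visit,
                                     # so the first visit's return wins
        for state, G in first_return.items():
            returns_sum[state] += G
            returns_cnt[state] += 1
    return returns_sum / np.maximum(returns_cnt, 1.0)

Sweeping backward computes every return in a single pass; because the dictionary entry is overwritten at each earlier visit, the value left at the end is exactly the return following the first visit.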
Every-visit Monte Carlo: A different way of handling state visits

 
 

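Every-visit Monte Carlo differs in one piece of bookkeeping: the return following every occurrence of a state contributes to that state's average, not just the first one. On finite samples the estimates are biased, but they remain consistent. Again assuming the earlier helpers:

import numpy as np

def every_visit_mc(step, pi, gamma=0.99, n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    returns_sum, returns_cnt = np.zeros(N_STATES), np.zeros(N_STATES)
    for _ in range(n_episodes):
        G = 0.0
        for state, reward in reversed(generate_trajectory(step, pi, rng)):
            G = reward + gamma * G
            returns_sum[state] += G   # every occurrence contributes
            returns_cnt[state] += 1
    return returns_sum / np.maximum(returns_cnt, 1.0)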
Temporal-difference learning: Improving estimates after each step

 
 

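Temporal-difference learning doesn't wait for the episode to end. After every single step, it forms a target from the observed reward plus the discounted current estimate of the next state (it bootstraps) and nudges the estimate of the current state toward that target. A sketch of TD(0), under the same assumptions as before:

import numpy as np

def td0(step, pi, gamma=0.99, alpha=0.05, n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(N_STATES)
    for _ in range(n_episodes):
        state, done = START, False
        while not done:
            action = pi(state, rng)
            next_state, reward, done = step(state, action)
            # The TD target bootstraps on the current estimate of the
            # next state; terminal states are worth zero by definition.
            target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (target - V[state])
            state = next_state
    return V

Because the target leans on estimates that are themselves still inaccurate, TD trades the high variance of Monte Carlo targets for bias; in exchange, it can learn from incomplete episodes.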
Learning to estimate from multiple steps

 
 

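Monte Carlo methods use the complete return as a target; TD(0) uses a single real reward and then bootstraps. Between those two extremes sits a whole family of methods that use n real rewards before bootstrapping. In standard notation (which may differ from the book's), the n-step target is

G_{t:t+n} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^{n} V(S_{t+n})

With n = 1 this is exactly the TD(0) target, and once n reaches the end of the episode it becomes the Monte Carlo return.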
N-step TD learning: Improving estimates after a couple of steps

 

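n-step TD collects n real rewards and then bootstraps on the value of the state where the window ends; if the episode finishes inside the window, the target is simply the remaining return. The sketch below stores the whole episode before updating, which keeps the indexing easy to read (a fully online version would keep only a sliding window of the last n transitions); assumptions as before:

import numpy as np

def n_step_td(step, pi, n=3, gamma=0.99, alpha=0.05,
              n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(N_STATES)
    for _ in range(n_episodes):
        trajectory = generate_trajectory(step, pi, rng)
        states = [s for s, _ in trajectory]
        rewards = [r for _, r in trajectory]
        T = len(trajectory)
        for t in range(T):
            horizon = min(t + n, T)
            # Up to n real rewards...
            target = sum(gamma ** (k - t) * rewards[k]
                         for k in range(t, horizon))
            # ...then bootstrap, unless the episode ended in the window.
            if horizon < T:
                target += gamma ** n * V[states[horizon]]
            V[states[t]] += alpha * (target - V[states[t]])
    return V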
Forward-view TD(λ): Improving estimates of all visited states

 
 
 
 

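The forward view removes the need to commit to a single n: it blends all n-step returns into one λ-return, giving the n-step return the weight (1 - λ)λ^(n-1) and letting the leftover weight fall on the full Monte Carlo return. Like Monte Carlo, it can only be computed once the episode has finished. A sketch, same assumptions as the earlier listings:

import numpy as np

def lambda_return(rewards, states, V, t, lam=0.8, gamma=0.99):
    """Mix all n-step returns from time t; assumes a finished episode."""
    T = len(rewards)
    target, n_step = 0.0, 0.0
    for n in range(1, T - t + 1):
        n_step += gamma ** (n - 1) * rewards[t + n - 1]  # add reward R_{t+n}
        if t + n < T:
            # Bootstrapped n-step return, weighted (1 - lam) * lam^(n-1).
            target += (1 - lam) * lam ** (n - 1) * (
                n_step + gamma ** n * V[states[t + n]])
        else:
            # The remaining weight, lam^(T-t-1), goes to the full MC return.
            target += lam ** (n - 1) * n_step
    return target

def forward_td_lambda(step, pi, lam=0.8, gamma=0.99, alpha=0.05,
                      n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(N_STATES)
    for _ in range(n_episodes):
        trajectory = generate_trajectory(step, pi, rng)
        states = [s for s, _ in trajectory]
        rewards = [r for _, r in trajectory]
        for t in range(len(trajectory)):
            target = lambda_return(rewards, states, V, t, lam, gamma)
            V[states[t]] += alpha * (target - V[states[t]])
    return V

With λ = 0 the target collapses to the TD(0) target; with λ = 1 it becomes the Monte Carlo return.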
TD(λ): Improving estimates of all visited states after each step

 
 

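The backward view achieves the same blending online, one step at a time. Each state carries an eligibility trace: the trace of the state just visited is bumped, all traces decay by γλ every step, and each one-step TD error is broadcast to every state in proportion to its trace. A sketch with accumulating traces, same assumptions as before:

import numpy as np

def td_lambda(step, pi, lam=0.8, gamma=0.99, alpha=0.05,
              n_episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    V = np.zeros(N_STATES)
    for _ in range(n_episodes):
        E = np.zeros(N_STATES)        # eligibility traces, reset per episode
        state, done = START, False
        while not done:
            action = pi(state, rng)
            next_state, reward, done = step(state, action)
            td_error = reward + gamma * V[next_state] * (not done) - V[state]
            E[state] += 1.0           # accumulating trace for this visit
            V += alpha * td_error * E # broadcast the error to all traces
            E *= gamma * lam          # every trace decays each step
            state = next_state
    return V

Recently and frequently visited states have the largest traces, so they absorb most of each error; states visited long ago barely move.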
Summary

 