11 Policy-gradient and actor-critic methods

 

In this chapter

  • You will learn about a family of deep reinforcement learning methods that can optimize their performance directly, without the need for value functions.
  • You will learn how to use value functions to make these algorithms even better.
  • You will implement deep reinforcement learning algorithms that use multiple processes at once for very fast learning.

There is no better teacher than adversity. Every defeat, every heartbreak, every loss, contains its own seed, its own lesson on how to improve your performance the next time.

— Malcolm X, American Muslim minister and human rights activist

In this book, we’ve explored methods that can find optimal and near-optimal policies with the help of value functions. However, all of those algorithms learn value functions as an intermediate step, when what we ultimately need are policies.

In this chapter, we explore the other side of the spectrum, and what’s in the middle. We start with methods that optimize policies directly. These methods, referred to as policy-based or policy-gradient methods, parameterize a policy and adjust its parameters to maximize expected returns. Later in the chapter, we combine policy learning with learned value functions, arriving at the actor-critic methods that give this chapter its name.
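To make the idea concrete: the gradient of the expected return with respect to the policy parameters can be estimated from sampled experience by pushing up the log-probability of each action in proportion to the return that followed it. Below is a minimal sketch of such an update in PyTorch; the state and action dimensions, network architecture, and learning rate are illustrative assumptions, not the implementation developed later in the chapter.

import torch
import torch.nn as nn
from torch.distributions import Categorical

# Illustrative sizes: 4 state variables, 2 discrete actions (assumed for this sketch)
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=0.001)

def policy_gradient_update(states, actions, returns):
    # states: (T, 4) float tensor; actions: (T,) long tensor;
    # returns: (T,) float tensor of the return observed after each action
    logits = policy(states)
    log_probs = Categorical(logits=logits).log_prob(actions)
    loss = -(log_probs * returns).mean()  # gradient ascent on expected return
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Subtracting a baseline from the returns leaves this gradient estimate unbiased while reducing its variance, which is the role the learned value functions play in the VPG, A3C, GAE, and A2C sections that follow.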

REINFORCE: Outcome-based policy learning

 
 
 

Introduction to policy-gradient methods

 
 
 

Advantages of policy-gradient methods

 
 
 

Learning policies directly

 
 

Reducing the variance of the policy gradient

 

VPG: Learning a value function

 
 

Further reducing the variance of the policy gradient

 
 

Learning a value function

 
 
 

Encouraging exploration

 
 

A3C: Parallel policy updates

 
 

Using actor-workers

 
 
 

Using n-step estimates

 
 
 

Non-blocking model updates

 
 

GAE: Robust advantage estimation

 
 
 

Generalized advantage estimation

 
 

A2C: Synchronous policy updates

 
 
 
 

Weight-sharing model

 
 

Restoring order in policy updates

 
 
 

Summary

 