11 Policy-gradient and actor-critic methods

 

In this chapter

  • You will learn about a family of deep reinforcement learning methods that optimize policies directly, without the need for value functions.
  • You will learn how to use value functions to make these algorithms even better.
  • You will implement deep reinforcement learning algorithms that use multiple processes at once for very fast learning.

There is no better than adversity. Every defeat, every heartbreak, every loss, contains its own seed, its own lesson on how to improve your performance the next time.

— Malcolm X, American Muslim minister and human rights activist

In this book, we’ve explored methods that can find optimal and near-optimal policies with the help of value functions. However, all of those algorithms learn value functions when what we need are policies.

In this chapter, we explore the other side of the spectrum and what’s in the middle. We start exploring methods that optimize policies directly. These methods, referred to as policy-based or policy-gradient methods, parameterize a policy and adjust it to maximize expected returns.
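To make the idea of a parameterized policy concrete before diving into the chapter, here is a minimal sketch, not the book's implementation: a small network maps states to action probabilities, and its weights are nudged to increase the log-probability of actions in proportion to the returns that followed them. The layer sizes, learning rate, and the random stand-in episode data are assumptions purely for illustration.

```python
# Minimal sketch of a parameterized policy and a REINFORCE-style update.
# Shapes, hyperparameters, and the fake episode data are illustrative only.
import torch
import torch.nn as nn

class DiscretePolicy(nn.Module):
    """Softmax policy over a discrete action space (hypothetical sizes)."""
    def __init__(self, obs_dim=4, n_actions=2, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions))

    def forward(self, obs):
        # Return a distribution over actions; sampling from it gives behavior.
        return torch.distributions.Categorical(logits=self.net(obs))

policy = DiscretePolicy()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One gradient step from a single (simulated) episode:
obs = torch.randn(10, 4)      # stand-in for 10 observed states
dist = policy(obs)
actions = dist.sample()       # actions the policy would have taken
returns = torch.randn(10)     # stand-in for the returns that followed

# Weight each action's log-probability by its return; minimizing the
# negative of this pushes the policy toward higher expected return.
loss = -(dist.log_prob(actions) * returns).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

The chapter builds up the real versions of this idea step by step, starting with REINFORCE and then reducing the variance of the gradient estimate with value functions and other techniques.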

REINFORCE: Outcome-based policy learning

Introduction to policy-gradient methods

Advantages of policy-gradient methods

Learning policies directly

Reducing the variance of the policy gradient

VPG: Learning a value function

Further reducing the variance of the policy gradient

Learning a value function

Encouraging exploration

A3C: Parallel policy updates

Using actor-workers

Using n-step estimates

Non-blocking model updates

GAE: Robust advantage estimation

Generalized advantage estimation

A2C: Synchronous policy updates

Weight-sharing model

Restoring order in policy updates

Summary
