11 Policy-gradient and actor-critic methods

 

In this chapter:

  • You learn about a family of deep reinforcement learning methods that can optimize policy performance directly, without the need for value functions.
  • You learn how to use value functions to make these algorithms even better.
  • You implement deep reinforcement learning algorithms that use multiple processes at once for very fast learning.

There is no better than adversity. Every defeat, every heartbreak, every loss, contains its own seed, its own lesson on how to improve your performance the next time.

— Malcolm X, American Muslim minister and human rights activist

So far, in this book, we have explored methods that can find optimal and near-optimal policies with the help of value functions. However, all of those algorithms learn value functions when what we need are policies.

In this chapter, we explore the other side of the spectrum, along with what lies in the middle. We start with methods that optimize policies directly. These methods, referred to as policy-based or policy-gradient methods, parameterize a policy and adjust it to maximize expected returns. We then move toward actor-critic methods, which combine a learned policy with a learned value function.
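To make the idea concrete, here is a minimal sketch (not the book's code) of a parameterized softmax policy and a REINFORCE-style update in PyTorch. The network shape, the PolicyNet name, and the placeholder trajectory data are illustrative assumptions only.

import torch
import torch.nn as nn

class PolicyNet(nn.Module):
    """A small network that maps observations to action preferences (logits)."""
    def __init__(self, obs_dim, n_actions):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64), nn.ReLU(),
            nn.Linear(64, n_actions))

    def forward(self, obs):
        return self.net(obs)

policy = PolicyNet(obs_dim=4, n_actions=2)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

# Placeholder data standing in for one collected episode:
# states, the actions taken, and the returns observed after each action.
states = torch.randn(10, 4)
actions = torch.randint(0, 2, (10,))
returns = torch.randn(10)

# Policy-gradient step: increase the log-probability of actions
# in proportion to the returns that followed them.
logits = policy(states)
log_probs = torch.distributions.Categorical(logits=logits).log_prob(actions)
loss = -(log_probs * returns).mean()  # negate so minimizing ascends the objective

optimizer.zero_grad()
loss.backward()
optimizer.step()

The rest of the chapter refines this basic recipe: reducing the variance of the gradient estimate, adding a learned value function, and running updates in parallel.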

11.1 REINFORCE: Outcome-based policy learning

11.1.1 Introduction to policy-gradient methods

11.1.2 Advantages of policy-gradient methods

11.1.3 Learning policies directly

11.1.4 Reducing the variance of the policy gradient

11.2 VPG: Learning a value function

11.2.1 Further reducing the variance of the policy gradient

11.2.2 Learning a value function

11.2.3 Encouraging exploration

11.3 A3C: Parallel policy updates

11.3.1 Using actor-workers

11.3.2 Using n-step estimates

11.3.3 Non-blocking model updates

11.4 GAE: Robust advantage estimation

11.4.1 Generalized advantage estimation

11.5 A2C: Synchronous policy updates

11.5.1 Weight-sharing model

11.5.2 Restoring order in policy updates

11.6 Summary
