12 Advanced actor-critic methods

 

In this chapter

  • You will learn about more advanced deep reinforcement learning methods, which are, to this day, the state-of-the-art algorithmic advancements in deep reinforcement learning.
  • You will learn about solving a variety of deep reinforcement learning problems, from problems with continuous action spaces to problems with high-dimensional action spaces.
  • You will build state-of-the-art actor-critic methods from scratch and open the door to understanding more advanced concepts related to artificial general intelligence.

Criticism may not be agreeable, but it is necessary. It fulfills the same function as pain in the human body. It calls attention to an unhealthy state of things.

— Winston Churchill, British politician, army officer, writer, and Prime Minister of the United Kingdom

In the last chapter, you learned about a different, more direct technique for solving deep reinforcement learning problems. You were first introduced to policy-gradient methods, in which agents learn policies by approximating them directly. In pure policy-gradient methods, we don’t use value functions as a proxy for finding policies; in fact, we don’t use value functions at all. We instead learn stochastic policies directly.
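
To make that distinction concrete before moving on, here is a minimal, illustrative sketch (not code from this or the previous chapter) of what learning a stochastic policy directly looks like in PyTorch: a small network maps an observation to a distribution over actions, an action is sampled from that distribution, and the log-probability of the sampled action is what a policy-gradient loss would weight by a return estimate. The class name StochasticPolicy and the toy dimensions are assumptions made only for this example.

import torch
import torch.nn as nn
from torch.distributions import Categorical

class StochasticPolicy(nn.Module):
    """Tiny policy network: observation -> distribution over discrete actions."""
    def __init__(self, obs_dim, n_actions, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_actions))

    def forward(self, obs):
        # The network's outputs are treated as logits of a categorical
        # distribution; the policy itself is stochastic.
        return Categorical(logits=self.net(obs))

policy = StochasticPolicy(obs_dim=4, n_actions=2)   # toy sizes, assumed for illustration
obs = torch.randn(4)                 # a dummy observation
dist = policy(obs)
action = dist.sample()               # actions are sampled, not taken greedily
log_prob = dist.log_prob(action)     # log pi(a|s), used directly in the policy loss
loss = -log_prob * 1.0               # 1.0 stands in for a return estimate
loss.backward()                      # gradients flow into the policy directly

Note that no value function appears anywhere in this sketch; the methods in this chapter reintroduce value functions as critics, alongside policies that may be deterministic (DDPG, TD3) or stochastic (SAC, PPO).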

• DDPG: Approximating a deterministic policy
  • DDPG uses many tricks from DQN
  • Learning a deterministic policy
  • Exploration with deterministic policies
• TD3: State-of-the-art improvements over DDPG
  • Double learning in DDPG
  • Smoothing the targets used for policy updates
  • Delaying updates
• SAC: Maximizing the expected return and entropy
  • Adding the entropy to the Bellman equations
  • Learning the action-value function
  • Learning the policy
  • Automatically tuning the entropy coefficient
• PPO: Restricting optimization steps
  • Using the same actor-critic architecture as A2C
  • Batching experiences
  • Clipping the policy updates
  • Clipping the value function updates
• Summary
