12 Advanced actor-critic methods

 

In this chapter:

  • You learn about more advanced deep reinforcement learning methods that are, to this day, state-of-the-art algorithmic advances in deep reinforcement learning.
  • You learn about solving a variety of deep reinforcement learning problems, from problems with continuous action spaces to problems with high-dimensional action spaces.
  • You build state-of-the-art actor-critic methods from scratch and open the door to understanding more advanced concepts related to artificial general intelligence.

Criticism may not be agreeable, but it is necessary. It fulfills the same function as pain in the human body. It calls attention to an unhealthy state of things.

— Winston Churchill, British politician, army officer, writer, and Prime Minister of the United Kingdom

In the last chapter, you learned about a different, more direct technique for solving deep reinforcement learning problems. You were first introduced to policy-gradient methods, in which agents learn policies by approximating them directly. In pure policy-gradient methods, we do not use value functions as a proxy for finding policies; in fact, we do not use value functions at all. Instead, we learn stochastic policies directly.
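To make that recap concrete, here is a minimal, hypothetical sketch of the idea in PyTorch. This is not the book's code; the network sizes, environment dimensions, and learning rate are illustrative assumptions. It shows a stochastic policy trained directly from sampled returns, REINFORCE-style, with no value function anywhere.

import torch
import torch.nn as nn
from torch.distributions import Categorical

# A small policy network mapping states to action logits
# (dimensions chosen arbitrarily for a CartPole-like task).
policy = nn.Sequential(nn.Linear(4, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def reinforce_update(states, actions, returns):
    # One policy-gradient step: increase log pi(a|s) in proportion
    # to the return G_t observed after taking that action.
    logits = policy(states)                  # shape (T, num_actions)
    dist = Categorical(logits=logits)        # the stochastic policy
    loss = -(dist.log_prob(actions) * returns).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Contrast this with the methods in this chapter, which bring value functions back in as critics that evaluate the actions the actor selects.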

12.1 DDPG: Approximating a deterministic policy

12.1.1 DDPG uses lots of tricks from DQN

12.1.2 Learning a deterministic policy

12.1.3 Exploration with deterministic policies

12.2 TD3: State-of-the-art improvements over DDPG

12.2.1 Double learning in DDPG

12.2.2 Smoothing the targets used for policy updates

12.2.3 Delaying updates

12.3 SAC: Maximizing the expected return and entropy

12.3.1 Adding the entropy to the Bellman equations

12.3.2 Learning the action-value function

12.3.3 Learning the policy

12.3.4 Automatically tuning the entropy coefficient

12.4 PPO: Restricting optimization steps

12.4.1 Using the same actor-critic architecture as A2C

12.4.2 Batching experiences

12.4.3 Clipping the policy updates

12.4.4 Clipping the value function updates

12.5 Summary
