10 Sample-efficient value-based methods

 

In this chapter:

  • You implement a deep neural network architecture that exploits some of the nuances that exist in value-based deep reinforcement learning methods.
  • You create a replay buffer that prioritizes experiences by how surprising they are (a sketch of this idea follows this list).
  • You build an agent that trains to a near-optimal policy in fewer episodes than all previous value-based deep reinforcement learning agents.

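Below is a minimal sketch of the kind of prioritized buffer this chapter builds, assuming proportional prioritization with the absolute TD error (plus a small epsilon) as the priority; the names PrioritizedReplayBuffer, alpha, beta, and eps are illustrative, not the chapter's exact implementation, which is developed in section 10.2.

```python
# A minimal sketch of proportional prioritized replay. Priorities are assumed
# to be |TD error| + eps; class and parameter names are illustrative.
import numpy as np

class PrioritizedReplayBuffer:
    def __init__(self, capacity, alpha=0.6, eps=1e-6):
        self.capacity, self.alpha, self.eps = capacity, alpha, eps
        self.buffer, self.priorities, self.pos = [], np.zeros(capacity), 0

    def store(self, experience):
        # New samples get the current maximum priority so they are replayed at least once.
        max_prio = self.priorities.max() if self.buffer else 1.0
        if len(self.buffer) < self.capacity:
            self.buffer.append(experience)
        else:
            self.buffer[self.pos] = experience
        self.priorities[self.pos] = max_prio
        self.pos = (self.pos + 1) % self.capacity

    def sample(self, batch_size, beta=0.4):
        # Sample experiences with probability proportional to priority^alpha.
        prios = self.priorities[:len(self.buffer)]
        probs = prios ** self.alpha
        probs /= probs.sum()
        idxs = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights correct the bias of non-uniform sampling.
        weights = (len(self.buffer) * probs[idxs]) ** -beta
        weights /= weights.max()
        return [self.buffer[i] for i in idxs], idxs, weights

    def update_priorities(self, idxs, td_errors):
        # After a learning step, refresh priorities with the new TD errors.
        self.priorities[idxs] = np.abs(td_errors) + self.eps
```

Raising priorities to the power alpha interpolates between uniform sampling (alpha = 0) and fully greedy prioritization (alpha = 1), and the importance-sampling weights returned by sample compensate for the bias that non-uniform sampling introduces, an issue addressed in section 10.2.7.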
Intelligence is based on how efficient a species became at doing the things they need to survive.

— Charles Darwin, English naturalist, geologist, and biologist, best known for his contributions to the science of evolution.

In the previous chapter, we improved on NFQ with the implementation of DQN and DDQN. In this chapter, we continue this line of improvement by presenting two additional techniques for improving value-based deep reinforcement learning methods. This time, though, the improvements are not so much about stability, although that can be a by-product; rather, the techniques presented in this chapter improve the sample efficiency of DQN and other value-based DRL methods.

First, we introduce a neural network architecture designed with reinforcement learning in mind: it splits the Q-function representation into two streams. One stream approximates the V-function, and the other approximates the A-function. The V-function assigns a single value to each state, while the A-function, the advantage, expresses how much better or worse each action is than that state's value.
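As a preview, the following is a minimal sketch of such a two-stream architecture in PyTorch, assuming a fully connected network over vector observations; the class and parameter names (DuelingQNetwork, hidden_dim) are illustrative rather than the chapter's exact code, which is built step by step in sections 10.1.4 through 10.1.6.

```python
# A minimal sketch of a dueling Q-network; layer sizes and names are assumptions.
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        # Shared feature layers feed both streams.
        self.features = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        # Value stream: one scalar V(s) per state.
        self.value = nn.Linear(hidden_dim, 1)
        # Advantage stream: one A(s, a) per action.
        self.advantage = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = self.features(state)
        v = self.value(x)      # shape: (batch, 1)
        a = self.advantage(x)  # shape: (batch, action_dim)
        # Reconstruct Q by centering the advantages:
        # Q(s, a) = V(s) + A(s, a) - mean_a A(s, a).
        return v + a - a.mean(dim=1, keepdim=True)
```

Subtracting the mean advantage when reconstructing Q addresses the fact that V and A cannot be uniquely recovered from Q alone, which is the identifiability issue section 10.1.6 deals with.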

10.1 Dueling DDQN: A reinforcement-learning-aware neural network architecture

10.1.1 Reinforcement learning is not a supervised learning problem

10.1.2 Nuances of value-based deep reinforcement learning methods

10.1.3 Advantage of using advantages

10.1.4 A reinforcement-learning-aware architecture

10.1.5 Building a dueling network

10.1.6 Reconstructing the action-value function

10.1.7 Continuously updating the target network

10.1.8 What does the dueling network bring to the table?

10.2 PER: Prioritizing the replay of meaningful experiences

10.2.1 A smarter way to replay experiences

10.2.2 Then, what is a good measure of “important” experiences?

10.2.3 Greedy prioritization by TD error

10.2.4 Sampling prioritized experiences stochastically

10.2.5 Proportional prioritization

10.2.6 Rank-based prioritization

10.2.7 Prioritization bias

10.3 Summary
