10 Sample-efficient value-based methods

 

In this chapter

  • You will implement a deep neural network architecture that exploits some of the nuances that exist in value-based deep reinforcement learning methods (a brief sketch of this kind of architecture follows this list).
  • You will create a replay buffer that prioritizes experiences by how surprising they are.
  • You will build an agent that trains to a near-optimal policy in fewer episodes than any of the value-based deep reinforcement learning agents we’ve discussed.
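As a preview of the first item, here is a minimal sketch of a dueling Q-network head in PyTorch. The class name, layer sizes, and activation choices are illustrative assumptions rather than the chapter’s implementation; what it demonstrates is the core idea of splitting the network into a state-value stream and an advantage stream and recombining them into action values:

import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Illustrative dueling architecture: a shared trunk feeds a
    state-value stream V(s) and an advantage stream A(s, a), which are
    recombined as Q(s, a) = V(s) + A(s, a) - mean_a A(s, a)."""
    def __init__(self, state_dim, num_actions, hidden_dim=128):
        super().__init__()
        # Shared feature trunk; sizes are placeholders, not the book's.
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
        )
        self.value_stream = nn.Linear(hidden_dim, 1)                 # V(s)
        self.advantage_stream = nn.Linear(hidden_dim, num_actions)   # A(s, a)

    def forward(self, state):
        features = self.trunk(state)
        value = self.value_stream(features)            # shape (batch, 1)
        advantages = self.advantage_stream(features)   # shape (batch, |A|)
        # Subtract the mean advantage so V and A remain identifiable.
        return value + advantages - advantages.mean(dim=1, keepdim=True)

Subtracting the mean advantage before recombining is one common way to keep the two streams identifiable; the sections on building the dueling network and reconstructing the action-value function cover this in detail.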

Intelligence is based on how efficient a species became at doing the things they need to survive.

— Charles Darwin, English naturalist, geologist, and biologist, best known for his contributions to the science of evolution

In the previous chapter, we improved on NFQ with the implementation of DQN and DDQN. In this chapter, we continue that line of improvements by presenting two additional techniques for improving value-based deep reinforcement learning methods. This time, though, the improvements aren’t so much about stability, although that can easily be a by-product; rather, the techniques presented in this chapter improve the sample efficiency of DQN and other value-based DRL methods.
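To make the second of these techniques concrete before we get into it, the sketch below shows one way a proportionally prioritized replay buffer could look, assuming the absolute TD error is used as the priority signal. The class name, the alpha, beta, and eps values, and the plain-list storage are illustrative assumptions rather than the chapter’s implementation; the sections on proportional and rank-based prioritization and on prioritization bias develop the full approach:

import numpy as np

class PrioritizedReplayBuffer:
    """Illustrative proportional prioritized replay: experiences are
    sampled with probability proportional to priority**alpha, and
    importance-sampling weights (scaled by beta) correct the bias that
    non-uniform sampling introduces."""
    def __init__(self, capacity=10000, alpha=0.6, beta=0.1, eps=1e-6):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.buffer, self.priorities = [], []

    def store(self, experience):
        # New experiences get the current maximum priority so they are
        # guaranteed to be replayed at least once.
        max_priority = max(self.priorities, default=1.0)
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.priorities.pop(0)
        self.buffer.append(experience)
        self.priorities.append(max_priority)

    def sample(self, batch_size):
        scaled = np.array(self.priorities) ** self.alpha
        probs = scaled / scaled.sum()
        idxs = np.random.choice(len(self.buffer), batch_size, p=probs)
        # Importance-sampling weights, normalized by their maximum.
        weights = (len(self.buffer) * probs[idxs]) ** -self.beta
        weights /= weights.max()
        return idxs, [self.buffer[i] for i in idxs], weights

    def update_priorities(self, idxs, td_errors):
        # Called after a learning step with the freshly computed TD errors.
        for i, td_error in zip(idxs, td_errors):
            self.priorities[i] = abs(td_error) + self.eps

Storing new experiences with the maximum priority seen so far is a design choice in this sketch: it ensures every experience is replayed at least once before its priority is set from an actual TD error.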

Dueling DDQN: A reinforcement-learning-aware neural network architecture

Reinforcement learning isn’t a supervised learning problem

Nuances of value-based deep reinforcement learning methods

Advantage of using advantages

A reinforcement-learning-aware architecture

Building a dueling network

Reconstructing the action-value function

Continuously updating the target network

What does the dueling network bring to the table?

PER: Prioritizing the replay of meaningful experiences

A smarter way to replay experiences

Then, what’s a good measure of “important” experiences?

Greedy prioritization by TD error

Sampling prioritized experiences stochastically

Proportional prioritization

Rank-based prioritization

Prioritization bias

Summary
