This chapter covers
- Why a full probability distribution is better than a single number
- Extending ordinary deep Q-networks to output full probability distributions over Q values
- Implementing a distributional variant of DQN to play Atari Freeway
- Understanding the ordinary Bellman equation and its distributional variant
- Prioritizing experience replay to improve training speed
We introduced Q-learning in chapter 3 as a way to determine the value of taking each possible action in a given state; these values are called action values or Q values. A policy then uses these action values to select actions, for example by choosing the action with the highest value. In this chapter we extend Q-learning to learn not just a point estimate of each action value but an entire probability distribution over action values for each action; this is called distributional Q-learning. Distributional Q-learning has been shown to deliver dramatically better performance on standard benchmarks, and it also allows for more nuanced decision-making, as you will see. Distributional Q-learning, combined with some of the other techniques covered in this book, is currently considered a state-of-the-art advance in reinforcement learning.
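To make the distinction concrete, here is a minimal PyTorch sketch (not the implementation we build later in the chapter) of a network head that outputs a probability distribution over a fixed set of candidate Q values, or "atoms," for each action, rather than one Q value per action. The layer sizes, the 51-atom support, the value range, and the three-action setup are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DistributionalQNet(nn.Module):
    """Sketch of a distributional Q-network: for each action it outputs a
    probability distribution over a fixed support of candidate Q values,
    instead of a single point estimate."""
    def __init__(self, state_dim=128, n_actions=3, n_atoms=51,
                 v_min=-10.0, v_max=10.0):
        super().__init__()
        self.n_actions = n_actions
        self.n_atoms = n_atoms
        # Fixed, evenly spaced support of candidate Q values (the "atoms")
        self.register_buffer("support", torch.linspace(v_min, v_max, n_atoms))
        self.net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, n_actions * n_atoms),
        )

    def forward(self, state):
        logits = self.net(state).view(-1, self.n_actions, self.n_atoms)
        # Softmax over the atom dimension yields a valid probability
        # distribution for each action
        return torch.softmax(logits, dim=2)

    def q_values(self, state):
        # An ordinary point estimate is recovered as the expectation
        # (probability-weighted mean) of each action's distribution
        probs = self.forward(state)
        return (probs * self.support).sum(dim=2)

# Usage: compare the full distribution to the point estimate it summarizes
model = DistributionalQNet()
state = torch.randn(1, 128)        # hypothetical state vector
dist = model(state)                # shape (1, 3, 51): one distribution per action
q = model.q_values(state)          # shape (1, 3): one expected Q value per action
```

The point of the extra output dimension is that two actions can share the same expected Q value while having very different distributions (one narrow and safe, one wide and risky), and only the distributional output lets the agent tell them apart.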