chapter eight

8 Introduction to value-based deep reinforcement learning

 

This chapter covers

  • You’ll understand the inherent challenges of training reinforcement learning agents with non-linear function approximators.
  • You’ll create a deep reinforcement learning agent that, when trained from scratch with minimal adjustments to hyperparameters, can solve different kinds of problems.
  • You’ll identify the advantages and disadvantages of using value-based methods when solving reinforcement learning problems.

Human behavior flows from three main sources: desire, emotion, and knowledge.

— Plato, a philosopher in Classical Greece and founder of the Academy in Athens

8.1   The kind of feedback a deep reinforcement learning agent deals with

8.1.1   Deep reinforcement learning deals with sequential feedback

Deep reinforcement learning agents deal with sequential, evaluative, and sampled feedback. Up until now, you have studied two of the three properties (sequential and evaluative), first in isolation (MDPs involve sequential feedback, and bandits involve evaluative feedback) and then in interplay ('tabular' reinforcement learning is both sequential and evaluative).

Initially, we examined the issues with sequential feedback, in which actions have not only immediate but also long-term consequences. Remember MDPs? Value Iteration? Policy Iteration?

Figure 8.1  Sequential feedback
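
As a quick refresher on what sequential feedback means in practice, here is a minimal sketch in Python. It assumes a made-up, deterministic four-state MDP (the `transitions` table, the rewards, and the discount factor `GAMMA` are all hypothetical, chosen only for illustration) and runs a bare-bones value iteration on it. The point is that the action with the best immediate reward is not the action with the best long-term value.

```python
# Hypothetical deterministic MDP: transitions[state][action] = (next_state, reward).
# State 3 is terminal. All numbers are invented for illustration only.
transitions = {
    0: {"left": (3, 1.0), "right": (1, 0.0)},
    1: {"right": (2, 0.0)},
    2: {"right": (3, 10.0)},
}
TERMINAL = 3
GAMMA = 0.9  # assumed discount factor


def value_iteration(transitions, gamma, tol=1e-8):
    """Return optimal state values V and action values Q for a deterministic MDP."""
    V = {s: 0.0 for s in transitions}
    V[TERMINAL] = 0.0  # the terminal state has no future reward
    while True:
        # One-step lookahead: immediate reward plus discounted value of the next state.
        Q = {
            s: {a: r + gamma * V[s2] for a, (s2, r) in acts.items()}
            for s, acts in transitions.items()
        }
        # Largest change in any state value during this sweep.
        delta = max(abs(max(Q[s].values()) - V[s]) for s in transitions)
        for s in transitions:
            V[s] = max(Q[s].values())
        if delta < tol:
            return V, Q


V, Q = value_iteration(transitions, GAMMA)

# Myopic choice: pick the action with the largest immediate reward in state 0.
myopic_action = max(transitions[0], key=lambda a: transitions[0][a][1])
# Far-sighted choice: pick the action with the largest action value in state 0.
farsighted_action = max(Q[0], key=Q[0].get)

print(myopic_action, farsighted_action)  # left right
```

In state 0, going left pays 1 immediately and ends the episode, while going right pays nothing at first but leads to a reward of 10 two steps later, worth about 8.1 after discounting. An agent that looks only at immediate rewards picks left; an agent that accounts for long-term consequences picks right.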

8.1.2   But, if it is not sequential, what is it?

8.2   Deep reinforcement learning deals with evaluative feedback

8.2.1   But, if it is not evaluative, what is it?

8.2.2   Deep reinforcement learning deals with sampled feedback

8.2.3   But, if it is not sampled, what is it?

8.2.4   Deep reinforcement learning deals with the most challenging sides of all dimensions

8.3   Introduction to value-function approximation

8.3.1   What's a high-dimensional state space?

8.3.2   How about a continuous state space?

8.3.3   But why use a function approximator?

8.4   NFQ: A first attempt at value-based deep reinforcement learning

8.4.1   First decision point: Selecting a value function to approximate

8.4.2   Second decision point: Selecting a neural network architecture

8.4.3   Third decision point: Selecting what to optimize

8.4.4   Fourth decision point: Targets for policy evaluation

8.4.5   Fifth decision point: Balancing exploration and exploitation

8.4.6   Sixth decision point: Selecting a loss function

8.4.7   Seventh decision point: Selecting an optimization method to minimize the loss function

8.4.8   Regrets: Things that could (and do) go wrong

8.5   Summary