
5 Tackling more complex problems with Actor-Critic methods

 

This chapter covers:

  • The limitations of the REINFORCE algorithm described in the previous chapter
  • Introducing a critic to improve sample efficiency and decrease variance
  • Using the advantage function to speed up convergence
  • Speeding up the model by parallelizing training

In the previous chapter we introduced a vanilla policy gradient method called REINFORCE. The REINFORCE algorithm is generally implemented as an episodic algorithm, meaning that we only apply an update to our model parameters after the agent has completed an entire episode (collecting rewards along the way). Recall that the policy is a function π: S → P(a), that is, a function that takes a state and returns a probability distribution over actions. We then sample from this distribution to get an action, so the most probable action (the “best” action) is the one most likely to be chosen. At the end of the episode, we compute the return of the episode, which is essentially the sum of the discounted rewards collected along the way. If action 1 was taken in state A and resulted in a return of +10, the probability of action 1 given state A is increased a little, whereas if action 2 was taken in state A and resulted in a return of -20, the probability of action 2 given state A is decreased. Essentially, we minimized this loss function:

loss = −log(P(a | S)) · R
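
To make this concrete, here is a minimal sketch of that loss in PyTorch-style code (assuming, as a simplification, that the action probabilities come from a policy network; the tensor values and the helper name reinforce_loss below are made-up placeholders, not code from the previous chapter):

import torch

def reinforce_loss(log_probs, returns):
    # log_probs: log P(a_t | s_t) for the actions actually taken in the episode
    # returns:   discounted return observed from each time step
    return -(log_probs * returns).sum()

# Dummy 3-step episode; in practice the probabilities come from the policy network.
probs = torch.tensor([0.6, 0.3, 0.8], requires_grad=True)  # P(a | S) for the taken actions
returns = torch.tensor([10.0, 4.0, -2.0])                  # discounted returns
loss = reinforce_loss(torch.log(probs), returns)
loss.backward()   # gradients flow back toward the policy parameters

Minimizing this loss pushes up the log-probability of actions followed by positive returns and pushes down the log-probability of actions followed by negative returns, which is exactly the update rule described above.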

5.1   Combining the value and policy function

5.2   Distributed training

5.3   Advantage Actor-Critic

5.4   N-Step Actor-Critic

5.5   Summary and what’s next