6 Stabilizing value-based deep reinforcement learning methods


In this chapter:

  • You'll improve on the methods you learned in the previous chapter by making them more stable and therefore less prone to divergence.
  • You'll explore advanced value-based deep reinforcement learning methods, and the many components that make value-based methods better.
  • You'll implement more complex exploration strategies and flexible loss functions with function approximation.
  • You'll solve the cart-pole environment in fewer samples, and with more reliable and consistent results.

"Ykxut zvt eitms J sm yphpa. Abotx cxt esimt J mz asu. Yrg J sawlya trd vr ertsaaep noiomte kmlt ruv nxvp rv arhce ktl hmegsiotn onresrgt, drpeee. Xgn nvrp vn tmreat xrb moionte, J nss cerah vlt s abltityis cprr seplh om csocmhlipa wrbs cj orq fcxd."

Xtkg Vmauaoll , Y mferro Bamrnice flalotbo gtnrso tsyaef, Smaano setnced

 

6.1   DQN: Making reinforcement learning more like supervised learning

Common problems in value-based deep reinforcement learning

It's important that we clearly understand the two most common problems that consistently show up in value-based deep reinforcement learning.

Refresh My Memory

Non-stationarity of targets

 

The first problem is the non-stationarity of the target values. These are the targets we use to train our network, yet they are calculated using the network itself. For instance, in NFQ we used the off-policy TD target (r + gamma*max_a'Q(s',a'; w)), and as you can see, this target value is calculated using Q, the very function we are trying to estimate.

In supervised learning, the targets are the labels in your dataset and are fixed throughout training. In reinforcement learning, and in the extreme case of fully online learning, these targets move with every training step of the network. At every update, we improve the value function and therefore change the shape of possibly the entire function, which means the target values change as well; our estimates become outdated the moment we compute them. In NFQ, we lessen this problem by "fitting to" progressively more accurate values: we recalculate the targets and update the network several times before going out to collect more samples. Also, because we do this with a large batch of samples, the updates to the neural network are composed of many points along the function and therefore produce more stable changes. Still, we can certainly do better, and we will in this chapter.

Refresh My Memory

Data is not Independent and Identically distributed (I.I.D.)

 

The second problem is the non-compliance with the IID assumption of the data. Optimization methods have been developed with the assumption that samples in the data set we train with are independent and identically distributed.

We know, however, that our samples are not independent; instead, they come from a sequence, a time series, a trajectory. The sample at time step t+1 depends on the sample at time step t. Samples are correlated, and we can't prevent that from happening; it's a natural consequence of online learning.

But samples are also not identically distributed as they depend on the policy that generates the actions. We know the policy is changing through time, and for us that's a good thing. We want policies to improve. But that also means the distribution of samples (state-action pairs visited) will change as we keep improving.

But not all is lost. In DQN, we try to answer the question: How do we make reinforcement learning look more like supervised learning? Are there any tweaks we can implement to make the targets more stationary and the data more IID?

Using a target network

A straightforward way to make target values more stationary is to have a separate network that we fix for multiple steps and use to calculate more stable target values. This network is called the target network, as it is used to calculate the targets.
DQN with target networks
Divergence in DQN without Target Networks
By having a target network and fixing our target values, we mitigate the "chasing your own tail" issue by artificially creating a sequence of small supervised learning problems: our targets are now fixed for as many steps as we freeze the target network. This improves our chances of convergence, not to the optimal values, because such guarantees don't exist with non-linear function approximation, but convergence in general. More importantly, it substantially reduces the chances of divergence.

It's important to note that in practice we don't really have two "networks"; instead, we have two instances of the neural network weights. We use the same model architecture and frequently set the variable holding the target network weights to the online network weights (the network we optimize at every step). "Frequently" here means something different depending on the problem, unfortunately. It's common to freeze the target network weights for 10 to 10,000 time steps at a time, again depending on the problem (that's time steps, not episodes; be careful there!). If you are using a convolutional neural network, such as in ATARI games, then 10,000 time steps is the norm, but that's too much for simpler problems such as the cart-pole environment, where 10-20 steps is more appropriate.

Show Me The Math

Target Network
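The boxed equation isn't reproduced in this extract; reconstructed from the surrounding text, with w denoting the online weights and w⁻ the frozen target-network weights, the target and the loss minimized by the online network would look roughly like this:

    y = r + \gamma \max_{a'} Q(s', a'; w^{-})

    L(w) = \mathbb{E}\left[ \left( y - Q(s, a; w) \right)^{2} \right]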

 

By using target networks, we prevent the training process from spiraling out of control, because we fix the targets for multiple time steps, allowing the online network weights to catch up before a new set of targets is computed. Unfortunately, we also slow down learning, because we are no longer training on up-to-date values; the frozen weights can lag by up to 10,000 time steps at any given time.

I Speak Python

Use of the target and online networks in DQN
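The book's listing isn't reproduced in this extract. Below is a minimal sketch of how the two copies of the weights are used, assuming PyTorch; the names (`online_model`, `target_model`, `update_network`, `optimize_model`) are illustrative, not the book's.

    import torch
    import torch.nn as nn

    def update_network(online_model, target_model):
        # Copy the online weights into the target network.
        # Called every few steps (e.g., every 15 steps on cart-pole).
        target_model.load_state_dict(online_model.state_dict())

    def optimize_model(batch, online_model, target_model, optimizer, gamma=1.0):
        # batch: states (B, obs_dim), actions (B,) long, rewards (B,),
        #        next_states (B, obs_dim), is_terminals (B,) float tensors.
        states, actions, rewards, next_states, is_terminals = batch
        # Targets come from the frozen target network; no gradients flow through it.
        with torch.no_grad():
            max_a_q_sp = target_model(next_states).max(dim=1)[0]
            target_q_sa = rewards + gamma * max_a_q_sp * (1 - is_terminals)
        # Current estimates come from the online network, the one we train.
        q_sa = online_model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, target_q_sa)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()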

Use larger networks

Another way to lessen the non-stationarity issue, to some degree, is to use larger networks. Larger networks reduce the aliasing of states; the bigger the network, the less the aliasing. The less the aliasing, the less apparent the correlation between consecutive samples. All of this can make target values and current estimates more independent of each other. By "aliasing" here I refer to the fact that two states can look like the same (or a very similar) state to the neural network, yet still require different actions. State aliasing happens when networks lack representational power. After all, neural networks try to find similarities in order to generalize, but with too small a network, this "generalization" can go wrong.

One of the motivations for using a target network is that it allows you to more easily differentiate between correlated states; using a larger network helps your network "see" these differences, too. That being said, a larger network takes longer to train. It needs not only more data (interaction time), but also more compute (processing time). Simply using a target network is the more effective approach to mitigating the non-stationarity problem. Still, it's good to know how these two properties of your agent (the size of your networks, and the use of target networks along with the update frequency) interact in interesting ways and affect performance in similar ways.

Boil It Down

Ways to mitigate the fact that targets in reinforcement learning are non-stationary

 

Allow me to restate that, in order to mitigate the non-stationarity issue, we can (1) create a target network that provides us with temporarily stationary target values, and (2) create networks large enough that they can "see" the small differences between similar states (such as those that are temporally correlated).

Now, target networks work, and work well; they have been proven to work many times. "Larger networks" is more of a hand-wavy solution than something that's been proven to work. Still, feel free to experiment with the Notebooks provided; you'll find it very easy to change values and test hypotheses.

Experience Replay

In our NFQ agent, we hold a batch of 1,024 samples and use it multiple times (40 times in our specific implementation), alternating between calculating targets and optimizing the network. Now, these 1,024 samples are temporally correlated because they are collected along the same trajectory (episode); or at least many of them will be correlated, though not all, since the maximum number of steps in a cart-pole episode is 500. One way to improve on this is to use what is called experience replay. This strategy consists of a data structure, often referred to as a replay buffer or replay memory, that holds experience samples over multiple steps (many more than 1,024 samples) and allows for the sampling of past experiences, commonly uniformly at random.


A Bit Of History

Introduction of experience replay

 

Experience replay was introduced by Long-Ji Lin in a paper titled "Self-Improving Reactive Agents Based On Reinforcement Learning, Planning and Teaching", believe it or not, published in 1992! Yep, that's when neural networks were referred to as "connectionism"... Yikes!

After getting his Ph.D. from CMU, Dr. Lin has moved through several technical roles in many different companies. Currently, he's the Chief Scientist at Signifyd, leading a team that works on a system to predict and prevent online fraud.

There are multiple benefits to using experience replay. By sampling at random, we increase the probability that our updates to the neural network will have lower variance. When we used the batch in NFQ, most of the samples in that batch were correlated and similar. Updating with similar samples concentrates the updates we make to our neural network on a limited area of the function, and it potentially over-emphasizes the magnitude of those updates. If we sample uniformly at random from a very large buffer, on the other hand, chances are our updates to the network will be better distributed across the function, and therefore more representative of it.

Using a replay buffer also gives the impression that our data is IID, so optimization methods will be better behaved. Samples will seem independent and identically distributed because we sample from multiple trajectories and policies at once. By storing experiences and later sampling them uniformly, we make the data entering the optimization method look independent and identically distributed. In practice, the replay buffer needs a very large capacity for this to work well, from 10,000 to 1,000,000 experiences depending on the problem. Once you hit the maximum size, you evict the oldest experience before inserting the new one.
DQN with Replay Buffer

Boil It Down

Experience replay makes the data look more IID

 

The best solution to our second problem (data is not IID) is called experience replay.

The technique is straightforward, and it's been around for decades: as your agent collects experience tuples e_t = (s_t, a_t, r_{t+1}, s_{t+1}) online, we insert them into a data structure, commonly referred to as the replay buffer D, such that D = {e_1, e_2, ..., e_M}. M, the size of the replay buffer, is often a value between 10,000 and 1,000,000, again depending on the problem.

We can then train the agent on mini-batches sampled from the buffer, usually uniformly at random, meaning each sample has an equal probability of being selected, but possibly with some other distribution (beware, though, that it's not that straightforward; we'll discuss other ways of sampling experiences in the next chapter).

 

Show Me The Math

Replay Buffer
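The boxed equation isn't reproduced here; reconstructed from the text (D is the buffer, U(D) denotes uniform sampling from it, and w⁻ the target-network weights), the DQN loss over replayed experiences would be:

    D = \{e_1, e_2, \ldots, e_M\}, \qquad e_t = (s_t, a_t, r_{t+1}, s_{t+1})

    L_i(w_i) = \mathbb{E}_{(s,a,r,s') \sim U(D)}\left[ \left( r + \gamma \max_{a'} Q(s', a'; w^{-}) - Q(s, a; w_i) \right)^{2} \right]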

 

Unfortunately, the implementation becomes a bit of a challenge when working with high-dimensional observations and a buffer of large capacity. In problems such as ATARI games, for instance, if you are learning from raw images and each state representation is a stack of the 4 latest frames, as is common for ATARI games, you probably don't have enough memory on your personal computer to naively store 1,000,000 experience samples. For the cart-pole environment, this is not much of a problem: first, we don't need 1,000,000 samples, so we use a buffer of size 50,000 instead; second, each state is represented by a vector of only 4 elements.

Nevertheless, by using a replay buffer, your data looks more IID than it really is, and by training on uniformly sampled mini-batches, you make the samples look more like a traditional supervised learning dataset. Sure, the data is still changing as you add new samples and discard old ones, but these changes happen slowly, so they go somewhat unnoticed by the neural network and optimizer.

I Speak Python

A simple replay buffer
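The book's listing isn't reproduced in this extract; here is a minimal sketch of a uniform replay buffer under the settings described above (class and method names are illustrative):

    import numpy as np

    class ReplayBuffer:
        def __init__(self, max_size=50000, batch_size=64):
            self.max_size = max_size
            self.batch_size = batch_size
            self.buffer = []   # list of (s, a, r, s', done) tuples
            self.idx = 0       # position of the oldest experience once full

        def store(self, experience):
            if len(self.buffer) < self.max_size:
                self.buffer.append(experience)
            else:
                # Evict the oldest experience by overwriting it.
                self.buffer[self.idx] = experience
                self.idx = (self.idx + 1) % self.max_size

        def sample(self, batch_size=None):
            batch_size = batch_size or self.batch_size
            idxs = np.random.choice(len(self.buffer), batch_size, replace=False)
            # Stack each element (states, actions, ...) into its own array.
            return [np.vstack([self.buffer[i][e] for i in idxs]) for e in range(5)]

        def __len__(self):
            return len(self.buffer)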

Using other exploration strategies

Exploration is a vital component of reinforcement learning. In the NFQ algorithm, we explored using a strategy known as epsilon-greedy: you select the greedy action (the action with the current highest estimated value) every time, unless a number drawn from a uniform distribution over [0, 1) is less than a hyper-parameter constant called epsilon, in which case you select an action uniformly at random (possibly including the greedy action). There are additional strategies that were introduced in chapters 3 and 4, and I have adapted them for use with neural networks. They are all included in the Notebooks and ready to be tested. Feel free to play around, and have fun.

I Speak Python

Linearly decaying epsilon-greedy exploration strategy
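The listing isn't reproduced here; a minimal sketch of the idea follows, assuming PyTorch and illustrative names. Epsilon decays linearly from an initial value to a minimum over a fixed number of steps and then stays there.

    import numpy as np
    import torch

    class LinearlyDecayingEGreedy:
        def __init__(self, init_epsilon=1.0, min_epsilon=0.3, decay_steps=20000):
            self.init_epsilon = init_epsilon
            self.min_epsilon = min_epsilon
            self.decay_steps = decay_steps
            self.t = 0

        def _epsilon(self):
            # Linear interpolation from init_epsilon to min_epsilon.
            fraction = min(self.t / self.decay_steps, 1.0)
            return self.init_epsilon + fraction * (self.min_epsilon - self.init_epsilon)

        def select_action(self, model, state):
            # model is assumed to be a PyTorch module mapping states to Q-values.
            with torch.no_grad():
                s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
                q_values = model(s).cpu().numpy().squeeze()
            if np.random.rand() > self._epsilon():
                action = int(np.argmax(q_values))          # exploit: greedy action
            else:
                action = np.random.randint(len(q_values))  # explore: uniform random
            self.t += 1
            return action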

 

I Speak Python

Exponentially decaying epsilon-greedy exploration strategy
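Again a sketch rather than the book's listing: here epsilon is multiplied by a constant decay rate every step, so it decays exponentially from the initial value toward the minimum.

    import numpy as np
    import torch

    class ExponentiallyDecayingEGreedy:
        def __init__(self, init_epsilon=1.0, min_epsilon=0.3, decay_steps=20000):
            self.min_epsilon = min_epsilon
            # Rate chosen so epsilon reaches roughly min_epsilon after decay_steps.
            self.decay_rate = np.power(min_epsilon / init_epsilon, 1.0 / decay_steps)
            self.epsilon = init_epsilon

        def select_action(self, model, state):
            with torch.no_grad():
                s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
                q_values = model(s).cpu().numpy().squeeze()
            if np.random.rand() > self.epsilon:
                action = int(np.argmax(q_values))          # greedy action
            else:
                action = np.random.randint(len(q_values))  # uniformly random action
            # Decay epsilon, but never below the minimum.
            self.epsilon = max(self.min_epsilon, self.epsilon * self.decay_rate)
            return action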

 

I Speak Python

SoftMax exploration strategy
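The listing isn't shown in this extract; the sketch below (illustrative names, PyTorch assumed) samples actions in proportion to exp(Q(s,a)/temperature), with the temperature decaying linearly over time so the policy becomes greedier.

    import numpy as np
    import torch

    class SoftMaxStrategy:
        def __init__(self, init_temp=1.0, min_temp=0.1, decay_steps=20000):
            self.init_temp = init_temp
            self.min_temp = min_temp
            self.decay_steps = decay_steps
            self.t = 0

        def _temperature(self):
            fraction = min(self.t / self.decay_steps, 1.0)
            return self.init_temp + fraction * (self.min_temp - self.init_temp)

        def select_action(self, model, state):
            temp = self._temperature()
            with torch.no_grad():
                s = torch.as_tensor(state, dtype=torch.float32).unsqueeze(0)
                q_values = model(s).cpu().numpy().squeeze()
            # Subtract the max for numerical stability before exponentiating.
            scaled = (q_values - q_values.max()) / temp
            probs = np.exp(scaled) / np.sum(np.exp(scaled))
            action = np.random.choice(len(q_values), p=probs)
            self.t += 1
            return int(action)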

In the plots on the right, you can see how the value of epsilon (or the temperature) changes as time steps go by.
In NFQ, we used epsilon-greedy with a constant value of 0.5. Yes! That is, 50% of the time we acted greedily and 50% of the time we chose uniformly at random. Given that there are only two actions in this environment, the effective probability of choosing the greedy action is 75%, and the chance of selecting the non-greedy action is 25%. Notice that if the action space were larger, the probability of selecting the greedy action would be smaller. In the Notebooks, I output this effective probability value under `ex 100`, which means "ratio of exploratory actions over the last 100 steps".
In DQN and all remaining value-based algorithms in this and the following chapter, I use the exponentially decaying epsilon-greedy strategy. I just happen to prefer this one, but other strategies may be worth exploring. I've noticed that even a small difference in hyper-parameters makes a very big difference in performance. Wanna try?
I highly encourage you to go through the Notebooks and play with the many different hyper-parameters, exploration strategies, and so on. There is a lot more to deep reinforcement learning than just the algorithms.
Once you play around with the hyper-parameters, you'll understand why there are complaints about the reproducibility of deep reinforcement learning papers. RL is hard.

It's In The Details

The full Deep Q-Network (DQN) algorithm

 

DQN has very similar components and settings to NFQ. We:

·   Approximate the action-value function Q(s,a; w).

·   Use a state-in-values-out architecture (nodes: 4, 512, 128, 2).

·   Optimize the action-value function to approximate the optimal action-value function q*(s,a).

·   Use off-policy TD targets (r + gamma*max_a'Q(s',a')) to evaluate policies.

·   Use mean squared error (MSE) for our loss function.

·   Use RMSprop as our optimizer with a learning rate of 0.0005.

In the DQN implementation we now:

·   Use an exponentially decaying epsilon-greedy strategy (from 1.0 to 0.3 in roughly 20,000 steps) to improve policies.

·   Use a replay buffer with 320 samples min, 50,000 max, and a batch of 64.

·   Use a target network that freezes for 15 steps and then updates fully.

DQN has 3 main steps:

 

  1)  Collect experience: (s_t, a_t, r_{t+1}, s_{t+1}, d_{t+1}), and insert it into the replay buffer.

  2)  Pull a batch out of the buffer and calculate the off-policy TD targets: r + gamma*max_a'Q(s',a').

  3)  Fit the action-value function Q(s,a; w): Using MSE and RMSprop.
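Putting the three steps together, a skeleton of the training loop could look like the sketch below. It assumes a classic Gym-style environment (four-value `step` return) and the illustrative helpers sketched earlier in this chapter (`strategy.select_action`, `replay_buffer.store`/`sample`, `optimize_model`, `update_network`); it is not the book's listing.

    import torch

    def train(env, online_model, target_model, optimizer, strategy, replay_buffer,
              gamma=1.0, min_samples=320, target_update_every=15, max_steps=100000):
        update_network(online_model, target_model)  # start both copies in sync
        state = env.reset()
        for step in range(1, max_steps + 1):
            # 1) Collect experience and insert it into the replay buffer.
            action = strategy.select_action(online_model, state)
            next_state, reward, done, _ = env.step(action)
            replay_buffer.store((state, action, reward, next_state, float(done)))
            state = env.reset() if done else next_state

            # 2) Sample a mini-batch and compute the off-policy TD targets,
            # 3) then fit Q(s, a; w); both happen inside optimize_model.
            if len(replay_buffer) >= min_samples:
                s, a, r, sp, d = replay_buffer.sample()
                batch = (torch.as_tensor(s, dtype=torch.float32),
                         torch.as_tensor(a.squeeze(), dtype=torch.long),
                         torch.as_tensor(r.squeeze(), dtype=torch.float32),
                         torch.as_tensor(sp, dtype=torch.float32),
                         torch.as_tensor(d.squeeze(), dtype=torch.float32))
                optimize_model(batch, online_model, target_model, optimizer, gamma)

            # Periodically copy the online weights into the target network.
            if step % target_update_every == 0:
                update_network(online_model, target_model)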

 

 


A Bit Of History

Introduction of the DQN Algorithm

 

DQN was introduced in 2013 by Volodymyr "Vlad" Mnih in a paper called "Playing Atari with Deep Reinforcement Learning". This paper introduced DQN with experience replay. In 2015, another paper came out: "Human-level control through deep reinforcement learning". This second paper introduced DQN with the addition of target networks; the full DQN version you just learned about.

Vlad got his Ph.D. under Geoffrey Hinton (one of the fathers of deep learning), and works as a Research Scientist at Google DeepMind. He's been recognized for his DQN contributions, and has been included in the 2017 MIT Technology Review 35 Innovators under 35 list.

 

Tally

DQN passes the cart-pole environment


6.2   Double DQN: Mitigating the overestimation of approximate action-value functions

The problem of overestimation

As you probably remember from chapter 4, Q-learning has a tendency to overestimate action-value functions. Our DQN agent is no different; after all, we are using the same off-policy TD target, with its max operator. The crux of the problem is simple: we are taking the max of estimated values. Estimated values will be off, some higher than the true values, some lower, but the bottom line is that they will be off. The problem is that we always take the max of these values, so we have a preference for higher values, even when they are not correct. Our algorithms therefore show a positive bias, and performance suffers.

Miguel's Analogy

The issue with over-optimistic agents, and people

 

No, seriously, just imagine you meet a very optimistic person, let's call her DQN. DQN is very optimistic. She's experienced many things in life, from the toughest defeat, to the highest success. The problem with DQN, though, is she expects the sweetest possible outcome from every single thing she does. Is that a problem?

One day, DQN went to a local casino. It was her first time, but lucky DQN hit the jackpot at the slot machines. Optimistic as she is, DQN immediately adjusted her value function: "Going to the casino is very rewarding [the value of Q(s,a) should be very high] because at the casino you can go to the slot machines [next state s'] and playing the slot machines gives you the jackpot [max_a Q(s',a)]".

But there are multiple issues with this thinking. To begin with, not every time DQN goes to the casino does she play the slot machines. She likes to try new things too [explore], and sometimes she tries the roulette, poker, or blackjack [tries a different action]. Sometimes the slot machine area is under maintenance and not accessible [the environment transitions her somewhere else]. Also, most of the time DQN plays the slot machines, she doesn't get the jackpot [environment stochasticity]. After all, slot machines are called bandits for a reason.

Separating action selection and action evaluation

One way to better understand the positive bias and how we can address it is by unwrapping the max operator in the target calculations. The max of an action-value function is the same as the action-value function of the argmax action.

Refresh My Memory

What's an argmax, again?

 

The argmax function is defined as the arguments of the maxima. The argmax of an action-value function, argmax_a Q(s,a), is simply the index of the action with the maximum value at the given state s.

So, for example, if you have a Q(s) with values [-1, 0, -4, -9] for actions 0-3, then max_a Q(s,a) is 0, which is the maximum value, and argmax_a Q(s,a) is 1, which is the index of the maximum value.

So, let's unpack the earlier sentence about the max and the argmax. At first, this unwrapping might seem like a silly step, but it actually helps us understand how to mitigate the problem.

Show Me The Math

Unwrapping the argmax
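The boxed math isn't reproduced here; reconstructed from the surrounding text, the identity being used is that the max is the value of the argmax action (w⁻ denotes the target-network weights used for the DQN target):

    \max_{a'} Q(s', a'; w^{-}) = Q\left(s', \operatorname{argmax}_{a'} Q(s', a'; w^{-}); w^{-}\right)

    y = r + \gamma\, Q\left(s', \operatorname{argmax}_{a'} Q(s', a'; w^{-}); w^{-}\right)

Written this way, it's explicit that the same network both selects the action and evaluates it.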

 

I Speak Python

Unwrapping the max in DQN
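The listing isn't shown in this extract; below is a sketch (PyTorch assumed, names illustrative) of the target calculation with the max written as an argmax followed by a lookup, so both questions are visibly asked of the same network.

    import torch

    def dqn_unwrapped_targets(batch, target_model, gamma=1.0):
        states, actions, rewards, next_states, is_terminals = batch
        with torch.no_grad():
            q_sp = target_model(next_states)
            # "Which action is the highest-valued action?" (selection)
            argmax_a_q_sp = q_sp.argmax(dim=1, keepdim=True)
            # "What's the value of that action?" (evaluation, same network)
            max_a_q_sp = q_sp.gather(1, argmax_a_q_sp).squeeze(1)
            targets = rewards + gamma * max_a_q_sp * (1 - is_terminals)
        return targets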

 

All we are saying here is that taking the max is like asking:

"What's the value of the highest-valued action?"

But we are really asking two questions with a single question. First, we do an argmax, which is equivalent to asking:

"Which action is the highest-valued action?"

And then we use that action to get its value. This is equivalent to asking:

"What's the value of this action (which happens to be the highest-valued action)?"

One of the problems is that we are asking both of these questions of the same action-value function, which will show the same bias in the answers to both questions.

In other words, the function approximator will answer:

"I think this one is the highest-valued action, and this is its value."

A solution

A way to reduce the chance of positive bias is to have two instances of the action-value function. If you have a second source of estimates, you can ask one question of one source and the other question of the other. It's somewhat like taking votes, like an "I cut, you choose first" procedure, or like getting a second doctor's opinion on your health. In double learning, one estimator selects the index of what it believes to be the highest-valued action, and the other estimator gives its own value estimate for that selected action.

Refresh My Memory

Double learning procedure

 

We did this procedure with tabular reinforcement learning in Chapter 4 under the Double Q-learning algorithm. It goes like this:

·   You create two action-value functions, QA and QB.

·   You flip a coin to decide which action-value function to update. E.g.: QA on heads, QB on tails.

·   If you got heads and thus get to update QA: You select the action index from QA, and evaluate that action using the estimate QB predicts. Then, you proceed to update QA as usual, and leave QB alone.

·   If you got tails and thus get to update QB, you do it the other way around: get the action index from QB, evaluate it using the estimate QA predicts, then update QB as usual, and leave QA alone.

Implementing this double learning procedure exactly as described, but with function approximation (for DQN), creates a lot of overhead. We would need four networks: two networks for training (QA, QB) and two target networks (one for each online network). Moreover, we would be training only one of these networks at a time, so only one network would improve per time step, which is certainly a waste. And although doing this double learning procedure with function approximators may still be better than not doing it at all, despite the extra overhead, fortunately for us there is a simple modification to the original double learning procedure that adapts it to DQN and gives us similar improvements without the extra overhead.

A more practical solution

Instead of adding extra overhead that would slow down training, we can do double learning with the "extra" network we already have: the target network. However, instead of training both the online and target networks, we continue training only the online network, and use the target network to help us create more stable estimates.

We want to be careful about which network we use for action selection and which network we use for action evaluation. Originally, we added the target network to stabilize training by preventing us from chasing a moving target. To continue on this path, we want to use the network we are training, the online network, to answer the first question; in other words, to find the index of the best action. Then we use the target network to ask the second question, that is, to evaluate the previously selected action index. This is the ordering that works best in practice, and it makes sense why: by using the target network for the value estimates, we make sure the target values stay frozen, as needed for stability. If we implemented it the other way around, the values would come from the online network, which is updated at every time step and therefore changes continuously.
Selecting action, evaluating action


A Bit Of History

Introduction of the Double DQN Algorithm

 

Double DQN was introduced in 2015 by Hado van Hasselt, shortly after the release of the 2015 version of DQN (the 2015 version of DQN is sometimes referred to as "Nature" DQN, because it was published in the scientific journal Nature, and sometimes as "vanilla" DQN, because it is the first of many variants to come).

In 2010, Hado also authored the Double Q-learning algorithm (double learning for the tabular case), as an improvement to the Q-learning algorithm.

Double DQN, also referred to as DDQN, was the first of many improvements proposed over the years for DQN. Back in 2015, when it was first introduced, DDQN obtained state-of-the-art (best at the moment) results in the ATARI domain.

Hado obtained his Ph.D. from the University of Utrecht in the Netherlands in Artificial Intelligence (Reinforcement Learning). After a couple of years as a postdoctoral researcher, he got a job at Google DeepMind as a Research Scientist.

 

Show Me The Math

DDQN loss function
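The boxed equation isn't reproduced here; reconstructed from the description above (online weights w select the action, target weights w⁻ evaluate it), the DDQN loss would be:

    L(w) = \mathbb{E}\left[ \left( r + \gamma\, Q\left(s', \operatorname{argmax}_{a'} Q(s', a'; w); w^{-}\right) - Q(s, a; w) \right)^{2} \right]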

 

I Speak Python

Double DQN
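The book's listing isn't reproduced in this extract; a minimal sketch of the double-learning update, assuming PyTorch and illustrative names, follows. The only change from the earlier DQN sketch is that the online network selects the next action and the target network evaluates it.

    import torch
    import torch.nn as nn

    def ddqn_optimize(batch, online_model, target_model, optimizer, gamma=1.0):
        states, actions, rewards, next_states, is_terminals = batch
        with torch.no_grad():
            # Selection: best next action according to the online network.
            argmax_a_q_sp = online_model(next_states).argmax(dim=1, keepdim=True)
            # Evaluation: value of that action according to the target network.
            q_sp = target_model(next_states).gather(1, argmax_a_q_sp).squeeze(1)
            target_q_sa = rewards + gamma * q_sp * (1 - is_terminals)
        q_sa = online_model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, target_q_sa)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()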

A more forgiving loss function

In the previous chapter, we selected the L2 loss, also known as mean squared error (MSE), as our loss function, mostly for its widespread use and simplicity. And, in reality, for a problem such as the cart-pole environment, there may not be a good reason to look any further. However, because I'm teaching you the ins and outs of the algorithms and not just "how to hammer the nail", I'd like to make you aware of the different knobs available, so you can play around with them when tackling more challenging problems.

MSE is a very common loss function because it is simple, it makes sense, and it works well. But one of the issues with using MSE for reinforcement learning is that it penalizes large errors more than small errors. This makes sense in supervised learning, because our targets are the true values from the get-go and are fixed throughout the training process. That means we are confident that if the model is very wrong, it should be penalized more heavily than if it is only slightly wrong.
But in reinforcement learning, we don't have these true values, and the values we use to train our network actually depend on the agent itself. That's a mind shift. In addition, targets are constantly changing; even when using target networks, they still change often. In reinforcement learning, being very wrong is something we expect and welcome. At the end of the day, if you think about it, we are not really "training" agents; our agents learn on their own. Think about that for a second.
A loss function that is not as unforgiving, and is also more robust to outliers, is the mean absolute error, also known as MAE or L1 loss. MAE is defined as the average absolute difference between the predicted and true values, that is, between the predicted action-value function and the TD target. Given that MAE is linear, as opposed to quadratic like MSE, we can expect MAE to be more successful at ignoring extreme errors. This comes in handy in our case, because we expect our action-value function to give wrong values at some point during training, particularly at the beginning. Being more resilient to outliers implies that errors have less effect, as compared to MSE, on the changes to our network, which means more stable learning.

On the flip side, one of the nice things about MSE that MAE lacks is that its gradients decrease as the loss goes to zero. This is nice for optimization methods, because lower gradients mean smaller steps, which makes it easier to settle on the optima. But luckily for us, there is a loss function that is somewhat a mix of MSE and MAE, called the Huber loss. The Huber loss has the same nice property as MSE of quadratically penalizing errors near zero, but it is not quadratic all the way out for very large errors. Instead, the Huber loss is quadratic (curved) near zero error and becomes linear (straight) for errors larger than a preset threshold. Having the best of both worlds makes the Huber loss robust to outliers, just like MAE, and differentiable at 0, just like MSE.
The Huber loss uses a hyper-parameter, δ, to indicate the threshold at which the loss goes from quadratic to linear; basically, from MSE to MAE. If δ is zero, you are left with MAE, and if δ is infinity, you are left exactly with MSE. A common value for δ is 1, but be aware that your loss function, optimizer, and learning rate interact in deep ways; if you change one, you may need to tune the others. Check out the Notebook for this chapter so you can play around.

Interestingly, there are two different ways to implement the Huber loss function that I'm aware of. You can either compute the Huber loss exactly as defined, or you can compute MSE instead and then set all components of the gradient larger than a threshold to a fixed magnitude value. The former depends on the deep learning framework you use; the problem is that some frameworks don't give you access to the δ hyper-parameter, so you are stuck with δ set to 1, which doesn't always work. The latter, often referred to as "loss clipping", or better yet "gradient clipping", is more flexible, and it is therefore what I implement in the Notebooks.

Know that there is also such a thing as "reward clipping", which is different from "gradient clipping". These are two very different things, so beware: one works on the rewards, the other on the errors (the loss). Above all, don't confuse either of these with "Q-value clipping", which is surely a mistake. Remember, in our case the goal is to prevent gradients from becoming too large. For this, we either make the loss linear beyond a given absolute TD error threshold, or make the gradient constant beyond a maximum gradient magnitude threshold.
In the cart-pole environment experiments you'll find in the Notebooks, I implement the Huber loss function using the "gradient clipping" technique: that is, I calculate MSE and then clip the magnitude of the gradients. However, as mentioned before, I set the hyper-parameter for the maximum gradient magnitude to infinity; therefore, we are effectively using good old MSE. But please experiment, play around, explore!

I Speak Python

Double DQN with Huber Loss
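Again, not the book's listing: this sketch shows the "gradient clipping" variant described above, assuming PyTorch; with `max_gradient_norm=float('inf')` it reduces to plain MSE, which is the setting used for cart-pole.

    import torch
    import torch.nn as nn

    def ddqn_huber_optimize(batch, online_model, target_model, optimizer,
                            gamma=1.0, max_gradient_norm=float('inf')):
        states, actions, rewards, next_states, is_terminals = batch
        with torch.no_grad():
            argmax_a_q_sp = online_model(next_states).argmax(dim=1, keepdim=True)
            q_sp = target_model(next_states).gather(1, argmax_a_q_sp).squeeze(1)
            target_q_sa = rewards + gamma * q_sp * (1 - is_terminals)
        q_sa = online_model(states).gather(1, actions.unsqueeze(1)).squeeze(1)
        loss = nn.functional.mse_loss(q_sa, target_q_sa)
        optimizer.zero_grad()
        loss.backward()
        # Clip the gradient norm so no single update is too large (Huber-like).
        torch.nn.utils.clip_grad_norm_(online_model.parameters(), max_gradient_norm)
        optimizer.step()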

 

It's In The Details

The full Double Deep Q-Network (DDQN) algorithm

 

DDQN is almost identical to DQN, but there are still some differences. We still:

·   Approximate the action-value function Q(s,a; w).

·   Use a state-in-values-out architecture (nodes: 4, 512, 128, 2).

·   Optimize the action-value function to approximate the optimal action-value function q*(s,a).

·   Use off-policy TD targets (r + gamma*max_a'Q(s',a')) to evaluate policies.

Notice that we now:

·   Use an adjustable Huber loss, though because we set the `max_gradient_norm` variable to `float('inf')`, we are effectively just using mean squared error (MSE) for our loss function.

·   Use RMSprop as our optimizer with a learning rate of 0.0007. Note that before we used 0.0005, because without double learning (vanilla DQN) some seeds fail when training with a learning rate of 0.0007, likely a stability issue. In DDQN, on the other hand, training with the higher learning rate works best.

In DDQN we are still using:

·   An exponentially decaying epsilon-greedy strategy (from 1.0 to 0.3 in roughly 20,000 steps) to improve policies.

·   A replay buffer with 320 samples min, 50,000 max, and a batch of 64.

·   A target network that freezes for 15 steps and then updates fully.

DDQN, just like DQN has the same 3 main steps:

 

  1)  Collect experience: (s_t, a_t, r_{t+1}, s_{t+1}, d_{t+1}), and insert it into the replay buffer.

  2)  Pull a batch out of the buffer and calculate the off-policy TD targets: r + gamma*max_a'Q(s',a'). But, now using double learning.

  3)  Fit the action-value function Q(s,a; w): Using MSE and RMSprop.

 

The bottom line is that the DDQN implementation and hyper-parameters are identical to those of DQN, except that we now use double learning and therefore train with a slightly higher learning rate. The addition of the Huber loss doesn't change anything, because we are "clipping" gradients to a maximum value of infinity, which is equivalent to using MSE. However, for many other environments you will find it useful, so tune this hyper-parameter.

 

Tally

DDQN is more stable than NFQ or DQN

Things we can still improve on

Surely, our current value-based deep reinforcement learning method is not perfect, but it is actually pretty good. DDQN can reach super-human performance in many of the ATARI games. To make that happen, you just have to change the network to take images as input (a stack of 4 frames, to be able to infer direction, velocity, and so on from the images; alternatively, you'd need to add a memory cell to the network) and, of course, tune the hyper-parameters. Yet, we can still go a step further. There are at least a couple of other improvements to consider that are easy to implement and will positively impact performance. The first improvement requires us to reconsider the current network architecture. As of right now, we have a very naive representation of the Q-function in our neural network architecture.

Refresh My Memory

Current neural network architecture

 

State-in-values-out architecture

We are literally "making reinforcement learning look like supervised learning". But we can, and should, break free from this constraint and think outside the box: is there a better way of representing the Q-function? Think about this for a second while you look at the images on the next page.

 

The images on the right are bar plots representing the action-value function Q, the state-value function V, and the action-advantage function A for the cart-pole environment at a pseudo-random state in which the pole is nearly perfectly vertical. Look at the different functions and values, and start thinking about how to better architect the neural network so that data is used more efficiently. As a hint, let me point out what the action-value function Q has in common with the state-value function V: both actions in Q(s) are indexed by the same state s (in the example on the right, s=[0.02, -0.01, -0.02, -0.04]).
The question is, would you be able to learn anything about Q(s, 0) if you are using a Q(s, 1) sample? Look at the action-advantage function A(s) plot and notice how much easier it is for you to eyeball the greedy action than on the action-value function Q(s) plot. What can you do about this? In the next chapter, we will look at a network architecture to help us exploit these relationships in Dueling DQN.
The other thing to consider improving is the way we sample experiences from the replay buffer. As of now, we pull samples from the buffer uniformly at random, and I'm sure your intuition questions this approach and suggests to you that we can do better than this. We can.
If you think about it, humans don't go around the world just remembering random things to learn from. There is definitely a more systematic way in which intelligent agents "replay memories". I'm pretty sure my dog chases rabbits in her sleep. There are experiences that are more important than others to our goals. Humans often replay experiences that caused them unexpected joy or pain. And it makes sense, it is important for you to learn from these experiences to generate more or less of those. In the next chapter, we'll also look at ways of prioritizing the sampling of experiences to get the most out of samples with a Prioritized Experience Replay (PER) buffer.

6.3   Summary

In this chapter, you learned about stabilizing value-based deep reinforcement learning methods. You dug deep into the components that make value-based methods more stable. You learned about replay buffers and target networks in an algorithm known as DQN ("Nature" DQN, or "vanilla" DQN). You then improved on it by implementing a double learning strategy that works efficiently with function approximation, in an algorithm called DDQN. In addition to these new algorithms, you learned about different exploration strategies to use with value-based methods: linearly and exponentially decaying epsilon-greedy, and SoftMax exploration strategies, this time in the context of function approximation. You also learned about different loss functions, which ones make more sense for reinforcement learning, and why. You learned that the Huber loss function lets you tune between MSE and MAE with a single hyper-parameter, and it is therefore one of the preferred loss functions in value-based deep reinforcement learning methods. By now you:
  • Can solve reinforcement learning problems with continuous state-spaces with algorithms that are more stable and therefore give more consistent results.
  • Have an understanding of state-of-the-art value-based deep reinforcement learning methods and are able to solve complex problems.
 