Concept: target network (category: reinforcement learning)

This is an excerpt from Manning's book Grokking Deep Reinforcement Learning MEAP V14 epub.
Figure 9.2 Q-function optimization without a target network
Figure 9.3 Q-function approximation with a target network
By using a target network to fix targets, we mitigate the issue of “chasing your own tail” by artificially creating a series of small supervised learning problems that are presented to the agent sequentially. Our targets stay fixed for as many steps as we freeze the target network. This improves our chances of convergence, not to the optimal values, because such guarantees don’t exist with non-linear function approximation, but convergence in general. More importantly, it substantially reduces the chance of divergence, which is not uncommon in value-based deep reinforcement learning methods.
Show Me The Math: Target network gradient update (Figure 9.4)
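The equation in this box appears as an image in the original and isn’t reproduced in the excerpt. As a sketch of the standard form of the update, the gradient uses frozen target-network weights $\theta^{-}$ inside the bootstrapped target while differentiating only the online network’s estimate:

$$
\nabla_{\theta_i} L_i(\theta_i) = \mathbb{E}_{(s, a, r, s')}\Big[\big(r + \gamma \max_{a'} Q(s', a'; \theta^{-}) - Q(s, a; \theta_i)\big)\,\nabla_{\theta_i} Q(s, a; \theta_i)\Big]
$$

Because $\theta^{-}$ is held fixed, no gradient flows through the target term; it acts like the label of a small supervised regression problem until the next target-network sync.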
It is important to note that in practice, we don’t have two “networks,” but instead, we have two instances of the neural network weights. We use the same model architecture and frequently update the weights of the target network to match the weights of the online network, which is the network we optimize on every step. “Frequently” here means something different depending on the problem, unfortunately. It is common to freeze the target network weights for 10 to 10,000 steps at a time, again depending on the problem (that’s time steps, not episodes; be careful there). If you are using a convolutional neural network, such as the one you’d use for learning Atari games, then a 10,000-step frequency is the norm. But for more straightforward problems, such as the cart-pole environment, 10 to 20 steps is more appropriate.
By using target networks, we prevent the training process from spiraling out of control, because we fix the targets for multiple time steps, allowing the online network weights to move consistently towards them before an update changes the optimization problem and a new one is set. Target networks stabilize training, but they also slow down learning, because we are no longer training on up-to-date values; the frozen weights of the target network can lag behind by up to 10,000 steps at a time. It is essential to balance stability and speed when tuning this hyperparameter.
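To make the freeze-and-sync pattern concrete, here is a minimal PyTorch sketch. The names online_net, target_net, and sync_every are hypothetical placeholders, not code from the book, and the tiny network is only illustrative:

```python
import copy
import torch

# Hypothetical online network; any nn.Module of the right shape works.
online_net = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.ReLU(), torch.nn.Linear(64, 2))
target_net = copy.deepcopy(online_net)  # frozen copy of the weights
sync_every = 15                         # ~10-20 steps for cart-pole; ~10,000 for Atari-scale convnets

for step in range(1, 100_001):
    # ... compute TD targets with target_net, take an optimizer step on online_net ...
    if step % sync_every == 0:
        # Hard update: overwrite the target weights with the online weights.
        target_net.load_state_dict(online_net.state_dict())
```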
10.1.7 Continuously updating the target network
Currently, our agent uses a target network that can be outdated for several steps before it gets a big weight update when syncing with the online network. In the cart-pole environment, that gap is merely ~15 steps, but in more complex environments, it can grow to tens of thousands of steps.
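As the section title suggests, the alternative is to update the target network continuously rather than in large, infrequent jumps. The usual way to do this is Polyak averaging (soft updates), where at every step the target weights are mixed slightly towards the online weights. A minimal sketch, reusing the hypothetical online_net/target_net pair from above and an assumed mixing factor tau:

```python
import torch

def soft_update(target_net: torch.nn.Module,
                online_net: torch.nn.Module,
                tau: float = 0.005) -> None:
    """Polyak averaging: target <- tau * online + (1 - tau) * target, applied every step."""
    with torch.no_grad():
        for target_param, online_param in zip(target_net.parameters(),
                                              online_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * online_param)
```

With a small tau, the target values change slowly and smoothly, so the agent no longer has to choose between fully stale and fully up-to-date targets.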

This is an excerpt from Manning's book Deep Reinforcement Learning in Action.
This is not just a theoretical issue; it’s something that DeepMind observed in their own training. The solution they devised is to duplicate the Q-network into two copies, each with its own model parameters: the “regular” Q-network and a copy called the target network (symbolically denoted Q̂, read “Q hat”). The target network is identical to the Q-network at the beginning, before any training, but its own parameters lag behind the regular Q-network in terms of how they’re updated.
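In PyTorch, creating that second copy can be as simple as deep-copying the model. This is a minimal sketch, not code shown in this excerpt; the stand-in network is illustrative, and the name model2 mirrors the target network used in listing 3.8 below:

```python
import copy
import torch

# Stand-in Q-network; in the book, `model` is the Gridworld Q-network defined earlier in the chapter.
model = torch.nn.Sequential(
    torch.nn.Linear(64, 150), torch.nn.ReLU(),
    torch.nn.Linear(150, 100), torch.nn.ReLU(),
    torch.nn.Linear(100, 4))

model2 = copy.deepcopy(model)               # target network: starts identical to the Q-network
model2.load_state_dict(model.state_dict())  # redundant after deepcopy, but makes the sync explicit
```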
Let’s run through the sequence of events again, with the target network in play (we’ll leave out the details of experience replay); the general flow is summarized in figure 3.15 and implemented in listing 3.8.
Figure 3.15. This is the general overview for Q-learning with a target network. It’s a fairly straightforward extension of the normal Q-learning algorithm, except that you have a second Q-network called the target network, whose predicted Q values are used to construct the targets for training the main Q-network. The target network’s parameters are not trained, but they are periodically synchronized with the Q-network’s parameters. The idea is that using the target network’s Q values to train the Q-network will improve the stability of the training.
Listing 3.8. DQN with experience replay and target network
import random
import numpy as np
import torch
from collections import deque
from IPython.display import clear_output

# Gridworld, action_set, model, model2, loss_fn, optimizer, gamma, and epsilon
# are defined earlier in the chapter; model2 is the target network copy of model.

epochs = 5000
losses = []
mem_size = 1000
batch_size = 200
replay = deque(maxlen=mem_size)
max_moves = 50
h = 0
sync_freq = 500                                      #1 Update frequency for synchronizing the target network with the main model
j = 0
for i in range(epochs):
    game = Gridworld(size=4, mode='random')
    state1_ = game.board.render_np().reshape(1,64) + np.random.rand(1,64)/100.0
    state1 = torch.from_numpy(state1_).float()
    status = 1
    mov = 0
    while(status == 1):
        j += 1
        mov += 1
        qval = model(state1)
        qval_ = qval.data.numpy()
        if (random.random() < epsilon):
            action_ = np.random.randint(0,4)
        else:
            action_ = np.argmax(qval_)
        action = action_set[action_]
        game.makeMove(action)
        state2_ = game.board.render_np().reshape(1,64) + np.random.rand(1,64)/100.0
        state2 = torch.from_numpy(state2_).float()
        reward = game.reward()
        done = True if reward > 0 else False
        exp = (state1, action_, reward, state2, done)
        replay.append(exp)
        state1 = state2

        if len(replay) > batch_size:
            minibatch = random.sample(replay, batch_size)
            state1_batch = torch.cat([s1 for (s1,a,r,s2,d) in minibatch])
            action_batch = torch.Tensor([a for (s1,a,r,s2,d) in minibatch])
            reward_batch = torch.Tensor([r for (s1,a,r,s2,d) in minibatch])
            state2_batch = torch.cat([s2 for (s1,a,r,s2,d) in minibatch])
            done_batch = torch.Tensor([d for (s1,a,r,s2,d) in minibatch])
            Q1 = model(state1_batch)
            with torch.no_grad():
                Q2 = model2(state2_batch)            #2 Target network computes the next-state Q values
            Y = reward_batch + gamma * ((1 - done_batch) * torch.max(Q2, dim=1)[0])
            X = Q1.gather(dim=1, index=action_batch.long().unsqueeze(dim=1)).squeeze()
            loss = loss_fn(X, Y.detach())
            print(i, loss.item())
            clear_output(wait=True)
            optimizer.zero_grad()
            loss.backward()
            losses.append(loss.item())
            optimizer.step()
            if j % sync_freq == 0:                   #3 Copy the main model's parameters into the target network
                model2.load_state_dict(model.state_dict())
        if reward != -1 or mov > max_moves:
            status = 0
            mov = 0
losses = np.array(losses)