concept algorithm in category reinforcement learning

This is an excerpt from Manning's book Grokking Deep Reinforcement Learning MEAP V14 epub.
DRL is about algorithms, methods, techniques, tricks, and so on, so there is no point in rewriting a “NumPy” or a “PyTorch” library. But, also, in this book, we write DRL algorithms from scratch; I’m not teaching you how to use a DRL library, such as Keras-RL, Baselines, or RLlib. I want you to learn DRL, and therefore we write DRL code. In the years that I’ve been teaching RL, I’ve noticed that those who write RL code are more likely to understand RL. Now, this is not a book on PyTorch either; there is no separate PyTorch review or anything like that, just PyTorch code that I explain as we move along.
Up until this chapter, you have studied, both in isolation and in interplay, learning from two of the three types of feedback a reinforcement learning agent must deal with: sequential, evaluative, and sampled. In chapter 2, you learned to represent sequential decision-making problems using a mathematical framework known as Markov Decision Processes (MDPs). In chapter 3, you learned how to solve these problems with algorithms that extract policies from those MDPs. In chapter 4, you learned to solve simple control problems, called Multi-armed Bandits, which are multi-option, single-choice decision-making problems in which the MDP representation is not available to the agent. Finally, in chapter 5, we mixed these two types of problems; that is, we dealt with environments that are both sequential and uncertain, but we only learned to estimate value functions. We solved what is called the Prediction Problem, which is basically learning to evaluate policies, learning to predict returns.
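To make the Prediction Problem concrete, here is a minimal sketch, not taken from either book, of TD(0) policy evaluation. It assumes a Gym-style environment with discrete states and a fixed `policy(state)` callable; both names are placeholders.

```python
import numpy as np

def td0_prediction(env, policy, gamma=0.99, alpha=0.1, n_episodes=500):
    """Estimate V(s) for a fixed policy by bootstrapping on one-step targets."""
    V = np.zeros(env.observation_space.n)
    for _ in range(n_episodes):
        state, done = env.reset(), False
        while not done:
            action = policy(state)
            next_state, reward, done, _ = env.step(action)
            # The TD target bootstraps on the current estimate of the next state
            td_target = reward + gamma * V[next_state] * (not done)
            V[state] += alpha * (td_target - V[state])
            state = next_state
    return V
```

The output is an estimate of the return you can expect from each state under the given policy, which is exactly the prediction task described above.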
In this chapter, we will introduce agents that solve the Control Problem, which we get simply by changing two things. First, instead of estimating state-value functions, V(s), we estimate action-value functions, Q(s, a). The main reason for this is that Q-functions, unlike V-functions, let us see the value of actions without having to use an MDP. Second, after we obtain these Q-value estimates, we use them to improve the policies. This is very similar to what we did in the policy iteration algorithm: we evaluate, we improve, then evaluate the improved policy, then improve on this improved policy, and so on. As I mentioned in chapter 2, this pattern is called Generalized Policy Iteration (GPI), and it gives us an architecture that virtually any reinforcement learning algorithm, including state-of-the-art deep reinforcement learning agents, fits under.
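As a hedged illustration of GPI with Q-functions, the sketch below interleaves the two steps: acting epsilon-greedily with respect to the current Q estimates (policy improvement) and nudging those estimates toward one-step targets (policy evaluation). The Gym-style environment interface and the hyperparameters are assumptions for illustration, not the book's exact code.

```python
import numpy as np

def gpi_control(env, gamma=0.99, alpha=0.1, epsilon=0.1, n_episodes=2000):
    """Generalized policy iteration with Q-values (a SARSA-style sketch)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))

    def epsilon_greedy(state):
        # Policy improvement: act (mostly) greedily with respect to current Q
        if np.random.random() < epsilon:
            return env.action_space.sample()
        return int(np.argmax(Q[state]))

    for _ in range(n_episodes):
        state, done = env.reset(), False
        action = epsilon_greedy(state)
        while not done:
            next_state, reward, done, _ = env.step(action)
            next_action = epsilon_greedy(next_state)
            # Policy evaluation: nudge Q(s, a) toward the one-step TD target
            td_target = reward + gamma * Q[next_state, next_action] * (not done)
            Q[state, action] += alpha * (td_target - Q[state, action])
            state, action = next_state, next_action
    return Q
```

Because the greedy action is always recomputed from the latest Q estimates, evaluation and improvement chase each other exactly as the GPI pattern describes.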
Figure 13.8 Some model-based reinforcement learning algorithms to have in mind

This is an excerpt from Manning's book Deep Reinforcement Learning in Action.
Deep learning algorithms, which are also called artificial neural networks, are relatively simple mathematical functions and mostly just require an understanding of vectors and matrices. Training a neural network, however, requires an understanding of the basics of calculus, namely the derivative. The fundamentals of applied deep learning therefore require only knowing how to multiply vectors and matrices and take the derivative of multivariable functions, which we’ll review here. Theoretical machine learning refers to the field that rigorously studies the properties and behavior of machine learning algorithms and yields new approaches and algorithms. Theoretical machine learning involves advanced graduate-level mathematics that covers a wide variety of mathematical disciplines that are outside the scope of this book. In this book we only utilize informal mathematics in order to achieve our practical aims, not rigorous proof-based mathematics.
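As a quick refresher, the throwaway PyTorch snippet below (not from the book) shows the two operations the paragraph mentions: a matrix-vector product and a derivative computed by backpropagation.

```python
import torch

# A tiny "layer": multiply a weight matrix by an input vector
W = torch.randn(3, 4, requires_grad=True)   # 3x4 weight matrix
x = torch.randn(4)                          # 4-element input vector
y = W @ x                                   # matrix-vector product, shape (3,)

# Derivative of a scalar function of W, computed by backpropagation
loss = y.sum()
loss.backward()
print(W.grad.shape)                         # gradient has the same shape as W: (3, 4)
```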
In 2013, DeepMind published a paper entitled “Playing Atari with Deep Reinforcement Learning” that outlined their new approach to an old algorithm, which gave them enough performance to play six of seven Atari 2600 games at record levels. Crucially, the algorithm they used only relied on analyzing the raw pixel data from the games, just like a human would. This paper really set off the field of deep reinforcement learning.
Imagine that our algorithm is training on (learning Q values for) game 1 of figure 3.12. The player is placed between the pit and the goal such that the goal is on the right and the pit is on the left. Using an epsilon-greedy strategy, the player takes a random move and by chance steps to the right and hits the goal. Great! The algorithm will try to learn that this state-action pair is associated with a high value by updating its weights in such a way that the output will more closely match the target value (i.e. via backpropagation).
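A minimal sketch of that weight update is shown below. The linear model, the 64-element flattened state, and the "move right" action index are hypothetical stand-ins for the book's actual Gridworld training loop.

```python
import torch
import torch.nn.functional as F

model = torch.nn.Linear(64, 4)       # placeholder Q network: flattened state -> 4 action values
state = torch.randn(1, 64)           # stand-in for the "goal right, pit left" state
reward, gamma, next_q_max = 1.0, 0.9, 0.0   # terminal step: +1 reward, nothing to bootstrap
action = 3                           # assumed index of the "move right" action

q_values = model(state)              # predicted Q values for the four moves
target = q_values.detach().clone()
target[0, action] = reward + gamma * next_q_max   # target value for the action taken
loss = F.mse_loss(q_values, target)
loss.backward()                      # backpropagation pulls Q(state, right) toward +1
```

In game 2 of figure 3.12, a nearly identical state pushes the same weights toward a target of -1 instead, which is the collision behind catastrophic forgetting.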
Figure 3.12. The idea of catastrophic forgetting is that when two game states are very similar and yet lead to very different outcomes, the Q function will get “confused” and won’t be able to learn what to do. In this example, the catastrophic forgetting happens because the Q function learns from game 1 that moving right leads to a +1 reward, but in game 2, which looks very similar, it gets a reward of –1 after moving right. As a result, the algorithm forgets what it previously learned about game 1, resulting in essentially no significant learning at all.
We trained for 5,000 epochs this time, since it’s a more difficult game, but otherwise the Q-network model is the same as before. When we test the algorithm, it seems to play most of the games correctly. We wrote an additional testing script to see what percentage of games it wins out of 1,000 plays.
Listing 3.6. Testing the performance with experience replay
```python
max_games = 1000
wins = 0
for i in range(max_games):
    win = test_model(model, mode='random', display=False)
    if win:
        wins += 1
win_perc = float(wins) / float(max_games)
print("Games played: {0}, # of wins: {1}".format(max_games, wins))
print("Win percentage: {}".format(win_perc))
```

When we run listing 3.6 on our trained model (trained for 5,000 epochs), we get about 90% accuracy. Your accuracy may be slightly better or worse. This certainly suggests it has learned something about how to play the game, but it’s not exactly what we would expect if the algorithm really knew what it was doing (although you could probably improve the accuracy with a much longer training time). Once you actually know how to play, you should be able to win every single game.
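For reference, the experience replay mentioned in the listing's caption can be sketched as a simple buffer of past transitions; the container size, batch size, and helper names below are illustrative assumptions, not the book's code.

```python
from collections import deque
import random

replay = deque(maxlen=1000)   # holds (state, action, reward, next_state, done) tuples
batch_size = 200

def store(experience):
    replay.append(experience)

def sample_batch():
    # Training on a random mini-batch of old transitions breaks up the runs of
    # look-alike, correlated states that cause catastrophic forgetting.
    return random.sample(replay, min(batch_size, len(replay)))
```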
The Q function must be learned from data. The Q function has to learn how to make accurate Q value predictions of states. The Q function could be anything really—anything from an unintelligent database to a complex deep learning algorithm. Since deep learning is the best class of learning algorithms we have at the moment, we employed neural networks as our Q functions. This means that “learning the Q function” is the same as training a neural network with backpropagation.
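As a minimal sketch of "a neural network as the Q function", here is a small PyTorch model for Gridworld; the layer sizes and the 64-element flattened state are assumptions for illustration.

```python
import torch

q_network = torch.nn.Sequential(
    torch.nn.Linear(64, 150),
    torch.nn.ReLU(),
    torch.nn.Linear(150, 100),
    torch.nn.ReLU(),
    torch.nn.Linear(100, 4),     # one Q value per action: up, down, left, right
)

state = torch.randn(1, 64)       # a (batched) flattened Gridworld state
q_values = q_network(state)      # "querying" the Q function for the value of each action
```

Training this network with backpropagation against the targets described earlier is what "learning the Q function" amounts to.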
One important concept about Q-learning that we held back until now is that it is an off-policy algorithm, in contrast to an on-policy algorithm. You already know what a policy is from the last chapter: it’s the strategy an algorithm uses to maximize rewards over time. If a human is learning to play Gridworld, they might employ a policy that first scouts all possible paths toward the goal and then selects the one that is shortest. Another policy might be to randomly take actions until you land on the goal.
Saying that a reinforcement learning algorithm like Q-learning is off-policy means that the choice of policy does not affect its ability to learn accurate Q values. Indeed, our Q-network could learn accurate Q values even if we selected actions at random; eventually it would experience a number of winning and losing games and infer the values of states and actions. Of course, this is terribly inefficient, but the policy matters only insofar as it helps us learn with the least amount of data. In contrast, an on-policy algorithm explicitly depends on the choice of policy, or directly aims at learning a policy from the data. In other words, in order to train our DQN we need to collect data (experiences) from the environment, and we could do this using any policy, so DQN is off-policy. An on-policy algorithm, in contrast, learns a policy while simultaneously using that same policy to collect the experiences used for training.
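The difference between the two kinds of algorithms shows up directly in the update target. The numbers below are placeholders, and SARSA is named only as a standard example of an on-policy method; it is not covered in this excerpt.

```python
import numpy as np

gamma = 0.9
Q_next = np.array([0.2, 0.8, -0.1, 0.4])   # Q values of the next state, one per action
reward = 0.0

# Off-policy (Q-learning): bootstrap from the best next action, no matter which
# action the behavior policy (even a purely random one) actually takes next.
q_learning_target = reward + gamma * np.max(Q_next)

# On-policy (e.g., SARSA): bootstrap from the action the current policy really
# takes next, coupling the learned values to the data-collecting policy.
next_action = 3                             # whatever the policy happened to choose
sarsa_target = reward + gamma * Q_next[next_action]
```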