concept model in category reinforcement learning

appears as: model, models, A model, The model
Grokking Deep Reinforcement Learning

This is an excerpt from Manning's book Grokking Deep Reinforcement Learning.

Supervised learning (SL) is the task of learning from labeled data. In SL, a human decides which data to collect and how to label it. The goal in SL is to generalize. A classic example of SL is a handwritten-digit recognition application; a human gathers images with handwritten digits, labels those images, and trains a model to recognize and classify digits in images correctly. The trained model is expected to generalize and correctly classify handwritten digits in new images.
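For instance, here is a minimal sketch of that workflow using scikit-learn and its bundled handwritten-digits dataset (the library, dataset, and classifier choice are illustrative assumptions, not the book's):

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Labeled data: 8x8 images of handwritten digits with human-assigned labels 0-9.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=2000)  # a simple parametric classifier
model.fit(X_train, y_train)                # learn from the labeled examples

# Generalization: accuracy on images the model has never seen.
print("test accuracy:", model.score(X_test, y_test))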

Unsupervised learning (UL) is the task of learning from unlabeled data. Even though the data no longer needs labeling, the methods the computer uses to gather data still need to be designed by a human. The goal in UL is to compress. A classic example of UL is a customer segmentation application; a human collects customer data and trains a model to group customers into clusters. These clusters compress the information, uncovering underlying relationships among customers.
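A minimal sketch of that idea, assuming synthetic two-feature customer data and scikit-learn's KMeans (both assumptions for illustration):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled customer data: rows are customers, columns are hypothetical
# features (annual spend, purchases per year). No labels are provided.
rng = np.random.default_rng(0)
customers = np.vstack([
    rng.normal(loc=[200, 5], scale=[50, 2], size=(100, 2)),    # occasional low spenders
    rng.normal(loc=[900, 30], scale=[100, 5], size=(100, 2)),  # frequent big spenders
])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(customers)
# Each customer is compressed down to a single cluster id, exposing the
# underlying group structure.
print(kmeans.labels_[:10])
print(kmeans.cluster_centers_)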

Second, we explore algorithms that use experience samples to learn a model of the environment, a Markov Decision Process (MDP). By doing so, these methods extract the most out of the data they collect and often arrive at optimality more quickly than methods that don't. The group of algorithms that attempt to learn a model of the environment is referred to as model-based reinforcement learning.
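To make the idea concrete, here is a minimal sketch (not the book's code) of learning an MDP model from experience tuples, assuming a small discrete state and action space: transition counts give estimated transition probabilities, and running averages give estimated rewards.

from collections import defaultdict

transition_counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': count}
reward_sums = defaultdict(float)                           # (s, a, s') -> summed reward

def update_model(s, a, r, s_next):
    transition_counts[(s, a)][s_next] += 1
    reward_sums[(s, a, s_next)] += r

def estimated_model(s, a):
    counts = transition_counts[(s, a)]
    total = sum(counts.values())
    probs = {s_next: c / total for s_next, c in counts.items()}
    rewards = {s_next: reward_sums[(s, a, s_next)] / c for s_next, c in counts.items()}
    return probs, rewards  # a learned MDP the agent can plan against

# Three experience tuples (s, a, r, s') from the same state-action pair.
for experience in [(0, 1, 1.0, 2), (0, 1, 1.0, 2), (0, 1, 0.0, 3)]:
    update_model(*experience)
print(estimated_model(0, 1))  # P(s'|s=0,a=1) is roughly {2: 0.67, 3: 0.33}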

Deep Reinforcement Learning in Action

This is an excerpt from Manning's book Deep Reinforcement Learning in Action.

Deep learning models are just one of many kinds of machine learning models we can use to classify images. In general, we just need some sort of function that takes in an image and returns a class label (in this case, the label identifying which kind of animal is depicted in the image), and usually this function has a fixed set of adjustable parameters—we call these kinds of models parametric models. We start with a parametric model whose parameters are initialized to random values—this will produce random class labels for the input images. Then we use a training procedure to adjust the parameters so the function iteratively gets better and better at correctly classifying the images. At some point, the parameters will be at an optimal set of values, meaning that the model cannot get any better at the classification task. Parametric models can also be used for regression, where we try to fit a model to a set of data so we can make predictions for unseen data (figure 1.2). A more sophisticated approach might perform even better if it had more parameters or a better internal architecture.

Figure 1.2. Perhaps the simplest machine learning model is a simple linear function of the form f(x) = mx + b, with parameters m (the slope) and b (the intercept). Since it has adjustable parameters, we call it a parametric function or model. If we have some 2-dimensional data, we can start with a randomly initialized set of parameters, such as [m = 3.4, b = 0.3], and then use a training algorithm to optimize the parameters to fit the training data, in which case the optimal set of parameters is close to [m = 2, b = 1].
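The fit described in the caption can be reproduced with a few lines of gradient descent. This is a sketch under assumed synthetic data (generated so the optimum sits near m = 2, b = 1), not the book's code:

import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2 * x + 1 + rng.normal(scale=0.5, size=100)  # noisy line with m = 2, b = 1

m, b = 3.4, 0.3   # randomly initialized parameters, as in the caption
lr = 0.01         # learning rate
for _ in range(1000):
    err = (m * x + b) - y
    # Gradients of the mean squared error with respect to m and b.
    m -= lr * 2 * np.mean(err * x)
    b -= lr * 2 * np.mean(err)
print(m, b)  # ends up close to m = 2, b = 1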

Deep neural networks are popular because they are in many cases the most accurate parametric machine learning models for a given task, like image classification. This is largely due to the way they represent data. Deep neural networks have many layers (hence the “deep”), which induces the model to learn layered representations of input data. This layered representation is a form of compositionality, meaning that a complex piece of data is represented as the combination of more elementary components, and those components can be further broken down into even simpler components, and so on, until you get to atomic units.
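As a concrete (and assumed, not from the book) illustration of "many layers," here is a small PyTorch classifier for 28 x 28 images:

import torch.nn as nn

model = nn.Sequential(
    nn.Flatten(),          # image -> flat vector of raw pixels
    nn.Linear(784, 256),   # pixels -> low-level features
    nn.ReLU(),
    nn.Linear(256, 64),    # low-level features -> higher-level parts
    nn.ReLU(),
    nn.Linear(64, 10),     # parts -> class scores
)

Each Linear + ReLU layer re-represents the output of the previous one, so later layers operate on increasingly abstract components of the input.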

As we mentioned in the introduction, our goal in this chapter is to implement a model called distributed advantage actor-critic (DA2C), and we’ve discussed the “advantage actor-critic” part of the name at a conceptual level. Let’s do the same for the “distributed” part now.

Figure 5.8. The most common way to train a deep learning model is to feed a batch of data into the model, which returns a batch of predictions. We then compute the loss for each prediction and average or sum all the losses before backpropagating and updating the model parameters, which averages out the variability across the experiences in the batch. Alternatively, we can run multiple copies of the model, each taking a single experience and making a single prediction, backpropagate through each copy to get its gradients, and then sum or average the gradients before making any parameter updates.

As with other deep learning methods, we generally must use batches of data to train effectively. Training with a single example at a time introduces too much noise, and the training will likely never converge. To get batch training with Q-learning, we used an experience replay buffer from which we could randomly select batches of previous experiences. We could have used experience replay with actor-critic as well, but it is more common to use distributed training with actor-critic (and, to be clear, Q-learning can also be distributed). Distributed training is more common with actor-critic models because we often want to include a recurrent neural network (RNN) layer in the reinforcement learning model when keeping track of prior states is necessary or helpful for achieving the goal. But RNNs need a sequence of temporally related examples, whereas experience replay supplies a batch of independent experiences. We could store entire trajectories (sequences of experiences) in a replay buffer, but that just adds complexity. Instead, with distributed training, where each process runs online with its own environment, the models can easily incorporate RNNs.
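Here is a deliberately simplified, synchronous sketch of the right-hand side of figure 5.8 (the book's actual DA2C uses torch.multiprocessing with concurrently running workers, and a placeholder loss stands in here for the actor-critic loss): each worker copy of the model processes one experience, we backpropagate in each copy, then average the gradients before a single shared parameter update.

import copy
import torch
import torch.nn as nn

shared = nn.Linear(4, 2)  # stand-in for the actor-critic network
opt = torch.optim.Adam(shared.parameters(), lr=1e-3)

workers = [copy.deepcopy(shared) for _ in range(4)]
experiences = [torch.randn(4) for _ in workers]  # one fake observation per worker

grads = []
for model, obs in zip(workers, experiences):
    loss = model(obs).pow(2).sum()  # placeholder loss for illustration
    model.zero_grad()
    loss.backward()
    grads.append([p.grad.clone() for p in model.parameters()])

# Average the per-worker gradients and apply them to the shared model.
opt.zero_grad()
for shared_p, *worker_grads in zip(shared.parameters(), *grads):
    shared_p.grad = torch.stack(worker_grads).mean(dim=0)
opt.step()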

Listing 6.13. Training the models
num_generations = 25                                  # number of generations to evolve
population_size = 500                                 # individuals per generation
mutation_rate = 0.01                                  # fraction of each parameter vector to mutate
pop_fit = []                                          # records average fitness per generation
pop = spawn_population(N=population_size, size=407)   # random initial population; size matches the model's 407 parameters
for i in range(num_generations):
    pop, avg_fit = evaluate_population(pop)           # score every individual's fitness
    pop_fit.append(avg_fit)
    pop = next_generation(pop, mut_rate=mutation_rate, tournament_size=0.2)  # tournament selection + mutation
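
The listing relies on three helper functions defined earlier in the book. For context, here is a minimal self-contained sketch of what they might look like, assuming each individual is a dict holding a flat parameter vector and a fitness score, and using a toy objective in place of the book's environment rollouts:

import numpy as np

def spawn_population(N=50, size=407):
    # Each individual: a random parameter vector plus an unset fitness score.
    return [{'params': np.random.randn(size), 'fitness': 0.0} for _ in range(N)]

def evaluate_population(pop):
    # Toy fitness (push parameters toward zero); in the book, fitness comes
    # from playing episodes of the environment with each parameter vector.
    for agent in pop:
        agent['fitness'] = -np.sum(agent['params'] ** 2)
    return pop, sum(agent['fitness'] for agent in pop) / len(pop)

def next_generation(pop, mut_rate=0.01, tournament_size=0.2):
    new_pop = []
    k = int(len(pop) * tournament_size)  # contestants per tournament
    while len(new_pop) < len(pop):
        # Tournament selection: sample a subset, take the two fittest as parents.
        contestants = np.random.choice(len(pop), size=k, replace=False)
        ranked = sorted(contestants, key=lambda i: pop[i]['fitness'], reverse=True)
        p1, p2 = pop[ranked[0]]['params'], pop[ranked[1]]['params']
        cut = np.random.randint(len(p1))                 # one-point crossover
        child = np.concatenate([p1[:cut], p2[cut:]])
        mask = np.random.rand(len(child)) < mut_rate     # mutate a fraction of genes
        child[mask] = np.random.randn(int(mask.sum()))
        new_pop.append({'params': child, 'fitness': 0.0})
    return new_pop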