Concept: objective (category: reinforcement learning)

This is an excerpt from Manning's book Grokking Deep Reinforcement Learning (MEAP V14).
The objective of a reinforcement learning agent is to maximize the expected return: the total reward accumulated over an episode, averaged across many episodes. For this, agents must use policies, which can be thought of as universal plans. Policies prescribe actions for states. They can be deterministic, meaning they return a single action per state, or stochastic, meaning they return a probability distribution over actions. To obtain policies, agents usually keep track of several summary values. The main ones are the state-value, action-value, and action-advantage functions.
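As a rough illustration (not code from the book), a deterministic policy can be represented as a plain state-to-action mapping, a stochastic policy as a distribution over actions per state, and the summary values as arrays indexed by states or state-action pairs. The tiny state and action sets below are hypothetical.

```python
import numpy as np

# Hypothetical 3-state, 2-action problem, purely for illustration.
states, actions = [0, 1, 2], [0, 1]

# Deterministic policy: each state maps to exactly one action.
deterministic_pi = {0: 1, 1: 0, 2: 1}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_pi = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.2, 0.8]}

def sample_action(pi, state, rng=np.random.default_rng()):
    """Draw an action from a stochastic policy."""
    return rng.choice(actions, p=pi[state])

# Summary values an agent might track:
V = np.zeros(len(states))                  # state-value function   V(s)
Q = np.zeros((len(states), len(actions)))  # action-value function  Q(s, a)
A = Q - V[:, None]                         # action-advantage       A(s, a) = Q(s, a) - V(s)
```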

This is an excerpt from Manning's book Deep Reinforcement Learning in Action.
Figure 1.8. The standard framework for RL algorithms. The agent takes an action in the environment, such as moving a chess piece, which then updates the state of the environment. For every action it takes, it receives a reward (e.g., +1 for winning the game, –1 for losing the game, 0 otherwise). The RL algorithm repeats this process with the objective of maximizing rewards in the long term, and it eventually learns how the environment works.
A reward is a positive or negative signal given to an agent by the environment after it takes an action. The rewards are the only learning signals the agent is given. The objective of an RL algorithm (i.e., the agent) is to maximize rewards.
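To make the loop in figure 1.8 concrete, here is a minimal sketch of the agent-environment interaction cycle. It uses the Gymnasium-style reset/step API and a random-action placeholder policy; the specific environment and the random agent are assumptions for illustration, not the book's code.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")  # any episodic environment works here

total_reward = 0.0
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # placeholder "policy": act at random
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # the reward is the only learning signal
    done = terminated or truncated

print(f"Return for this episode: {total_reward}")
```

A learning algorithm would replace the random action with one chosen by a policy and use the observed rewards to improve that policy over many episodes.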
Mathematically, what we’ve described is correct. But due to limited numerical precision, we need to adjust this formula to stabilize training. One problem is that probabilities are bounded between 0 and 1 by definition, so the range of values the optimizer can operate over is limited and small. Sometimes probabilities are extremely tiny or very close to 1, and this runs into numerical issues when optimizing on a computer with limited numerical precision. If we instead use a surrogate objective, namely –log π_s(a|θ) (where log is the natural logarithm), we have an objective with a larger “dynamic range” than raw probability space, since the log of a probability ranges over (–∞, 0], and this makes the log probability easier to work with numerically. Moreover, logarithms have the nice property that log(a × b) = log(a) + log(b), which means that when we need to multiply probabilities, we can instead add their log probabilities, and a sum is more numerically stable than a product of many small numbers. If we set our objective as –log π_s(a|θ) instead of 1 – π_s(a|θ), our loss still abides by the intuition that the loss function approaches 0 as π_s(a|θ) approaches 1. Our gradients will be tuned to try to increase π_s(a|θ) toward 1, where a = action 3 in our running example.
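A rough PyTorch-style sketch of this surrogate objective, assuming a tiny hypothetical policy network, a placeholder state tensor, and action index 3 as the action we want to reinforce; the book's actual training loop differs.

```python
import torch

torch.manual_seed(0)

# Hypothetical tiny policy network: 4-dimensional state in, 4 action probabilities out.
policy = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

state = torch.randn(4)   # placeholder state
action = 3               # the action we want to reinforce (the running example)

probs = policy(state)                 # pi_s(.|theta), each entry in (0, 1)
loss = -torch.log(probs[action])      # surrogate objective -log pi_s(a|theta)
# loss -> 0 as probs[action] -> 1, so its gradient pushes probs[action] toward 1.

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note that the log turns the bounded probability into a quantity on [0, ∞), which gives the optimizer a much wider range of loss values to work with than 1 – π_s(a|θ) would.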