Concept: objective (category: reinforcement learning)

This is an excerpt from Manning's book Grokking Deep Reinforcement Learning (MEAP V14).
The objective of a reinforcement learning agent is to maximize the expected return: the total reward accumulated over an episode, averaged across many episodes. For this, agents must use policies, which can be thought of as universal plans. Policies prescribe actions for states. They can be deterministic, meaning they return a single action per state, or stochastic, meaning they return a probability distribution over actions. To obtain policies, agents usually keep track of several summary values. The main ones are the state-value, action-value, and action-advantage functions.
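As a rough illustration (not code from the book), a deterministic policy can be represented as a plain state-to-action mapping, a stochastic policy as a distribution over actions per state, and the summary values as arrays indexed by states or state-action pairs. The tiny state and action sets below are hypothetical.

```python
import numpy as np

# Hypothetical 3-state, 2-action problem, purely for illustration.
states, actions = [0, 1, 2], [0, 1]

# Deterministic policy: each state maps to exactly one action.
deterministic_pi = {0: 1, 1: 0, 2: 1}

# Stochastic policy: each state maps to a probability distribution over actions.
stochastic_pi = {0: [0.9, 0.1], 1: [0.5, 0.5], 2: [0.2, 0.8]}

def sample_action(pi, state, rng=np.random.default_rng()):
    """Draw an action from a stochastic policy."""
    return rng.choice(actions, p=pi[state])

# Summary values an agent might track:
V = np.zeros(len(states))                  # state-value function   V(s)
Q = np.zeros((len(states), len(actions)))  # action-value function  Q(s, a)
A = Q - V[:, None]                         # action-advantage       A(s, a) = Q(s, a) - V(s)
```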

This is an excerpt from Manning's book Deep Reinforcement Learning in Action.
Figure 1.8. The standard framework for RL algorithms. The agent takes an action in the environment, such as moving a chess piece, which then updates the state of the environment. For every action it takes, it receives a reward (e.g., +1 for winning the game, –1 for losing the game, 0 otherwise). The RL algorithm repeats this process with the objective of maximizing rewards in the long term, and it eventually learns how the environment works.
A reward is a positive or negative signal given to an agent by the environment after it takes an action. The rewards are the only learning signals the agent is given. The objective of an RL algorithm (i.e., the agent) is to maximize rewards.
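To make the loop in figure 1.8 concrete, here is a minimal sketch of the agent-environment interaction cycle. It uses the Gymnasium-style reset/step API and a random-action placeholder policy; the specific environment and the random agent are assumptions for illustration, not the book's code.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")  # any episodic environment works here

total_reward = 0.0
obs, info = env.reset()
done = False
while not done:
    action = env.action_space.sample()  # placeholder "policy": act at random
    obs, reward, terminated, truncated, info = env.step(action)
    total_reward += reward              # the reward is the only learning signal
    done = terminated or truncated

print(f"Return for this episode: {total_reward}")
```

A learning algorithm would replace the random action with one chosen by a policy and use the observed rewards to improve that policy over many episodes.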
Mathematically, what we’ve described is correct. But due to limited numerical precision, we need to adjust this formula to stabilize training. One problem is that probabilities are bounded between 0 and 1 by definition, so the range of values the optimizer can operate over is limited and small. Sometimes probabilities are extremely tiny or very close to 1, and this runs into numerical issues when optimizing on a computer with limited numerical precision. If we instead use a surrogate objective, namely –log π_s(a|θ) (where log is the natural logarithm), we have an objective with a larger “dynamic range” than raw probability space, since the log of a probability ranges over (–∞, 0], and this makes the log probability easier to work with numerically. Moreover, logarithms have the nice property that log(a × b) = log(a) + log(b), which means that when we need to multiply probabilities, we can instead add their log probabilities, and a sum is more numerically stable than a product of many small numbers. If we set our objective as –log π_s(a|θ) instead of 1 – π_s(a|θ), our loss still abides by the intuition that the loss function approaches 0 as π_s(a|θ) approaches 1. Our gradients will be tuned to try to increase π_s(a|θ) toward 1, where a = action 3 in our running example.
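A rough PyTorch-style sketch of this surrogate objective, assuming a tiny hypothetical policy network, a placeholder state tensor, and action index 3 as the action we want to reinforce; the book's actual training loop differs.

```python
import torch

torch.manual_seed(0)

# Hypothetical tiny policy network: 4-dimensional state in, 4 action probabilities out.
policy = torch.nn.Sequential(torch.nn.Linear(4, 4), torch.nn.Softmax(dim=-1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-2)

state = torch.randn(4)   # placeholder state
action = 3               # the action we want to reinforce (the running example)

probs = policy(state)                 # pi_s(.|theta), each entry in (0, 1)
loss = -torch.log(probs[action])      # surrogate objective -log pi_s(a|theta)
# loss -> 0 as probs[action] -> 1, so its gradient pushes probs[action] toward 1.

optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Note that the log turns the bounded probability into a quantity on [0, ∞), which gives the optimizer a much wider range of loss values to work with than 1 – π_s(a|θ) would.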