concept probability in category reinforcement learning

Grokking Deep Reinforcement Learning MEAP V14 epub

This is an excerpt from Manning's book Grokking Deep Reinforcement Learning MEAP V14 epub.

The probability of the next state, given the current state and action, is independent of the history of interactions. This memoryless property of MDPs is known as the Markov property: the probability of moving from one state s to another state s’ on two separate occasions, given the same action a, is the same regardless of all previous states or actions encountered before that point.
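
Written out symbolically (a standard formulation of this property, not a quotation from the book), this says that conditioning on the full history adds nothing once the current state and action are known:

    P(S_{t+1} = s' | S_t = s, A_t = a) = P(S_{t+1} = s' | S_t = s, A_t = a, S_{t-1}, A_{t-1}, ..., S_0, A_0)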

Deep Reinforcement Learning in Action

This is an excerpt from Manning's book Deep Reinforcement Learning in Action.

  • Rewards are signals produced by the environment that indicate the relative success of taking an action in a given state. An expected reward is a statistical concept that informally refers to the long-term average value of some random variable X (in our case, the reward), denoted E[X]. For example, in the n-armed bandit case, E[R|a] (the expected reward given action a) is the long-term average reward of taking each of the n actions. If we knew the probability distribution over the actions, we could calculate the precise value of the expected reward for a game of N plays as E[R|a] = N Σ_i p_i · r, where N is the number of plays of the game, p_i refers to the probability of action a_i, and r refers to the maximum possible reward. (A short simulation illustrating this expected-reward calculation appears after this list.)
  • Probability is actually a very rich and even controversial topic in its own right. There are varying philosophical opinions on exactly what probability means. To some people, probability means this: if you were to flip a coin a very large number of times (ideally an infinite number of times, mathematically speaking), the probability of a fair coin turning up heads is equal to the proportion of heads in that infinitely long sequence of flips. That is, if we flip a fair coin 1,000,000 times, we would expect about half of the flips to be heads and the other half tails, so the probability is equal to that proportion. This is the frequentist interpretation of probability, since probability is interpreted as the long-term frequency of some event repeated many times.

    Another school of thought interprets probability only as a degree of belief, a subjective assessment of how much someone can predict an event given the knowledge they currently possess. This degree of belief is often called a credence. The probability of a fair coin turning up heads is 0.5 or 50% because, given what we know about the coin, we don’t have any reason to predict heads more than tails, or tails more than heads, so we split our belief evenly across the two possible outcomes. Hence, anything that we can’t predict deterministically (i.e., with probability 0 or 1, and nothing in between) results from a lack of knowledge.

    You’re free to interpret probabilities however you want, since it won’t affect our calculations, but in this book we tend to implicitly use the credence interpretation of probability. For our purposes, applying a probability distribution over the set of actions in Gridworld, A = {up, down, left, right}, means we’re assigning a degree of belief (a real number between 0 and 1) to each action in the set such that all the probabilities sum to 1. We interpret these probabilities as the probability that an action is the best action to maximize the expected rewards, given that we’re in a certain state.

    A naive approach might be to make a target action distribution, [0, 0, 0, 1], so that our gradient descent will move the probabilities from [0.25, 0.25, 0.25, 0.25] closer to [0, 0, 0, 1], maybe ending up as [0.167, 0.167, 0.167, 0.5] (see figure 4.6). This is something we often do in the supervised learning realm when we are training a softmax-based image classifier. But in that case there is a single correct classification for each image, and there is no temporal association between predictions. In our RL case, we want more control over how we make these updates. First, we want to make small, smooth updates, because we want to maintain some stochasticity in our action sampling to adequately explore the environment. Second, we want to be able to weight how much credit we assign to each action, including actions taken earlier in the episode. Let’s review some more notation before diving into these two problems. (A minimal sketch of this kind of probability-nudging update appears after this list.)

    Figure 4.6. Once an action is sampled from the policy network’s probability distribution, it produces a new state and reward. The reward signal is used to reinforce the action that was taken, that is, it increases the probability of that action given the state if the reward is positive, or it decreases the probability if the reward is negative. Notice that we only received information about action 3 (element 4), but since the probabilities must sum to 1, we have to lower the probabilities of the other actions.
  • Probability is a way of assigning degrees of belief about different possible outcomes in an unpredictable process. Each possible outcome is assigned a probability in the interval [0,1] such that all probabilities for all outcomes sum to 1. If we believe a particular outcome is more likely than another, we assign it a higher probability. If we receive new information, we can change our assignments of probabilities.
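
As a concrete illustration of the expected-reward idea from the first excerpt above, here is a minimal simulation sketch (not code from either book; the arm probabilities, the maximum reward of 10, and the helper name pull are made-up for illustration). Each arm pays the maximum reward r with probability p_i, so its empirical average reward over many plays should approach p_i · r:

    import random

    # Hypothetical 3-armed bandit: arm i pays the maximum reward with
    # probability probs[i], otherwise it pays nothing. These numbers are
    # illustrative only, not taken from the book.
    probs = [0.2, 0.5, 0.7]
    max_reward = 10
    N = 100_000  # number of plays per arm

    def pull(arm):
        """Return the reward for pulling the given arm once."""
        return max_reward if random.random() < probs[arm] else 0

    for arm, p in enumerate(probs):
        avg = sum(pull(arm) for _ in range(N)) / N
        # The empirical average approaches the expected per-play reward p * max_reward.
        print(f"arm {arm}: empirical average {avg:.2f}, expected {p * max_reward:.2f}")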
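
The probability-nudging update described in the Gridworld excerpt (figure 4.6) can also be sketched briefly. This is a hypothetical, minimal example rather than the book's Gridworld code: a tiny linear policy "network" produces a softmax distribution over the four actions, and a single REINFORCE-style step scaled by the reward raises the sampled action's probability; because the outputs must sum to 1, the other actions' probabilities necessarily fall:

    import torch

    torch.manual_seed(0)

    # Hypothetical policy "network": a single linear layer mapping a 4-dimensional
    # state to 4 action logits (up, down, left, right).
    policy = torch.nn.Linear(4, 4)
    optimizer = torch.optim.Adam(policy.parameters(), lr=0.1)

    state = torch.randn(4)                       # made-up state vector
    probs = torch.softmax(policy(state), dim=0)  # action probabilities, sum to 1
    print("before:", probs.detach().numpy())

    action = torch.multinomial(probs, 1).item()  # sample an action from the distribution
    reward = 1.0                                 # pretend the environment gave a positive reward

    # REINFORCE-style update: minimize -reward * log pi(action | state).
    loss = -reward * torch.log(probs[action])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    probs_after = torch.softmax(policy(state), dim=0)
    print("after: ", probs_after.detach().numpy())
    # The sampled action's probability goes up; since the probabilities must
    # sum to 1, the probabilities of the other actions go down.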