3 Balancing immediate and long-term goals


In this chapter:

  • You learn about the challenges of learning from sequential feedback and how to properly balance immediate and long-term goals.
  • You develop algorithms that can find the best policies of behavior in sequential decision-making problems modeled with MDPs.
  • You find the optimal policies for all environments you built MDPs for in the previous chapter.

In preparing for battle I have always found that plans are useless, but planning is indispensable.

— Dwight D. Eisenhower

United States Army five-star general and 34th President of the United States

In the last chapter, you built MDPs for the BW, BSW, and FL environments. MDPs are the engines driving RL environments. They define the problem: they describe how the agent interacts with the environment through the state and action spaces, what the agent's goal is through the reward function, how the environment reacts to the agent's actions through the transition function, and how time should impact behavior through the discount factor.
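
One concrete way to hold such an MDP in code is a nested dictionary mapping each state to each action to a list of (probability, next state, reward, done) tuples, the same convention OpenAI Gym's FrozenLake exposes through env.P. The tiny three-state walk below is only an illustrative sketch; its states, probabilities, and rewards are made up for this example and aren't one of the environments from the previous chapter.

    # Hypothetical three-state walk: state 1 is the start,
    # states 0 and 2 are terminal. Actions: 0 = Left, 1 = Right.
    P = {
        # state 0: terminal "hole"; every action stays put with no reward
        0: {0: [(1.0, 0, 0.0, True)], 1: [(1.0, 0, 0.0, True)]},
        # state 1: the start; the walk is "slippery," so an action sometimes
        # sends the agent the opposite way
        1: {0: [(0.8, 0, 0.0, True), (0.2, 2, 1.0, True)],
            1: [(0.8, 2, 1.0, True), (0.2, 0, 0.0, True)]},
        # state 2: terminal "goal"; the +1 reward is earned on the way in
        2: {0: [(1.0, 2, 0.0, True)], 1: [(1.0, 2, 0.0, True)]},
    }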

In this chapter, you'll learn about algorithms for solving MDPs. We first discuss the objective of an agent and why simple plans aren't sufficient to solve MDPs. We then cover the two fundamental algorithms for solving MDPs, both instances of a technique called dynamic programming: Value Iteration (VI) and Policy Iteration (PI).
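
As a preview of where the chapter is headed, here is a minimal sketch of Value Iteration over an MDP stored in the dictionary format shown above. It repeatedly applies a greedy one-step lookahead backup until the state-value estimates stop changing, then reads off a greedy policy. The function name and the gamma and theta defaults are illustrative assumptions, not the chapter's implementation.

    import numpy as np

    def value_iteration(P, gamma=0.99, theta=1e-10):
        # V[s]: current estimate of the value of state s
        V = np.zeros(len(P), dtype=np.float64)
        while True:
            # Q[s, a]: one-step lookahead value of taking action a in state s
            Q = np.zeros((len(P), len(P[0])), dtype=np.float64)
            for s in range(len(P)):
                for a in range(len(P[s])):
                    for prob, next_s, reward, done in P[s][a]:
                        # terminal transitions contribute no future value
                        Q[s][a] += prob * (reward + gamma * V[next_s] * (not done))
            # stop once the greedy backup changes V by less than theta
            if np.max(np.abs(V - Q.max(axis=1))) < theta:
                break
            V = Q.max(axis=1)
        # extract the greedy policy from the final action-value estimates
        pi = {s: int(np.argmax(Q[s])) for s in range(len(P))}
        return V, pi

On the toy P above, this converges after a couple of sweeps and picks action 1 (Right) in the start state, as you'd expect given the reward layout.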

3.1   The objective of a decision-making agent

3.1.1   Policies: Per-state action prescriptions

3.1.2   State-value function: What to expect from here?

3.1.3   Action-value function: What to expect from here if I do this?

3.1.4   Action-advantage function: How much better if I do that?

3.1.5   Optimality

3.2   Planning optimal sequences of actions

3.2.1   Policy Evaluation: Rating policies

3.2.2   Policy Improvement: Using ratings to get better

3.2.3   Policy Iteration: Improving upon improved behaviors

3.2.4   Value Iteration: Improving behaviors early

3.3   Summary
