3 Balancing immediate and long-term goals
In this chapter
- You will learn about the challenges of learning from sequential feedback and how to properly balance immediate and long-term goals.
- You will develop algorithms that can find the best policies of behavior in sequential decision-making problems modeled with MDPs.
- You will find the optimal policies for all environments for which you built MDPs in the previous chapter.
In preparing for battle I have always found that plans are useless, but planning is indispensable.
— Dwight D. Eisenhower, United States Army five-star general and 34th President of the United States
In the last chapter, you built an MDP for the BW, BSW, and FL environments. MDPs are the engines that drive RL environments. They define the problem: they describe how the agent interacts with the environment through the state and action spaces, the agent's goal through the reward function, how the environment reacts to the agent's actions through the transition function, and how time should impact behavior through the discount factor.
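To make these components concrete, here is a minimal sketch of how an MDP can be encoded in Python. It uses a gym-style nested dictionary, `P[state][action] = [(probability, next_state, reward, done), ...]`, which packs the transition function and reward function together; the two-state environment itself is a hypothetical example, not one of the environments built in the previous chapter.

```python
# Hypothetical two-state MDP in the gym-style format:
# P[state][action] = [(probability, next_state, reward, done), ...]
P = {
    0: {  # non-terminal state
        0: [(0.8, 0, 0.0, False), (0.2, 1, 1.0, True)],  # action 0
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)],  # action 1
    },
    1: {  # terminal state: every action self-loops with no reward
        0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)],
    },
}

gamma = 0.99  # discount factor: how time should impact behavior

# Sanity check: transition probabilities out of each
# state-action pair must sum to 1.
for s in P:
    for a in P[s]:
        assert abs(sum(p for p, _, _, _ in P[s][a]) - 1.0) < 1e-9
```

The state and action spaces are implicit in the dictionary keys; the `done` flag marks transitions into terminal states so that no further reward accumulates past them.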
In this chapter, you’ll learn about algorithms for solving MDPs. We first discuss the objective of an agent and why simple plans aren’t sufficient to solve MDPs. We then talk about the two fundamental algorithms for solving MDPs under a technique called dynamic programming: value iteration (VI) and policy iteration (PI).
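As a preview of where the chapter is headed, the sketch below implements policy iteration: repeatedly evaluate the current policy, then improve it greedily with respect to the resulting values, until the policy stops changing. The two-state MDP, the `gamma` and `theta` values, and the function names are illustrative assumptions, not the book's reference implementation.

```python
# Hypothetical two-state MDP, gym-style:
# P[s][a] = [(probability, next_state, reward, done), ...]
P = {
    0: {0: [(0.8, 0, 0.0, False), (0.2, 1, 1.0, True)],
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}

def policy_evaluation(pi, P, gamma=0.99, theta=1e-10):
    """Iteratively compute the state-value function V of policy pi."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Expected return of taking pi[s] in s, then following pi.
            v = sum(p * (r + gamma * V[ns] * (not done))
                    for p, ns, r, done in P[s][pi[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_improvement(V, P, gamma=0.99):
    """Return the policy that is greedy with respect to V."""
    return {s: max(P[s],
                   key=lambda a: sum(p * (r + gamma * V[ns] * (not done))
                                     for p, ns, r, done in P[s][a]))
            for s in P}

def policy_iteration(P, gamma=0.99):
    """Alternate evaluation and improvement until the policy is stable."""
    pi = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    while True:
        V = policy_evaluation(pi, P, gamma)
        new_pi = policy_improvement(V, P, gamma)
        if new_pi == pi:
            return V, pi
        pi = new_pi

V, pi = policy_iteration(P)
```

On this toy MDP, the loop converges after a couple of sweeps, picking in state 0 the action that most likely jumps straight to the rewarding terminal state. The chapter develops each of these pieces in detail.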