3 Balancing immediate and long-term goals
In this chapter
- You will learn about the challenges of learning from sequential feedback and how to properly balance immediate and long-term goals.
- You will develop algorithms that can find the best policies of behavior in sequential decision-making problems modeled with MDPs.
- You will find the optimal policies for all environments for which you built MDPs in the previous chapter.
In preparing for battle I have always found that plans are useless, but planning is indispensable.
— Dwight D. Eisenhower, United States Army five-star general and 34th President of the United States
In the last chapter, you built an MDP for the BW, BSW, and FL environments. MDPs are the engines that drive RL environments. They define the problem: they describe how the agent interacts with the environment through the state and action spaces, the agent's goal through the reward function, how the environment reacts to the agent's actions through the transition function, and how time should impact behavior through the discount factor.
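To make these components concrete, here is a minimal sketch of how an MDP can be encoded in Python. It uses a gym-style nested dictionary, `P[state][action] = [(probability, next_state, reward, done), ...]`, which packs the transition function and reward function together; the two-state environment itself is a hypothetical example, not one of the environments built in the previous chapter.

```python
# Hypothetical two-state MDP in the gym-style format:
# P[state][action] = [(probability, next_state, reward, done), ...]
P = {
    0: {  # non-terminal state
        0: [(0.8, 0, 0.0, False), (0.2, 1, 1.0, True)],  # action 0
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)],  # action 1
    },
    1: {  # terminal state: every action self-loops with no reward
        0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)],
    },
}

gamma = 0.99  # discount factor: how time should impact behavior

# Sanity check: transition probabilities out of each
# state-action pair must sum to 1.
for s in P:
    for a in P[s]:
        assert abs(sum(p for p, _, _, _ in P[s][a]) - 1.0) < 1e-9
```

The state and action spaces are implicit in the dictionary keys; the `done` flag marks transitions into terminal states so that no further reward accumulates past them.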
In this chapter, you’ll learn about algorithms for solving MDPs. We first discuss the objective of an agent and why simple plans aren’t sufficient to solve MDPs. We then talk about the two fundamental algorithms for solving MDPs under a technique called dynamic programming: value iteration (VI) and policy iteration (PI).
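As a preview of where the chapter is headed, the sketch below implements policy iteration: repeatedly evaluate the current policy, then improve it greedily with respect to the resulting values, until the policy stops changing. The two-state MDP, the `gamma` and `theta` values, and the function names are illustrative assumptions, not the book's reference implementation.

```python
# Hypothetical two-state MDP, gym-style:
# P[s][a] = [(probability, next_state, reward, done), ...]
P = {
    0: {0: [(0.8, 0, 0.0, False), (0.2, 1, 1.0, True)],
        1: [(0.8, 1, 1.0, True), (0.2, 0, 0.0, False)]},
    1: {0: [(1.0, 1, 0.0, True)],
        1: [(1.0, 1, 0.0, True)]},
}

def policy_evaluation(pi, P, gamma=0.99, theta=1e-10):
    """Iteratively compute the state-value function V of policy pi."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Expected return of taking pi[s] in s, then following pi.
            v = sum(p * (r + gamma * V[ns] * (not done))
                    for p, ns, r, done in P[s][pi[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:
            return V

def policy_improvement(V, P, gamma=0.99):
    """Return the policy that is greedy with respect to V."""
    return {s: max(P[s],
                   key=lambda a: sum(p * (r + gamma * V[ns] * (not done))
                                     for p, ns, r, done in P[s][a]))
            for s in P}

def policy_iteration(P, gamma=0.99):
    """Alternate evaluation and improvement until the policy is stable."""
    pi = {s: next(iter(P[s])) for s in P}  # arbitrary initial policy
    while True:
        V = policy_evaluation(pi, P, gamma)
        new_pi = policy_improvement(V, P, gamma)
        if new_pi == pi:
            return V, pi
        pi = new_pi

V, pi = policy_iteration(P)
```

On this toy MDP, the loop converges after a couple of sweeps, picking in state 0 the action that most likely jumps straight to the rewarding terminal state. The chapter develops each of these pieces in detail.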