4 Perfect knowledge, optimal policy: dynamic programming

 

This chapter covers

  • Break problems into stages using Bellman’s principle of optimality.
  • Solve Markov decision processes with policy iteration and value iteration.
  • Apply dynamic programming to a resource allocation problem, with a numerical example and code.
  • Understand the limitations of dynamic programming.
In preparing for battle, I have always found that plans are useless, but planning is indispensable.

Dwight D. Eisenhower, 34th president of the United States

So far, we’ve learned how to formulate a sequential decision-making problem using the Markov decision process (MDP) framework. We also explored how to build custom environments that reflect the complexity of the problems we want to solve. Now it’s time to tackle the next big step: learning how to actually solve a problem once it has been cast as an MDP. Broadly speaking, there are two major categories of solution methods:

  • Model-based methods: These rely on having or learning a model of the environment, that is, its transition probabilities and reward function, and use that model to plan ahead.
  • Model-free methods: These bypass the model entirely. Instead of relying on a known transition function, they learn purely from experience, meaning actual sampled transitions. Experience in this context is a tuple of (state, action, reward, next state), as illustrated in the sketch after this list.
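
To make the distinction concrete, here is a minimal Python sketch of what each family of methods works with. It uses a toy two-state, two-action MDP whose names and numbers are purely illustrative (it is not the resource allocation example from later in this chapter): a model-based method assumes access to the transition probabilities and reward function, while a model-free method only ever sees sampled experience tuples.

from collections import namedtuple

# A single piece of experience: the only thing a model-free method learns from.
Experience = namedtuple("Experience", ["state", "action", "reward", "next_state"])

# What a model-based method assumes it has (or learns): the MDP model itself.
# Toy two-state, two-action MDP; all values are illustrative.
# transition_probs[(state, action)] maps next_state -> probability
transition_probs = {
    ("s0", "a0"): {"s0": 0.7, "s1": 0.3},
    ("s0", "a1"): {"s0": 0.2, "s1": 0.8},
    ("s1", "a0"): {"s0": 0.4, "s1": 0.6},
    ("s1", "a1"): {"s0": 0.9, "s1": 0.1},
}
# rewards[(state, action, next_state)] is the immediate reward
rewards = {
    ("s0", "a0", "s0"): 0.0, ("s0", "a0", "s1"): 1.0,
    ("s0", "a1", "s0"): 0.0, ("s0", "a1", "s1"): 1.0,
    ("s1", "a0", "s0"): 2.0, ("s1", "a0", "s1"): 0.0,
    ("s1", "a1", "s0"): 2.0, ("s1", "a1", "s1"): 0.0,
}

# A model-free method never sees the dictionaries above; it only collects
# sampled tuples like this one by interacting with the environment.
sample = Experience(state="s0", action="a1", reward=1.0, next_state="s1")
print(sample)

Dynamic programming, the subject of this chapter, sits squarely in the model-based camp: it assumes the dictionaries above (the full model) are known and exact.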

4.1 Paradigms for solving Markov decision processes

4.2 The domino decision rule: Bellman equations

4.3 Solving Bellman equations: generalized policy iteration

4.4 Hands-on code: solving a resource allocation problem

4.5 Limitations of dynamic programming

4.6 Summary