Chapter 2

2 Modeling Reinforcement Learning Problems: Markov Decision Processes


This chapter covers:

  • String diagrams and our teaching methods
  • The PyTorch deep learning framework
  • Solving N-armed bandit problems
  • Balancing exploration versus exploitation
  • Modeling a problem as a Markov decision process (MDP)
  • Implementing a neural network to solve an advertisement selection problem

2.1   String Diagrams and Our Teaching Methods

This chapter covers some of the most fundamental concepts in all of reinforcement learning and will form the basis for the rest of the book. But before we get into that, we want to first go over some of the recurring teaching methods we’ll employ in this book, most notably the string diagrams we mentioned in the last chapter.

From our experience, when most people try to teach something complicated, they tend to teach it in the reverse order from which the topic itself was developed. They’ll give you a bunch of definitions, terms, descriptions, and perhaps theorems, and then they’ll say, “Great, now that we’ve covered all the theory, let’s go over some practice problems.” In our opinion, that’s exactly the opposite of the order in which things should be presented. Most good ideas arise as solutions to real problems in the world, or at least imagined problems. The problem-solver stumbles across a potential solution, tests it, and improves it, and eventually it gets formalized and possibly mathematized. The terms and definitions come only after a solution to the problem has been worked out.

2.2   Solving the Multi-Armed Bandit

2.3   Applying Bandits to Optimize Ad Placements

2.4   Building Networks with PyTorch

2.5   Solving Contextual Bandits

2.6   The Markov Property

2.7   Predicting Future Rewards: Value and Policy Functions

2.8   Chapter Summary

2.9   What’s Next?