Chapter 2. Modeling reinforcement learning problems: Markov decision processes


This chapter covers

  • String diagrams and our teaching methods
  • The PyTorch deep learning framework
  • Solving n-armed bandit problems
  • Balancing exploration versus exploitation
  • Modeling a problem as a Markov decision process (MDP)
  • Implementing a neural network to solve an advertisement selection problem

This chapter covers some of the most fundamental concepts in all of reinforcement learning, and it will be the basis for the rest of the book. Before we get into that, we want to go over some of the recurring teaching methods we’ll employ in this book, most notably the string diagrams we mentioned in the last chapter.

2.1. String diagrams and our teaching methods

In our experience, when most people try to teach something complicated, they tend to teach it in the reverse of the order in which the topic itself was developed. They’ll give you a bunch of definitions, terms, descriptions, and perhaps theorems, and then they’ll say, “Great, now that we’ve covered all the theory, let’s go over some practice problems.” In our opinion, that’s exactly the opposite of the order in which things should be presented. Most good ideas arise as solutions to real-world problems, or at least imagined problems. The problem-solver stumbles across a potential solution, tests it, improves it, and then eventually formalizes and possibly mathematizes it. The terms and definitions come after the solution to the problem has been developed.

2.2. Solving the multi-armed bandit

2.3. Applying bandits to optimize ad placements

2.4. Building networks with PyTorch

2.5. Solving contextual bandits

2.6. The Markov property

2.7. Predicting future rewards: Value and policy functions