Chapter 2

2 Modeling Reinforcement Learning Problems: Markov Decision Processes


This chapter covers:

  • String diagrams and our teaching methods
  • The PyTorch deep learning framework
  • Solving N-armed bandit problems
  • Balancing exploration versus exploitation
  • Modeling a problem as a Markov decision process (MDP)
  • Implementing a neural network to solve an advertisement selection problem

2.1   String Diagrams and Our Teaching Methods

This chapter covers some of the most fundamental concepts in all of reinforcement learning and will form the basis for the rest of the book. But before we get into that, we want to first go over some of the recurring teaching methods we’ll employ in this book, most notably the string diagrams we mentioned in the last chapter.

From our experience, when most people try to teach something complicated, they tend to teach it in the reverse order from which the topic itself was developed. They’ll give you a bunch of definitions, terms, descriptions, and perhaps theorems, and then they’ll say, “Great, now that we’ve covered all the theory, let’s go over some practice problems.” In our opinion, that’s exactly the opposite of the order in which things should be presented. Most good ideas arise as solutions to real problems in the world, or at least imagined problems. The problem-solver stumbles across a potential solution, tests it, and improves it, and eventually it gets formalized and possibly mathematized. The terms and definitions come only after a solution to the problem has been worked out.

2.2   Solving the Multi-Armed Bandit

2.3   Applying Bandits to Optimize Ad Placements

2.4   Building Networks with PyTorch

2.5   Solving Contextual Bandits

2.6   The Markov Property

2.7   Predicting Future Rewards: Value and Policy Functions

2.8   Chapter Summary

2.9   What’s Next?