Chapter 2. Modeling reinforcement learning problems: Markov decision processes

This chapter covers

String diagrams and our teaching methods
The PyTorch deep learning framework
Solving n-armed bandit problems
Balancing exploration versus exploitation
Modeling a problem as a Markov decision process (MDP)
Implementing a neural network to solve an advertisement selection problem

This chapter covers some of the most fundamental concepts in reinforcement learning, and it will be the basis for the rest of the book. Before we get into that, we want to go over some of the recurring teaching methods we’ll employ in this book, most notably the string diagrams we mentioned in the last chapter.

2.1. String diagrams and our teaching methods

In our experience, when most people try to teach something complicated, they tend to teach it in the reverse of the order in which the topic itself was developed. They’ll give you a bunch of definitions, terms, descriptions, and perhaps theorems, and then they’ll say, “Great, now that we’ve covered all the theory, let’s go over some practice problems.” In our opinion, that’s exactly the opposite of the order in which things should be presented. Most good ideas arise as solutions to real-world problems, or at least imagined problems. The problem-solver stumbles across a potential solution, tests it, improves it, and then eventually formalizes and possibly mathematizes it. The terms and definitions come after the solution to the problem has been developed.
