
6 Scheduling with tabular reinforcement learning

 

This chapter covers

  • Temporal-difference learning and its importance.
  • On-policy vs. off-policy methods and their differences.
  • Implementing Q-learning and SARSA.
  • Implementing eligibility traces.
When you take risks, you learn that there will be times when you succeed, and there will be times when you fail, and both are equally important.

Ellen DeGeneres, American comedian.

If we had to distill reinforcement learning down to its beating heart—the concepts without which the entire field would collapse—two ideas would stand out above all others: generalized policy iteration and temporal-difference learning. These are not just passing technical details; they are the pillars on which almost every reinforcement learning algorithm rests. Whether it’s a simple method you can run on a whiteboard or a massive system like AlphaGo (an AI program developed by Google DeepMind that defeated human Go world champions), you’ll find traces of generalized policy iteration, temporal-difference learning, or, more often, both.

We’ve already taken the time to unpack generalized policy iteration: the elegant dance between policy evaluation and policy improvement. Now, it’s time to roll up our sleeves and dive into the other half of this story—the one idea I would confidently call the most important concept in reinforcement learning: temporal-difference learning.
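To ground that idea before the sections that follow, here is a minimal sketch of the one-step update temporal-difference learning revolves around: nudge the current value estimate toward a target built from the reward just observed plus the discounted estimate of the next state. The function name td0_update, the toy states "A" and "B", and the step-size and discount values below are illustrative placeholders, not this chapter's scheduling environment.

from collections import defaultdict

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Move V(state) a small step toward the one-step TD target: r + gamma * V(next_state)."""
    td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]   # how "surprised" the current estimate was
    V[state] += alpha * td_error
    return td_error

# Toy usage with hypothetical states "A" -> "B" and a reward of 1.0.
V = defaultdict(float)
print(td0_update(V, "A", 1.0, "B"))  # prints 1.0: the initial estimate was fully surprised
print(V["A"])                        # prints 0.1: V("A") moved a small step toward the target

The rest of the chapter builds on exactly this kind of update: SARSA and Q-learning apply it to action values rather than state values, and eligibility traces (TD(λ)) let a single surprise update many recently visited states at once.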

6.1 Temporal-difference learning

6.2 A concrete example: restaurant table scheduling

6.3 Off-policy vs. on-policy learning

6.4 Tabular reinforcement learning: Q-learning and SARSA

6.4.1 SARSA: learning from what you actually do

6.4.2 Q-learning: learning from what you should do

6.5 TD(λ) and eligibility traces

6.6 Gas station fuel purchase scheduling with tabular methods

6.7 Summary