6 Scheduling with tabular reinforcement learning
This chapter covers
- Temporal-difference learning and its importance.
- The differences between on-policy and off-policy methods.
- Implementing Q-learning and SARSA.
- Implementing eligibility traces.
When you take risks, you learn that there will be times when you succeed, and there will be times when you fail, and both are equally important.
Ellen DeGeneres, American comedian.
If we had to distill reinforcement learning down to its beating heart—the concepts without which the entire field would collapse—two ideas would stand out above all others: generalized policy iteration and temporal-difference learning. These are not just passing technical details; they are the pillars on which almost every reinforcement learning algorithm rests. Whether it’s a simple method you can run on a whiteboard or a massive system like AlphaGo (an AI program developed by Google DeepMind that defeated human Go world champions), you’ll find traces of generalized policy iteration, temporal-difference learning, or more often, both.
We’ve already taken the time to unpack generalized policy iteration: the elegant dance between policy evaluation and policy improvement. Now, it’s time to roll up our sleeves and dive into the other half of this story—the one idea I would confidently call the most important concept in reinforcement learning: temporal-difference learning.
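To make the idea concrete before we unpack it properly, here is a minimal sketch of tabular TD(0) prediction: estimating the state-value function of a fixed policy by nudging each estimate toward a target that bootstraps on the current estimate of the next state. The sketch assumes a Gymnasium-style environment with a discrete state space; the function name `td0_prediction` and the hyperparameter values are illustrative choices, not something prescribed by this chapter.

```python
import numpy as np


def td0_prediction(env, policy, gamma=0.99, alpha=0.1, n_episodes=500):
    """Estimate the state-value function V of a fixed policy with TD(0).

    Assumes a Gymnasium-style discrete environment: env.reset() returns
    (state, info) and env.step(action) returns
    (next_state, reward, terminated, truncated, info).
    `policy` is a callable mapping a state to an action.
    """
    V = np.zeros(env.observation_space.n)
    for _ in range(n_episodes):
        state, _ = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            # The TD target bootstraps on the current estimate V[next_state];
            # a truly terminal state contributes no future value.
            td_target = reward + gamma * V[next_state] * (not terminated)
            td_error = td_target - V[state]   # the temporal-difference error
            V[state] += alpha * td_error      # nudge V[state] toward the target
            state = next_state
            done = terminated or truncated
    return V
```

The key point to notice, and the one we return to throughout the chapter, is that the update happens at every step, using an estimate built from another estimate, rather than waiting for the episode to finish as Monte Carlo methods do.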