5 Reward Modeling
This chapter covers
- How language models are trained to predict human preferences
- How to implement reward models in PyTorch
- The different varieties of reward models used today
Reward models are core to the modern approach to RLHF because they are where complex human preferences are learned. They are what enable our models to learn from signals that are hard to specify directly. A reward model compresses the complex features in preference data into a representation that can be used in downstream training, a kind of magic that once again demonstrates the remarkable capacity of modern deep learning. These models serve as the proxy objectives over which the core optimization is performed, as studied in the following chapters.
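As a preview of the implementations later in this chapter, the sketch below shows the basic shape of a reward model: a language-model backbone with a scalar head that maps a tokenized prompt-response pair to a single reward score. The `ToyBackbone`, the hidden size, and the choice to score the final token's hidden state are illustrative assumptions for this sketch, not the chapter's reference implementation.

```python
import torch
import torch.nn as nn

class ToyBackbone(nn.Module):
    """Stand-in for a pretrained language model; returns per-token hidden states."""
    def __init__(self, vocab_size: int = 1000, hidden_size: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        return self.embed(input_ids)  # (batch, seq_len, hidden_size)

class RewardModel(nn.Module):
    """A language-model backbone plus a scalar head: the model reads a
    prompt-response pair and outputs a single reward score."""
    def __init__(self, backbone: nn.Module, hidden_size: int = 64):
        super().__init__()
        self.backbone = backbone
        self.score_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor) -> torch.Tensor:
        hidden_states = self.backbone(input_ids)          # (batch, seq, hidden)
        last_hidden = hidden_states[:, -1, :]             # score the final token
        return self.score_head(last_hidden).squeeze(-1)   # (batch,) scalar rewards

# Usage: score a batch of tokenized prompt-response sequences.
model = RewardModel(ToyBackbone())
tokens = torch.randint(0, 1000, (2, 16))  # 2 sequences of 16 token ids
print(model(tokens))                       # tensor of 2 scalar rewards
```

The rest of the chapter fills in what this sketch leaves out: how the preference data is constructed, which loss is used to train the scalar head, and the different varieties of reward models built on this same backbone-plus-head pattern.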