5 Reward Modeling


This chapter covers

  • How language models are trained to predict human preferences
  • How to implement reward models in PyTorch
  • The different varieties of reward models used today

Reward models are core to the modern approach to RLHF: they are where complex human preferences are learned. They are what enable our models to learn from signals that are hard to specify directly. They compress the rich features of preference data into a representation that can be used in downstream training, a sort of magic that once again shows the remarkable capacity of modern deep learning. These models act as the proxy objectives by which the core optimization is done, as studied in the following chapters.
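As a preview of the implementation covered later in this chapter, the sketch below shows the core idea in PyTorch: the reward model assigns a scalar score to each completion, and a pairwise (Bradley-Terry style) loss pushes the score of the human-preferred completion above that of the rejected one. This is a minimal sketch under standard assumptions; the function and tensor names are illustrative and not taken from the chapter's code.

```python
import torch
import torch.nn.functional as F

def pairwise_rm_loss(chosen_rewards: torch.Tensor,
                     rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise reward-model loss.

    chosen_rewards / rejected_rewards: shape (batch,) scalar scores produced
    by the reward model for the preferred and dispreferred completions of
    the same prompt.
    """
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch:
    # minimized when the chosen completion scores higher than the rejected one.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Example with made-up scores for a batch of three preference pairs.
chosen = torch.tensor([1.2, 0.3, 2.0])
rejected = torch.tensor([0.4, 0.5, 1.1])
loss = pairwise_rm_loss(chosen, rejected)
```

In practice the scalar scores come from a language model backbone with a small scoring head on top, a setup discussed in the architecture and implementation sections that follow.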

5.1 Training Reward Models

5.2 Architecture

5.3 Implementation Example

5.4 Variants

5.4.1 Preference Margin Loss

5.4.2 Balancing Multiple Comparisons Per Prompt

5.4.3 K-wise Loss Function

5.5 Outcome Reward Models

5.6 Process Reward Models

5.7 Reward Models vs. Outcome RMs vs. Process RMs vs. Value Functions

5.8 Generative Reward Modeling

5.9 Further Reading