
5 Reward Modeling

This chapter covers

  • How language models are trained to predict human preferences
  • How reward models are architected and implemented
  • The different varieties of reward models used today

Reward models are core to the modern approach to RLHF: they are where complex human preferences are learned. They are what enable our models to learn from hard-to-specify signals, compressing complex features in the data into a representation that can be used in downstream training, a sort of magic that once again shows the capacity of modern deep learning. These models act as proxy objectives for the core optimization studied in the following chapters.
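As a preview of the training objective developed in section 5.1, the sketch below shows the pairwise Bradley-Terry loss that turns preference comparisons into a scalar proxy objective. It is a minimal illustration, not the chapter's full implementation: the reward values are made-up placeholders standing in for a reward model's scalar outputs.

import torch
import torch.nn.functional as F

# Scalar rewards from a reward model for a batch of preference pairs
# (placeholder values; in practice these come from the model's forward pass).
chosen_rewards = torch.tensor([1.3, 0.4, 2.1])
rejected_rewards = torch.tensor([0.2, 0.9, -0.5])

# Bradley-Terry pairwise loss: maximize the log-probability that the
# chosen response scores higher than the rejected one,
#   loss = -log sigmoid(r_chosen - r_rejected)
loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
print(loss)  # lower loss means chosen responses are ranked above rejected ones

Everything downstream in RLHF optimizes against scalar outputs like these, which is what makes the reward model a proxy for the underlying human preferences.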

5.1 Training a Bradley-Terry Reward Model

5.2 The Default Reward Model Architecture

5.3 Implementation Example

5.4 Reward Model Variants

5.4.1 Preference Margin Loss

5.4.2 Balancing Multiple Comparisons Per Prompt

5.4.3 K-wise Loss Function

5.5 Outcome Reward Models

5.6 Process Reward Models

5.7 Comparing Reward Model Types (and Value Functions)

5.7.1 Inference Across Reward Model Types

5.8 Generative Reward Modeling (a.k.a. LLM-as-a-judge)

5.9 Further Reading

Summary