7 Reward modeling
Reward models are core to the modern approach to RLHF. Reward models have been used extensively in reinforcement learning research as a proxy for environment rewards [1]. The practice is closely related to inverse reinforcement learning, where the problem is to approximate an agent’s reward function given trajectories of behavior [2], and to other areas of deep reinforcement learning. Reward models were proposed, in their modern form, as a tool for studying the value alignment problem [3].
The most common reward model predicts the probability that a piece of text is the "preferred" one in a pairwise comparison drawn from the training data. Later in this section we also compare these to Outcome Reward Models (ORMs), which predict the probability that a completion results in a correct answer, and Process Reward Models (PRMs), which assign a score to each step of reasoning. When not otherwise indicated, the reward models mentioned are those predicting preference between texts.
7.1 Training Reward Models
There are two popular expressions for how to train a standard reward model for RLHF, and they are numerically equivalent. The canonical implementation is derived from the Bradley-Terry model of preference [4]. A Bradley-Terry model of preferences measures the probability that, for two events \(i\) and \(j\) drawn from the same distribution, the pairwise comparison satisfies \(i > j\):
\[P(i > j) = \frac{p_i}{p_i + p_j}\]
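If the reward model assigns a scalar score \(r_\theta(x)\) to each completion and we set \(p_i \propto e^{r_\theta(x_i)}\), this probability reduces to a sigmoid of the score difference, which gives the standard pairwise training loss. The snippet below is a minimal sketch of that loss in PyTorch; the function name `bradley_terry_loss` and the placeholder reward tensors are illustrative, not taken from any particular codebase, and in practice the scores would come from a reward model head run over chosen and rejected completions.

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss derived from the Bradley-Terry model.

    With p_i proportional to exp(r_i), the Bradley-Terry probability
    P(i > j) = p_i / (p_i + p_j) becomes sigmoid(r_i - r_j), so the
    negative log-likelihood of the preferred completion winning is
    -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Illustrative usage with placeholder scalar rewards; in a real setup these
# are the reward model's scores for chosen and rejected completions.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(bradley_terry_loss(chosen, rejected))
```

Minimizing this loss pushes the score of the preferred completion above that of the rejected one, which is the training signal used throughout the rest of this section.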