7 Reward modeling
Reward models are core to the modern approach to RLHF. Reward models have been used extensively in reinforcement learning research as a proxy for environment rewards [1]. The practice is closely related to inverse reinforcement learning, where the problem is to approximate an agent’s reward function given trajectories of behavior [2], and to other areas of deep reinforcement learning. Reward models were proposed, in their modern form, as a tool for studying the value alignment problem [3].
The most common reward model predicts the probability that a piece of text is the "preferred" one in a pairwise comparison drawn from the training data. Later in this section we also compare these to Outcome Reward Models (ORMs), which predict the probability that a completion results in a correct answer, and Process Reward Models (PRMs), which assign a score to each step of reasoning. When not otherwise indicated, the reward models mentioned are those predicting preference between texts.
7.1 Training Reward Models
There are two popular expressions for how to train a standard reward model for RLHF, and they are numerically equivalent. The canonical implementation is derived from the Bradley-Terry model of preference [4]. A Bradley-Terry model of preferences measures the probability that, for two events \(i\) and \(j\) drawn from the same distribution, the pairwise comparison satisfies \(i > j\):
\[P(i > j) = \frac{p_i}{p_i + p_j}\]
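If the reward model assigns a scalar score \(r_\theta(x)\) to each completion and we set \(p_i \propto e^{r_\theta(x_i)}\), this probability reduces to a sigmoid of the score difference, which gives the standard pairwise training loss. The snippet below is a minimal sketch of that loss in PyTorch; the function name `bradley_terry_loss` and the placeholder reward tensors are illustrative, not taken from any particular codebase, and in practice the scores would come from a reward model head run over chosen and rejected completions.

```python
import torch
import torch.nn.functional as F


def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss derived from the Bradley-Terry model.

    With p_i proportional to exp(r_i), the Bradley-Terry probability
    P(i > j) = p_i / (p_i + p_j) becomes sigmoid(r_i - r_j), so the
    negative log-likelihood of the preferred completion winning is
    -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()


# Illustrative usage with placeholder scalar rewards; in a real setup these
# are the reward model's scores for chosen and rejected completions.
chosen = torch.tensor([1.2, 0.4, 2.0])
rejected = torch.tensor([0.3, 0.9, -0.5])
print(bradley_terry_loss(chosen, rejected))
```

Minimizing this loss pushes the score of the preferred completion above that of the rejected one, which is the training signal used throughout the rest of this section.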