18 Over-optimization
In the RLHF literature and discourse, there are two primary directions in which over-optimization can emerge:
- Quantitative research on the technical notion of over-optimization of reward. This work measures optimization distance and power against training metrics and downstream performance: the training metric keeps going up while downstream performance eventually goes down.
- Qualitative observations that "overdoing" RLHF can result in worse models. These reflect fundamental limitations in the RLHF problem setup, its measurement tools, and its trade-offs.
This chapter provides a cursory introduction to both. We begin with the latter, qualitative view, because it motivates why the problem is worth studying further. The chapter concludes with a brief discussion of misalignment, where overdoing RLHF or related techniques can make a language model behave against its design.
Over-optimization describes the situation where the training metric becomes mismatched with the final evaluations of interest. It is similar to over-fitting – where one trains on data that is too narrow relative to the downstream evaluations that test generalization – but in the RL literature, over-optimization indicates that an external signal has been optimized against too heavily. The cost of over-optimization is lower alignment with real-world goals or lower quality in the domain of interest, and the characteristic training shape is shown in Figure 18.1.
Figure 18.1 Over-optimization of an RL training run vs. downstream evaluations.
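To make the shape in Figure 18.1 concrete, the sketch below plots two invented curves: a proxy (training) reward that rises monotonically with optimization distance from the initial policy, and a downstream evaluation that peaks and then declines. The functional forms, constants, and the interpretation of "optimization distance" (e.g., a KL-based measure) are all hypothetical choices made only to reproduce the qualitative shape; they are not taken from any measured run.

```python
# Illustrative sketch of the over-optimization shape: the proxy reward the
# optimizer sees keeps rising, while the downstream ("gold") evaluation
# peaks and then degrades. All curves and constants are invented.
import numpy as np
import matplotlib.pyplot as plt

# Optimization "distance" from the initial policy (hypothetical units,
# e.g. a function of KL divergence from the starting model).
d = np.linspace(0.0, 10.0, 200)

# Proxy reward: the training metric; it increases monotonically.
proxy_reward = 1.5 * np.log1p(d)

# Downstream evaluation: improves at first, then declines as the policy
# exploits the proxy instead of the real goal (hypothetical shape).
downstream = 1.2 * np.log1p(d) - 0.08 * d**2

plt.plot(d, proxy_reward, label="training (proxy) reward")
plt.plot(d, downstream, label="downstream evaluation")
plt.axvline(d[np.argmax(downstream)], linestyle="--", color="gray",
            label="downstream peak")
plt.xlabel("optimization distance from initial policy")
plt.ylabel("score (arbitrary units)")
plt.title("Characteristic over-optimization shape")
plt.legend()
plt.show()
```

In practice, the downstream curve is only observable through separate held-out evaluations, which is why monitoring more than the training reward is necessary to notice the divergence at all.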