18 Over-optimization
In the RLHF literature and discourse, there are two primary directions in which over-optimization can emerge:
- Quantitative research on the technical notion of over-optimization of reward. This work measures optimization distance and power against training metrics and downstream performance: the training metric keeps going up while downstream performance eventually goes down.
- Qualitative observations that "overdoing" RLHF can result in worse models. These reflect fundamental limitations in the RLHF problem setup, its measurement tools, and its trade-offs.
This chapter provides a cursory introduction to both. We begin with the latter, qualitative view, because it motivates why the problem is worth studying further. The chapter concludes with a brief discussion of misalignment, where overdoing RLHF or related techniques can make a language model behave against its design.
Over-optimization describes the situation where the training metric becomes mismatched with the final evaluations of interest. It is similar to over-fitting – where one trains on data that is too narrow relative to the downstream evaluations that test generalization – but in the RL literature, over-optimization indicates that an external signal has been optimized against too heavily. The cost of over-optimization is lower alignment with real-world goals or lower quality in the domain of interest, and the characteristic training shape is shown in Figure 18.1.
Figure 18.1 Over-optimization of an RL training run vs. downstream evaluations.
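To make the shape in Figure 18.1 concrete, the sketch below plots two invented curves: a proxy (training) reward that rises monotonically with optimization distance from the initial policy, and a downstream evaluation that peaks and then declines. The functional forms, constants, and the interpretation of "optimization distance" (e.g., a KL-based measure) are all hypothetical choices made only to reproduce the qualitative shape; they are not taken from any measured run.

```python
# Illustrative sketch of the over-optimization shape: the proxy reward the
# optimizer sees keeps rising, while the downstream ("gold") evaluation
# peaks and then degrades. All curves and constants are invented.
import numpy as np
import matplotlib.pyplot as plt

# Optimization "distance" from the initial policy (hypothetical units,
# e.g. a function of KL divergence from the starting model).
d = np.linspace(0.0, 10.0, 200)

# Proxy reward: the training metric; it increases monotonically.
proxy_reward = 1.5 * np.log1p(d)

# Downstream evaluation: improves at first, then declines as the policy
# exploits the proxy instead of the real goal (hypothetical shape).
downstream = 1.2 * np.log1p(d) - 0.08 * d**2

plt.plot(d, proxy_reward, label="training (proxy) reward")
plt.plot(d, downstream, label="downstream evaluation")
plt.axvline(d[np.argmax(downstream)], linestyle="--", color="gray",
            label="downstream peak")
plt.xlabel("optimization distance from initial policy")
plt.ylabel("score (arbitrary units)")
plt.title("Characteristic over-optimization shape")
plt.legend()
plt.show()
```

In practice, the downstream curve is only observable through separate held-out evaluations, which is why monitoring more than the training reward is necessary to notice the divergence at all.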