16 Over Optimization
This chapter covers
- What over-optimization is and how to recognize it
- Qualitative failures: over-refusal, sycophancy, and repetitiveness
- The relationship between KL distance and model quality
A core lesson one learns when using reinforcement learning heavily in their domain is that it is a very strong optimizer, which causes it to pull all the possible increase in reward out of the environment. In modern ML systems, especially with language models, we’re using somewhat contrived notions of environment where the models generate completions (the actions) and an external verifier, i.e. a reward model or a scoring function provides feedback. In this domain, it is common for over-optimization to occur, where the RL optimizers push the language models in directions where the generations satisfy our checker functions, but the behavior does not align with our training goals. This chapter provides an overview of this classic case of over-optimization.