11 Preference Data
This chapter covers
- The core data engine behind RLHF
- Why collecting human data can be extremely challenging
- Decisions to be made when collecting preference data
- Open questions in preference data
Preference data is the engine of preference fine-tuning and reinforcement learning from human feedback. The core problem we are trying to solve with RLHF is that we cannot precisely model human rewards and preferences for AI models' outputs, i.e., we cannot write clearly defined loss functions to optimize against, so preference data is the proxy signal we use to tune our models. The data allows us to match behaviors we desire and to avoid failure modes we want to eliminate. It is such a rich signal that this style of optimization has proven difficult to replace entirely.

Within preference fine-tuning, many methods for collecting and using this data have been proposed, and because human preferences cannot be captured in a clear reward function, many more will follow; collecting labeled preference data sits at the center of RLHF and related techniques. Today, two intertwined challenges shape this chapter: 1) the operational complexity and cost of collection, and 2) the need to collect preferences over generations from the model being trained (called "on-policy" data).
In this chapter, we detail technical decisions on how the data is formatted and organizational practices for collecting it.
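Before the detailed discussion, the sketch below illustrates one common way pairwise preference data is represented: each record pairs a prompt with a preferred ("chosen") and a dispreferred ("rejected") completion, and an annotator's ranking over K completions can be expanded into K(K-1)/2 such pairs. This is a minimal illustration; the names `PreferenceExample` and `build_pairs` are assumptions for this sketch, not a format defined in this book.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreferenceExample:
    """One labeled comparison: a prompt plus a preferred and a rejected completion."""
    prompt: str
    chosen: str    # completion the annotator preferred
    rejected: str  # completion the annotator ranked lower


def build_pairs(prompt: str, completions: List[str], ranking: List[int]) -> List[PreferenceExample]:
    """Expand a ranked list of K completions for one prompt into pairwise comparisons.

    `ranking` holds indices into `completions`, best first. Every (better, worse)
    pair becomes one training example, giving K*(K-1)/2 pairs per prompt.
    """
    pairs = []
    for i, better in enumerate(ranking):
        for worse in ranking[i + 1:]:
            pairs.append(PreferenceExample(prompt, completions[better], completions[worse]))
    return pairs


if __name__ == "__main__":
    # Illustrative data only: three on-policy completions ranked B > A > C by an annotator.
    prompt = "Explain what on-policy preference data means."
    completions = ["Completion A ...", "Completion B ...", "Completion C ..."]
    for pair in build_pairs(prompt, completions, ranking=[1, 0, 2]):
        print(pair.chosen[:14], ">", pair.rejected[:14])
```

In practice, the completions in each record would be sampled from the model being trained (keeping the data on-policy), and the resulting pairs feed reward model training or direct preference optimization; the rest of the chapter covers how these records are collected and formatted.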