8 Direct Alignment Algorithms
This chapter covers
- The derivation of the first DAA, Direct Preference Optimization (DPO), from first principles
- Intuitions for DPO and other related algorithms
- What to consider when using DAAs yourself
Direct Alignment Algorithms (DAAs) allow one to update models to solve the same RLHF objective, shown again in eq. 8.1, without ever training an intermediate reward model or using reinforcement learning optimizers. They solve the same preference learning problem we’ve been studying (with literally the same data!) in order to make language models more aligned, smarter, and easier to use. Because DAAs need neither a reward model nor online optimization, they are far simpler to implement, reduce the compute spent during training, and make experimentation easier. This chapter details the mathematics used to derive these algorithms, and then shows that the sometimes tedious derivations result in simple implementations.
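As a preview of that simplicity, the sketch below shows roughly what a DAA training loss looks like in PyTorch, using the loss of the most common DAA (DPO, introduced next) as the example. The function name, argument names, and the default `beta` are illustrative choices rather than a fixed API; the inputs are assumed to be the summed log-probabilities of the chosen and rejected completions under the policy being trained and a frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: log-ratio of the policy to the frozen reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    # Logistic (Bradley-Terry style) loss on the scaled reward margin
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

No reward model and no RL optimizer appears anywhere; the gradient of this loss with respect to the policy parameters is all that is needed.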
The most prominent DAA, and the one that catalyzed an entire academic movement around aligning language models, is Direct Preference Optimization (DPO) [1]. At its core, DPO uses gradient ascent to solve the same constrained RLHF objective (see Chapter 3):