12 Direct alignment algorithms
Direct Alignment Algorithms (DAAs) allow one to update models to solve the same RLHF objective without ever training an intermediate reward model or using reinforcement learning optimizers. The most prominent DAA, and the one that catalyzed an entire academic movement around aligning language models, is Direct Preference Optimization (DPO) [1]. At its core, DPO uses gradient ascent to solve the same constrained RLHF objective. Since its release in May 2023, and after a brief delay while the community figured out the right data and hyperparameters to use with it (specifically, surprisingly low learning rates), many popular models have used DPO or its variants: Zephyr-\(\beta\) kickstarted adoption in October 2023 [2], followed by Llama 3 Instruct [3], Tülu 2 [4] and 3 [5], Nemotron 4 340B [6], and others. Technically, Sequence Likelihood Calibration (SLiC-HF) was released first [7], but it did not catch on due to a combination of luck and effectiveness.
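For reference, the constrained RLHF objective referred to above can be written as follows (the notation here is assumed for illustration: a policy \(\pi_\theta\), a frozen reference policy \(\pi_{\mathrm{ref}}\), a reward \(r\), and a KL penalty weight \(\beta\)):

\[
\max_{\pi_\theta} \; \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)} \left[ r(x, y) \right] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}} \left[ \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \right]
\]

DPO optimizes this objective without ever fitting \(r\) explicitly, by expressing the reward implicitly in terms of the policy and reference model; the derivation is given below.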
The most impactful part of DPO and DAAs is lowering the barrier to entry for experimenting with language model post-training.
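To make that concrete, below is a minimal sketch of the DPO loss over a batch of preference pairs, assuming the summed sequence log-probabilities under the policy and the frozen reference model have already been computed. The function name, signature, and tensor shapes are illustrative rather than taken from any particular library.

```python
import torch
import torch.nn.functional as F

def dpo_loss(
    policy_chosen_logps: torch.Tensor,    # log pi_theta(y_w | x), shape (batch,)
    policy_rejected_logps: torch.Tensor,  # log pi_theta(y_l | x), shape (batch,)
    ref_chosen_logps: torch.Tensor,       # log pi_ref(y_w | x), shape (batch,)
    ref_rejected_logps: torch.Tensor,     # log pi_ref(y_l | x), shape (batch,)
    beta: float = 0.1,                    # KL penalty weight from the RLHF objective
) -> torch.Tensor:
    """Mean DPO loss for a batch of (chosen, rejected) completion pairs."""
    # Log-ratios of the policy to the reference model for each completion.
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # Implicit reward margin between the chosen and rejected completions.
    margin = beta * (chosen_logratio - rejected_logratio)
    # Minimize the negative log-sigmoid of the margin, i.e. push the policy
    # to assign relatively higher likelihood to the chosen completion.
    return -F.logsigmoid(margin).mean()
```

The entire update reduces to standard supervised-style backpropagation through this loss, which is why DAAs are so much easier to run than a full reward-model-plus-RL pipeline.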
12.1 Direct Preference Optimization (DPO)
Here we explain the intuitions for how DPO works and re-derive its core equations in full.