5 Preference Alignment and RAG
This chapter covers
- Reinforcement learning from human feedback (RLHF)
- Direct preference optimization (DPO)
- Group relative policy optimization (GRPO)
- Retrieval-augmented generation (RAG) for factual grounding
As we’ve seen, decoding strategies and prompting techniques can guide a language model’s output at inference time. These methods do not change the model’s underlying parameters or architecture but significantly influence the diversity, fluency, and usefulness of its generated text. In this chapter, we shift focus to techniques that align a language model more directly with user intent, either by training the model to prefer certain outputs through reinforcement learning and preference modeling, or by augmenting its context at inference time with external, up-to-date information.
We begin with preference alignment using Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). These methods guide the model to produce outputs that better reflect human values, task-specific expectations, and reasoning. Then, we cover knowledge alignment via Retrieval-Augmented Generation (RAG), which allows a model to dynamically incorporate factual and domain-specific information at runtime, without changing the model weights.
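To make the preference-alignment idea concrete before we dive in, here is a minimal sketch of the DPO objective, assuming PyTorch; the argument names (such as `policy_chosen_logps`) are illustrative, not a fixed API. The loss rewards the policy for widening the log-probability margin of the preferred response over the rejected one, measured relative to a frozen reference model.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss for a batch of preference pairs.

    Each argument is a tensor of summed log-probabilities that the trainable
    policy (or the frozen reference model) assigns to the chosen / rejected
    response for the same prompt.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps

    # Maximize the margin between chosen and rejected, scaled by beta;
    # logsigmoid keeps the computation numerically stable
    logits = beta * (chosen_logratios - rejected_logratios)
    return -F.logsigmoid(logits).mean()
```

In practice, the log-probabilities come from scoring each (prompt, response) pair with both models; RLHF and GRPO pursue the same alignment goal but rely on an explicit reward signal rather than this closed-form preference loss.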
Together, these techniques form the foundation for controlling, specializing, and grounding large language models in real-world applications.
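RAG follows the same spirit at inference time. The toy sketch below shows the basic loop: retrieve relevant documents, prepend them to the prompt, and let the unchanged model generate from the augmented context. The `retrieve` and `build_rag_prompt` helpers and the word-overlap retriever are illustrative stand-ins, not a production retrieval pipeline.

```python
def retrieve(query, documents, k=2):
    """Toy lexical retriever: rank documents by word overlap with the query."""
    query_terms = set(query.lower().split())
    scored = sorted(documents,
                    key=lambda d: len(query_terms & set(d.lower().split())),
                    reverse=True)
    return scored[:k]

def build_rag_prompt(query, documents):
    """Augment the user query with retrieved context before generation."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents))
    return (f"Answer the question using only the context below.\n"
            f"Context:\n{context}\n\n"
            f"Question: {query}\nAnswer:")

docs = [
    "The Eiffel Tower was completed in 1889 for the World's Fair in Paris.",
    "DPO fine-tunes a language model directly on preference pairs.",
]
print(build_rag_prompt("When was the Eiffel Tower completed?", docs))
# The resulting prompt is then sent to the language model, whose weights
# remain unchanged; only the context it conditions on is enriched.
```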