5 Preference Alignment and RAG


This chapter covers

  • Reinforcement learning from human feedback (RLHF)
  • Direct preference optimization (DPO)
  • Group relative policy optimization (GRPO)
  • Retrieval-augmented generation (RAG) for factual grounding

As we’ve seen, decoding strategies and prompting techniques can guide a language model’s output at inference time. These methods do not change the model’s underlying parameters or architecture, yet they significantly influence the diversity, fluency, and usefulness of its generated text. In this chapter, we shift focus to techniques that align a language model more directly with user intent, either by training the model to prefer certain outputs through reinforcement learning and preference modeling, or by augmenting its context at inference time with external, up-to-date information.

We begin with preference alignment using Reinforcement Learning from Human Feedback (RLHF), Direct Preference Optimization (DPO), and Group Relative Policy Optimization (GRPO). These methods guide the model to produce outputs that better reflect human values, task-specific expectations, and desired reasoning. Then, we cover knowledge alignment via Retrieval-Augmented Generation (RAG), which allows a model to dynamically incorporate factual and domain-specific information at runtime, without changing the model weights.
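As a preview of what section 5.2 develops in detail, the DPO objective reduces to a log-sigmoid loss over preference pairs: it rewards the policy for preferring the chosen response over the rejected one more strongly than a frozen reference model does. A minimal sketch, assuming summed log-probabilities per response (the function name and toy values are illustrative, not from a specific library):

```python
import math

def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair.

    Inputs are the summed log-probabilities of the chosen and rejected
    responses under the policy being trained and under a frozen
    reference model; beta controls how far the policy may drift
    from the reference.
    """
    # Implicit reward margin: how much more the policy favors the
    # chosen response (relative to the reference) than it favors
    # the rejected response.
    margin = ((policy_logp_chosen - ref_logp_chosen)
              - (policy_logp_rejected - ref_logp_rejected))
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# If the policy has not moved from the reference, the margin is 0
# and the loss is -log(0.5) ≈ 0.693; widening the margin lowers it.
print(round(dpo_loss(-10.0, -12.0, -10.0, -12.0), 3))
```

Minimizing this loss pushes the policy toward chosen responses while the reference term keeps it anchored near its starting point, which is why no separate reward model or RL loop is needed.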

Together, these techniques form the foundation for controlling, specializing, and grounding large language models in real-world applications.

5.1 Reinforcement learning from human feedback

5.1.1 From Markov decision processes to reinforcement learning

5.1.2 Improving models with human feedback and reinforcement learning

5.2 Aligning LLMs with direct preference optimization

5.2.1 The SFT step

5.2.2 Training the LLM with DPO

5.2.3 Running the inference on the trained LLM

5.2.4 Optimized versions for DPO

5.2.5 Group Relative Policy Optimization (GRPO)

5.3 MixEval: A benchmark for robust and cost-efficient evaluation

5.4 Retrieval-augmented generation (RAG)

5.4.1 A first look at RAG

5.4.2 Why and when to use RAG

5.4.3 Core components and design choices

5.5 Summary