4 Retrieval-augmented generation for knowledge tasks
This chapter covers
- How RAG combines pretrained retrieval and generation components
- Using multiple documents with top-k retrieval
- Choosing between RAG-Sequence and RAG-Token
- Building a complete, working RAG pipeline
- Why RAG became the standard for enterprise AI
Most developers today are familiar with naive RAG: retrieve a few documents, insert them into a prompt, and hope the LLM generates the correct answer. But the original 2020 paper by Patrick Lewis et al. proposed a much more robust approach. They introduced canonical RAG, a probabilistic framework that synthesizes answers by weighing evidence from multiple sources. To build reliable systems today, it’s important to understand this distinction.
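The difference can be made concrete. In canonical RAG, the retriever induces a distribution p(z | x) over documents z given the query x, and the generator's probability p(y | x, z) of an answer y is marginalized over the top-k retrieved documents: p(y | x) ≈ Σ_z p(z | x) · p(y | x, z), rather than stuffing all documents into one prompt and hoping. The sketch below illustrates only the arithmetic of that weighting; every name, retrieval score, and generator probability in it is a fabricated stand-in, not output from any real retriever or model.

```python
import math

def retrieve(query, k=2):
    """Stand-in retriever: returns (doc_id, score) pairs. A real system
    would score passages against the query with dense embeddings."""
    fake_scores = {"doc_a": 9.0, "doc_b": 7.0, "doc_c": 1.0}  # fabricated relevance logits
    return sorted(fake_scores.items(), key=lambda kv: -kv[1])[:k]

def doc_posterior(scored_docs):
    """Softmax over retrieval scores -> the document distribution p(z | x)."""
    m = max(score for _, score in scored_docs)
    exps = [(doc, math.exp(score - m)) for doc, score in scored_docs]
    total = sum(e for _, e in exps)
    return [(doc, e / total) for doc, e in exps]

def gen_prob(answer, query, doc_id):
    """Stand-in for p(y | x, z): the probability a generator would assign
    to the answer given the query and ONE retrieved document."""
    fake = {("doc_a", "Paris"): 0.9, ("doc_b", "Paris"): 0.8, ("doc_c", "Paris"): 0.3}
    return fake.get((doc_id, answer), 0.1)

def rag_sequence_prob(answer, query, k=2):
    """Canonical RAG, RAG-Sequence style: marginalize the per-document
    generator probability over the top-k documents,
        p(y | x) ~= sum_z p(z | x) * p(y | x, z)
    so an answer supported by several well-retrieved documents wins even
    if no single document makes it certain."""
    posterior = doc_posterior(retrieve(query, k))
    return sum(p_z * gen_prob(answer, query, doc) for doc, p_z in posterior)

p = rag_sequence_prob("Paris", "Where is the Eiffel Tower?")
print(f"p(Paris | query) ~= {p:.3f}")
```

With the fabricated numbers above, "doc_a" dominates the posterior, so the answer's overall probability lands close to the generator's score under that document while still drawing some support from "doc_b". That evidence-weighing step is exactly what naive prompt-stuffing skips.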
Published in May 2020, just three months after REALM demonstrated end-to-end retrieval training, the paper by Lewis and his colleagues at Facebook AI, UCL, and NYU asked whether a modular system, one combining pretrained dense retrieval with pretrained generation, could outperform both massive parametric models (like T5-11B) and end-to-end trained systems (like REALM) on knowledge-intensive tasks.