9 Self-RAG: Retrieval with reflection and self-critique
This chapter covers
- Moving from passive retrieval to active reasoning
- Using reflection tokens for self-critique
- Training generators to emulate proprietary critics
- Controlling model behavior at inference time
By late 2023, RAG had established itself as the dominant pattern for working around the static knowledge baked into LLMs. The original RAG paper (Lewis et al., 2020) and Fusion-in-Decoder (Izacard and Grave, 2021) provided the mechanics for connecting parametric memory (the model's weights) with non-parametric memory (external vector databases). As practitioners pushed these systems from research prototypes into production, a recurring set of limitations, often called the "Naive RAG" bottlenecks, surfaced. The standard architecture was passive: it ran a vector search for every user query, whether or not the query needed retrieved context.
That indiscriminate approach kept producing the same families of failures from the Barnett et al. taxonomy in chapter 1 (table 1.1): wasted compute on queries that didn't need retrieval, polluted context windows that crowded out the answer, and ungrounded hallucinations like the one behind the Air Canada incident. We’ll walk through each failure mode and map it onto the specific failure points Self-RAG was designed to neutralize.