6 Atlas: Few-shot learning with retrieval augmentation

This chapter covers

The trade-offs between internal and external knowledge
How smaller models can outperform larger ones using external knowledge
How to train the retriever based on the reader's performance
The Perplexity Distillation (PDist) algorithm for joint training
How Atlas achieved state-of-the-art few-shot learning performance

In the early 2020s, natural language processing (NLP) was largely defined by scaling laws. Research from institutions such as OpenAI and DeepMind showed that increasing a model's parameter count, along with the volume of its training data, led to the emergence of impressive new capabilities, particularly in few-shot learning. Few-shot learning, or more generally N-shot learning, refers to a setup in which a model performs a new task from just N worked examples placed directly in its prompt. With N=0 (zero-shot), the prompt contains only an instruction, e.g., "Translate to French: hello." With N=2, the prompt first shows two solved examples ("English: hello → French: bonjour. English: dog → French: chien.") and then asks for the next answer.
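To make the N-shot setup concrete, here is a minimal sketch of how such prompts could be assembled as plain strings. The helper name `build_n_shot_prompt` and its parameters are hypothetical, chosen only for illustration. The prompt layout mirrors the translation example above and is an assumption, not a format required by any particular model.

```python
from typing import List, Tuple


def build_n_shot_prompt(
    examples: List[Tuple[str, str]],
    query: str,
    n: int,
    instruction: str = "Translate to French:",
) -> str:
    """Build an N-shot translation prompt (hypothetical helper for illustration)."""
    if n < 0 or n > len(examples):
        raise ValueError("n must be between 0 and len(examples)")
    if n == 0:
        # Zero-shot: only the instruction and the input, no solved examples.
        return f"{instruction} {query}"
    # N-shot: the first n solved examples, then the unsolved query in the same format.
    shots = [f"English: {src} → French: {tgt}." for src, tgt in examples[:n]]
    return " ".join(shots) + f" English: {query} → French:"


# Assumed toy data, taken from the example in the text.
examples = [("hello", "bonjour"), ("dog", "chien")]

zero_shot = build_n_shot_prompt(examples, "hello", n=0)
two_shot = build_n_shot_prompt(examples, "cat", n=2)

print(zero_shot)
print(two_shot)
```

The model's weights stay fixed in both cases. The only thing that changes between the zero-shot and two-shot calls is how many solved examples precede the final, unanswered query.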

6.1 The parametric vs. non-parametric knowledge tradeoff

6.2 Atlas pre-training methodology

6.2.1 Joint training without labels

6.2.2 A deep dive into the training objectives

6.2.3 Code in practice

6.3 Dynamic knowledge retrieval

6.3.1 Head-to-head

6.3.2 The temporal sensitivity experiment

6.4 Efficient knowledge base updating

6.4.1 Three strategies for efficient updates

6.4.2 Compressing the index

6.5 Few-shot learning strategies for RAG systems

6.6 Case study: Enterprise knowledge adaptation

6.7 Summary