13 Context compression and pruning
This chapter covers
- Why long context hurts
- Compressing prompts with contrastive perplexity and reordering documents
- Pruning context using native attention
- Choosing extractive or abstractive methods
- Navigating the compression tradeoff
Let's say you retrieve the correct document. The generator receives the answer in its context window but still fails to answer the user's query. This is failure point 5, an extraction failure: the information exists in the prompt, but the model does not use it. A common cause is that long context dilutes attention. Liu et al. (2023; https://arxiv.org/abs/2307.03172) found that GPT-3.5-Turbo's multi-document QA accuracy fell by more than 20 points when the relevant passage sat in the middle of a 20-document context rather than at the start, even though every prompt contained the same information. Accuracy traced a U-shaped curve, highest when the passage sat at the start or end of the context, and the same curve held for explicitly long-context models. We met this "lost in the middle" effect in chapters 5, 8, and 10; this chapter is where we address it.
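To make the experimental setup concrete, here is a minimal sketch of how such a position sweep could be constructed: the same 20 passages appear in every prompt, and only the slot holding the answer-bearing passage changes. The function name `build_prompt`, the prompt layout, and the filler passages are illustrative assumptions, not the format used by Liu et al.

```python
def build_prompt(question, gold, distractors, gold_position):
    """Place the answer-bearing passage at a 0-based position among distractors."""
    docs = list(distractors)          # copy so the caller's list is not mutated
    docs.insert(gold_position, gold)  # only the position of the gold passage varies
    numbered = [f"Document [{i + 1}] {doc}" for i, doc in enumerate(docs)]
    return "\n".join(numbered) + f"\n\nQuestion: {question}\nAnswer:"


# Hypothetical data: 19 filler passages plus one passage containing the answer.
distractors = [f"Filler passage {i}." for i in range(19)]
gold = "The Eiffel Tower was completed in 1889."
question = "When was the Eiffel Tower completed?"

# Start, middle, and end of a 20-document context.
prompts = {pos: build_prompt(question, gold, distractors, pos) for pos in (0, 9, 19)}
```

Sending each prompt to the same model and scoring the answers would give one accuracy value per position; plotting those values against position is what reveals the U-shaped curve.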
Sending long contexts to an API raises cost, adds latency, and can degrade reasoning quality. In RAG the problem is acute because retrieved documents often bury the relevant information inside a large amount of irrelevant text, so you pay more for worse answers. We address this by reducing the context after retrieval and before generation. This area is called context compression.
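As a toy illustration of what "reducing the context after retrieval" means, the sketch below keeps only the sentences of a retrieved document that share a content word with the query. It is a deliberately naive keyword filter, not one of the methods this chapter develops; the function name `keep_relevant_sentences`, the stopword list, and the sample text are all assumptions made for the example.

```python
import re

# Tiny, hand-picked stopword list (an assumption for this example only).
STOPWORDS = {"a", "an", "the", "is", "was", "in", "of", "when", "what", "who"}


def keep_relevant_sentences(query, document, min_overlap=1):
    """Keep sentences that share at least min_overlap content words with the query."""
    query_terms = set(re.findall(r"\w+", query.lower())) - STOPWORDS
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    kept = []
    for sentence in sentences:
        sentence_terms = set(re.findall(r"\w+", sentence.lower()))
        if len(query_terms & sentence_terms) >= min_overlap:
            kept.append(sentence)
    return " ".join(kept)


document = (
    "Paris is the capital of France. "
    "The Eiffel Tower was completed in 1889. "
    "Many tourists visit the city each year."
)
query = "When was the Eiffel Tower completed?"
compressed = keep_relevant_sentences(query, document)
```

Here only the second sentence survives, so the generator would see a shorter prompt that still contains the answer. Keyword overlap misses paraphrases and synonyms, which is one reason more careful compression methods are needed.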