chapter thirteen

13 Context compression and pruning

This chapter covers

Why long context hurts
Compressing prompts with contrastive perplexity and reordering documents
Pruning context using native attention
Choosing extractive or abstractive methods
Navigating the compression tradeoff

Suppose you retrieve the correct document. The generator receives the answer in its context window but still fails to answer the user's query. This is failure point 5, an extraction failure: the information exists in the prompt, but the model ignores it. A major cause is that long context dilutes attention. Liu et al. (2023; https://arxiv.org/abs/2307.03172) found that GPT-3.5's multi-document QA accuracy fell by more than 20 points when the relevant passage sat in the middle of a 20-document context rather than at the start, even though every prompt contained the same information. The same U-shaped curve held for explicitly long-context models. We met this "lost in the middle" effect in chapters 5, 8, and 10; this chapter is where we address it.
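To make the position effect concrete, here is a minimal sketch of how you might build the prompts for such a position sweep yourself. It only assembles contexts with the relevant passage at a chosen index; the helper name build_context, the "Document [n]" labeling, and the toy passages are illustrative assumptions, not the exact protocol of Liu et al.

```python
def build_context(gold: str, distractors: list[str], gold_position: int) -> str:
    """Insert the gold passage at gold_position among the distractors.

    Hypothetical helper for illustration; the document labeling format
    is an assumption, not taken from the original study.
    """
    if not 0 <= gold_position <= len(distractors):
        raise ValueError("gold_position out of range")
    docs = distractors[:gold_position] + [gold] + distractors[gold_position:]
    return "\n\n".join(f"Document [{i + 1}]: {d}" for i, d in enumerate(docs))


# Toy data: 19 filler passages plus one passage that holds the answer.
distractors = [f"Filler passage number {i}." for i in range(19)]
gold = "The capital of Australia is Canberra."

# Same information in every context; only the gold position changes
# (start, middle, end of a 20-document context).
contexts = {pos: build_context(gold, distractors, pos) for pos in (0, 9, 19)}
```

Sending each context with the same question to your generator and comparing accuracy across positions is one way to check whether your own model shows the same dip.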

Long contexts are also expensive: sending them to an API raises cost, adds latency, and degrades reasoning quality. In RAG, retrieved documents often bury the relevant information inside a large amount of irrelevant text, so you pay more for worse answers. We address this by reducing the context after retrieval, an area called context compression.
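As a first intuition for what "reducing the context after retrieval" means, the sketch below keeps only sentences that share a term with the query, up to a word budget. This is a deliberately naive lexical-overlap filter, not one of the methods covered later in the chapter; the function name compress_context, the max_words parameter, and the three-character term cutoff are all assumptions made for illustration.

```python
import re


def _terms(text: str) -> set[str]:
    # Lowercased word tokens; dropping very short tokens is a crude
    # stand-in for stopword removal (an assumption of this sketch).
    return {t for t in re.findall(r"\w+", text.lower()) if len(t) > 2}


def compress_context(query: str, documents: list[str], max_words: int) -> str:
    """Keep query-overlapping sentences, in original order, within a budget."""
    query_terms = _terms(query)
    scored = []
    for doc_idx, doc in enumerate(documents):
        sentences = re.split(r"(?<=[.!?])\s+", doc.strip())
        for sent_idx, sent in enumerate(sentences):
            if sent:
                overlap = len(_terms(sent) & query_terms)
                scored.append((overlap, doc_idx, sent_idx, sent))

    kept, used = [], 0
    # Visit the highest-overlap sentences first so they claim the budget.
    for overlap, doc_idx, sent_idx, sent in sorted(scored, key=lambda s: -s[0]):
        n_words = len(sent.split())
        if overlap == 0 or used + n_words > max_words:
            continue
        kept.append((doc_idx, sent_idx, sent))
        used += n_words

    kept.sort()  # restore original document and sentence order
    return " ".join(sent for _, _, sent in kept)


docs = [
    "The office opened in 1998. Refunds are issued within 30 days of purchase. "
    "Our mascot is a heron.",
    "Shipping is free over $50. Contact support to request a refund.",
]
compressed = compress_context("How many days do I have to get a refund?", docs, 20)
```

A word-overlap filter like this misses paraphrases and has no notion of what the generator actually finds useful, which is part of the motivation for compressors that score relevance with a model instead.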

13.1 The costs of long context

13.1.1 Evolution of context compression

13.1.2 Types of compression

13.2 Perplexity-based compression with LongLLMLingua

13.2.1 Why absolute perplexity fails

13.2.2 Contrastive perplexity as a relevance signal

13.2.3 The LongLLMLingua pipeline

13.2.4 Implementing a prompt compressor with LLMLingua

13.3 Native pruning with AttentionRAG

13.3.1 Beyond all-layer aggregation: Evaluator heads

13.4 Extensions: The RECOMP framework

13.4.1 CORE and multi-hop synthesis

13.5 Implementing a query-aware compressor

13.5.1 Recursive retrieval and index nodes

13.6 Case study: Legal document synthesis

13.7 Tradeoffs: Compression ratio vs. semantic loss

13.8 Future directions and adjacent research

13.8.1 Gist tokens and trained prompt compression

13.8.2 KV-cache eviction

13.8.3 Multi-modal retrieval

13.8.4 Hardware-aware budgeting

13.8.5 Agentic self-compression

13.9 Summary