This chapter covers
- Retrievers and retrieval methodologies
- Augmentation using prompt engineering techniques
- Generation using LLMs
- Basic implementation of the RAG pipeline in Python
In chapter 3, we discussed how the indexing pipeline creates the knowledge base, the non-parametric memory of retrieval-augmented generation (RAG) applications. To use this knowledge base to produce accurate, contextual responses, we need a generation pipeline consisting of three steps: retrieval, augmentation, and generation.
This chapter elaborates on the three components of the generation pipeline. We begin with the retrieval process, which searches the embeddings stored in the knowledge base's vector database and returns a list of documents that closely match the user's input query. You will also learn about the concept of retrievers and a few retrieval algorithms. Next, we move to the augmentation step, where we examine prompt engineering frameworks commonly used with RAG to combine the retrieved context with the user's query. Finally, as part of the generation step, we discuss key choices in selecting an LLM, such as foundation models versus supervised fine-tuned models, models of different sizes, and open source versus proprietary models in the RAG context. For each step, we also highlight the benefits and drawbacks of different methods.
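Before diving into each component, the three steps can be sketched end to end. The following is a minimal, self-contained illustration, not the implementation developed in this book: it stands in for real embeddings with term-frequency vectors, ranks documents by cosine similarity, fills a simple prompt template, and stubs out the LLM call. The document texts, the `embed`/`retrieve`/`augment`/`generate` names, and the prompt wording are all hypothetical choices for this sketch.

```python
import math
from collections import Counter

# Toy knowledge base. In a real application, documents and their
# embeddings live in a vector database built by the indexing pipeline.
DOCUMENTS = [
    "RAG combines retrieval with generation to ground LLM answers.",
    "The indexing pipeline chunks documents and stores embeddings.",
    "Prompt engineering shapes how context is presented to the LLM.",
]

def embed(text):
    """Stand-in embedding: a bag-of-words term-frequency vector.
    A real system would call an embedding model instead."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    """Step 1, retrieval: return the k documents closest to the query."""
    q = embed(query)
    ranked = sorted(DOCUMENTS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def augment(query, context):
    """Step 2, augmentation: inject retrieved context into a prompt template."""
    joined = "\n".join(f"- {doc}" for doc in context)
    return (f"Answer using only the context below.\n\n"
            f"Context:\n{joined}\n\nQuestion: {query}")

def generate(prompt):
    """Step 3, generation: placeholder for an actual LLM API call."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

query = "How does the indexing pipeline work?"
answer = generate(augment(query, retrieve(query)))
```

Each placeholder maps onto a component discussed in this chapter: `retrieve` is replaced by a retriever over a vector database, `augment` by a prompt engineering framework, and `generate` by a call to a chosen LLM.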