3 Indexing pipeline: Creating a knowledge base for RAG


This chapter covers

  • Data loading
  • Text splitting or chunking
  • Converting text to embeddings
  • Storing embeddings in vector databases
  • Examples in Python using LangChain

In chapter 2, we discussed the main components of retrieval-augmented generation (RAG) systems. You may recall that the indexing pipeline creates the knowledge base, or non-parametric memory, of a RAG application. The indexing pipeline must be set up before real-time user interaction with the large language model (LLM) can begin.

This chapter elaborates on the four components of the indexing pipeline. We begin with data loading, which involves connecting to the source, extracting files, and parsing text. At this stage, we introduce LangChain, a framework that has become increasingly popular among LLM application developers. Next, we explain why data splitting, or chunking, is needed and discuss chunking strategies. We then turn to embeddings, an important design pattern in AI and machine learning, and examine how they are used in the RAG context. Finally, we look at a relatively new storage technique, vector storage, and the databases that facilitate it.
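
To make the pipeline concrete before we examine each stage in detail, here is a minimal sketch of all four steps in LangChain. The file name, chunk sizes, and the specific components shown (TextLoader, RecursiveCharacterTextSplitter, OpenAIEmbeddings, and a FAISS vector store) are illustrative choices rather than requirements; the sketch assumes a recent LangChain release with the langchain-community, langchain-text-splitters, langchain-openai, and faiss-cpu packages installed and an OpenAI API key available in the environment.

from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Load: read a source file into LangChain Document objects
loader = TextLoader("knowledge_base.txt")   # hypothetical file path
documents = loader.load()

# 2. Split: break the documents into overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# 3. Convert: embed each chunk; 4. Store: index the vectors in FAISS
embeddings = OpenAIEmbeddings()             # requires OPENAI_API_KEY to be set
vector_store = FAISS.from_documents(chunks, embeddings)

# The retrieval pipeline can later query this knowledge base
results = vector_store.similarity_search("What is an indexing pipeline?", k=3)
print(results[0].page_content)

In practice, each of these components would be chosen or tuned based on the considerations discussed in sections 3.1 through 3.4.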

3.1 Data loading

3.2 Data splitting (chunking)

3.2.1 Advantages of chunking

3.2.2 Chunking process

3.2.3 Chunking methods

3.2.4 Choosing a chunking strategy

3.3 Data conversion (embeddings)

3.3.1 What are embeddings?

3.3.2 Common pre-trained embeddings models

3.3.3 Embeddings use cases

3.3.4 How to choose embeddings?

3.4 Storage (vector databases)

3.4.1 What are vector databases?

3.4.2 Types of vector databases

3.4.3 Choosing a vector database

Summary

3.1 Data loading