
3 Indexing Pipeline: Creating a Knowledge Base for RAG-based Applications


This chapter covers

  • The four components of the Indexing Pipeline
  • Data Loading
  • Text Splitting or Chunking
  • Converting Text to Embeddings
  • Storing Embeddings in Vector Databases
  • Examples in Python using LangChain

In Chapter 2, we discussed the main components of RAG-based system design. You may recall that the Indexing Pipeline creates the knowledge base, or non-parametric memory, of a RAG-based application. The Indexing Pipeline needs to be set up before real-time user interaction with the LLM can begin.

In this chapter, we will elaborate on the four components of the Indexing Pipeline. We will begin with Data Loading, which involves connecting to a source, extracting files, and parsing text. At this stage, we will introduce LangChain, a framework that is fast becoming popular in the LLM app developer community. We will then explain the need for data splitting, or chunking, and discuss common chunking strategies. Embeddings are an important design pattern in the world of AI and ML; we will describe what embeddings are and how they are relevant in the context of RAG. Finally, we will look at vector storage and the databases that facilitate it.
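To make the four stages concrete before we examine each one, here is a minimal end-to-end sketch in Python using LangChain. Treat it as illustrative rather than definitive: the file name sample.pdf and the persist directory are placeholders, the import paths reflect classic LangChain (newer releases move loaders and vector stores into the langchain_community package), and OpenAIEmbeddings assumes the openai package and an OPENAI_API_KEY environment variable; the Chroma store additionally requires the chromadb and pypdf packages.

# A minimal indexing pipeline: load -> split -> embed -> store.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# 1. Data Loading: read the source file into Document objects.
documents = PyPDFLoader("sample.pdf").load()   # "sample.pdf" is a placeholder

# 2. Data Splitting: break the documents into overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 3. Data Conversion: choose an embeddings model (needs OPENAI_API_KEY).
embeddings = OpenAIEmbeddings()

# 4. Storage: embed the chunks and persist them in a vector database.
db = Chroma.from_documents(chunks, embeddings, persist_directory="./index")

The retrieval side of the application can later reopen this store and run a similarity search to fetch the chunks most relevant to a user query; we cover that pipeline in a later chapter.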

3.1 Data Loading
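As a preview of this section, the sketch below loads a PDF with LangChain's PyPDFLoader and inspects the resulting Document objects. It assumes the langchain and pypdf packages are installed; report.pdf is a hypothetical file name.

from langchain.document_loaders import PyPDFLoader

# Each page of the PDF becomes one Document with text plus metadata.
loader = PyPDFLoader("report.pdf")   # hypothetical file name
documents = loader.load()

first = documents[0]
print(first.page_content[:200])   # the parsed text of page 1
print(first.metadata)             # e.g. {'source': 'report.pdf', 'page': 0}

LangChain ships many such loaders behind the same load() interface, so switching from a PDF to, say, a web page or a CSV file changes only the loader class.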

3.2 Data Splitting (Chunking)

3.2.1 Advantages of Chunking

3.2.2 Chunking Process

3.2.3 Chunking Methods
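To preview two of the methods this section discusses, the sketch below contrasts a fixed-size character splitter with LangChain's recursive splitter, which tries paragraph, line, and word boundaries before cutting mid-word. The chunk sizes are arbitrary choices for illustration, and report.txt is a hypothetical source file.

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

text = open("report.txt").read()   # hypothetical source text

# Fixed-size chunking: split on one separator, then pack pieces up to chunk_size.
fixed = CharacterTextSplitter(separator="\n\n", chunk_size=500, chunk_overlap=50)

# Recursive chunking: try "\n\n", then "\n", then " " before splitting mid-word.
recursive = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

print(len(fixed.split_text(text)), len(recursive.split_text(text)))

The chunk_overlap parameter repeats a small amount of text between neighboring chunks so that a sentence falling on a boundary still appears whole in at least one chunk.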

3.2.4 Choice of Chunking Strategy

3.3 Data Conversion (Embeddings)

3.3.1 What Are Embeddings?

3.3.2 Common Pre-trained Embeddings Models
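The sketch below exercises one of the pre-trained options discussed here through LangChain's common embeddings interface. The model name is one example among many; it is a small open-source sentence-transformers model that runs locally and requires the sentence-transformers package.

from langchain.embeddings import HuggingFaceEmbeddings

# A small open-source model that runs locally (no API key needed).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector = embeddings.embed_query("What is an indexing pipeline?")
print(len(vector))        # 384 dimensions for this model

vectors = embeddings.embed_documents(["chunk one", "chunk two"])
print(len(vectors))       # one vector per input chunk

Because every embeddings class exposes the same embed_query and embed_documents methods, swapping in a hosted model such as OpenAIEmbeddings (a 1,536-dimension model at the time of writing) changes only the one line that constructs the model.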

3.3.3 Embeddings Use Cases

3.3.4 How to Choose Embeddings?

3.4 Storage (Vector Databases)

3.4.1 What Are Vector Databases?
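As a minimal illustration of the idea, the sketch below indexes a few toy chunks in FAISS, an in-memory vector store that LangChain wraps, and runs a nearest-neighbor search. It assumes the faiss-cpu and sentence-transformers packages are installed.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

chunks = [
    "Chunking splits documents into smaller pieces.",
    "Embeddings map text to vectors of numbers.",
    "Vector databases index embeddings for fast similarity search.",
]

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_texts(chunks, embeddings)

# Retrieval is a nearest-neighbor lookup in embedding space;
# with FAISS's default L2 distance, a lower score means a closer match.
for doc, score in db.similarity_search_with_score("how are vectors stored?", k=2):
    print(round(score, 3), doc.page_content)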

3.4.2 Types of Vector Databases

3.4.3 Choosing a Vector Database

3.5 Summary