
3 Indexing Pipeline: Creating a Knowledge Base for RAG-based Applications


This chapter covers

  • The four components of the Indexing Pipeline
  • Data Loading
  • Text Splitting or Chunking
  • Converting Text to Embeddings
  • Storing Embeddings in Vector Databases
  • Examples in Python using LangChain

In Chapter 2, we discussed the main components of RAG-based system design. You may recall that the Indexing Pipeline creates the knowledge base, or non-parametric memory, of a RAG-based application. The Indexing Pipeline needs to be set up before real-time user interaction with the LLM can begin.

In this chapter, we will elaborate on the four components of the Indexing Pipeline. We will begin with Data Loading, which involves connecting to a source, extracting files, and parsing text. At this stage, we will introduce LangChain, a framework that is fast becoming popular in the LLM app developer community. We will then explain the need for data splitting, or chunking, and discuss common chunking strategies. Embeddings are an important design pattern in the world of AI and ML; we will describe what embeddings are and how they are relevant in the context of RAG. Finally, we will look at vector storage and the databases that facilitate it.
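To make the four stages concrete before we examine each one, here is a minimal end-to-end sketch in Python using LangChain. Treat it as illustrative rather than definitive: the file name sample.pdf and the persist directory are placeholders, the import paths reflect classic LangChain (newer releases move loaders and vector stores into the langchain_community package), and OpenAIEmbeddings assumes the openai package and an OPENAI_API_KEY environment variable; the Chroma store additionally requires the chromadb and pypdf packages.

# A minimal indexing pipeline: load -> split -> embed -> store.
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma

# 1. Data Loading: read the source file into Document objects.
documents = PyPDFLoader("sample.pdf").load()   # "sample.pdf" is a placeholder

# 2. Data Splitting: break the documents into overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(documents)

# 3. Data Conversion: choose an embeddings model (needs OPENAI_API_KEY).
embeddings = OpenAIEmbeddings()

# 4. Storage: embed the chunks and persist them in a vector database.
db = Chroma.from_documents(chunks, embeddings, persist_directory="./index")

The retrieval side of the application can later reopen this store and run a similarity search to fetch the chunks most relevant to a user query; we cover that pipeline in a later chapter.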

3.1 Data Loading
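As a preview of this section, the sketch below loads a PDF with LangChain's PyPDFLoader and inspects the resulting Document objects. It assumes the langchain and pypdf packages are installed; report.pdf is a hypothetical file name.

from langchain.document_loaders import PyPDFLoader

# Each page of the PDF becomes one Document with text plus metadata.
loader = PyPDFLoader("report.pdf")   # hypothetical file name
documents = loader.load()

first = documents[0]
print(first.page_content[:200])   # the parsed text of page 1
print(first.metadata)             # e.g. {'source': 'report.pdf', 'page': 0}

LangChain ships many such loaders behind the same load() interface, so switching from a PDF to, say, a web page or a CSV file changes only the loader class.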

3.2 Data Splitting (Chunking)

3.2.1 Advantages of Chunking

3.2.2 Chunking Process

3.2.3 Chunking Methods
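To preview two of the methods this section discusses, the sketch below contrasts a fixed-size character splitter with LangChain's recursive splitter, which tries paragraph, line, and word boundaries before cutting mid-word. The chunk sizes are arbitrary choices for illustration, and report.txt is a hypothetical source file.

from langchain.text_splitter import CharacterTextSplitter, RecursiveCharacterTextSplitter

text = open("report.txt").read()   # hypothetical source text

# Fixed-size chunking: split on one separator, then pack pieces up to chunk_size.
fixed = CharacterTextSplitter(separator="\n\n", chunk_size=500, chunk_overlap=50)

# Recursive chunking: try "\n\n", then "\n", then " " before splitting mid-word.
recursive = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)

print(len(fixed.split_text(text)), len(recursive.split_text(text)))

The chunk_overlap parameter repeats a small amount of text between neighboring chunks so that a sentence falling on a boundary still appears whole in at least one chunk.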

3.2.4 Choice of Chunking Strategy

3.3 Data Conversion (Embeddings)

3.3.1 What Are Embeddings?

3.3.2 Common Pre-trained Embeddings Models
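The sketch below exercises one of the pre-trained options discussed here through LangChain's common embeddings interface. The model name is one example among many; it is a small open-source sentence-transformers model that runs locally and requires the sentence-transformers package.

from langchain.embeddings import HuggingFaceEmbeddings

# A small open-source model that runs locally (no API key needed).
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

vector = embeddings.embed_query("What is an indexing pipeline?")
print(len(vector))        # 384 dimensions for this model

vectors = embeddings.embed_documents(["chunk one", "chunk two"])
print(len(vectors))       # one vector per input chunk

Because every embeddings class exposes the same embed_query and embed_documents methods, swapping in a hosted model such as OpenAIEmbeddings (a 1,536-dimension model at the time of writing) changes only the one line that constructs the model.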

3.3.3 Embeddings Use Cases

3.3.4 How to Choose Embeddings?

3.4 Storage (Vector Databases)

3.4.1 What Are Vector Databases?
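As a minimal illustration of the idea, the sketch below indexes a few toy chunks in FAISS, an in-memory vector store that LangChain wraps, and runs a nearest-neighbor search. It assumes the faiss-cpu and sentence-transformers packages are installed.

from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

chunks = [
    "Chunking splits documents into smaller pieces.",
    "Embeddings map text to vectors of numbers.",
    "Vector databases index embeddings for fast similarity search.",
]

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
db = FAISS.from_texts(chunks, embeddings)

# Retrieval is a nearest-neighbor lookup in embedding space;
# with FAISS's default L2 distance, a lower score means a closer match.
for doc, score in db.similarity_search_with_score("how are vectors stored?", k=2):
    print(round(score, 3), doc.page_content)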

3.4.2 Types of Vector Databases

3.4.3 Choosing a Vector Database

3.5 Summary