3 Indexing Pipeline: Creating a Knowledge Base for RAG-based Applications
This chapter covers
- The four components of the Indexing Pipeline
- Data Loading
- Text Splitting, or Chunking
- Converting Text to Embeddings
- Storing Embeddings in Vector Databases
- Examples in Python using LangChain
In Chapter 2, we discussed the main components of RAG-based system design. You may recall that the Indexing Pipeline creates the knowledge base, or the non-parametric memory, of a RAG-based application. The Indexing Pipeline must be set up before real-time user interaction with the LLM can begin.
In this chapter, we will elaborate on the four components of the Indexing Pipeline. We will begin with Data Loading, which involves connecting to a source, extracting files, and parsing text. At this stage, we will introduce LangChain, a framework that is fast becoming popular in the LLM app developer community. We will then explain the need for text splitting, or chunking, and discuss chunking strategies. Embeddings are an important design pattern in the world of AI and ML; we will detail what they are and how they are relevant in the context of RAG. Finally, we will look at a newer storage technique called Vector Storage and the databases that facilitate it.
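Before we dive into each component, it may help to see how the four stages connect. The following is a minimal, dependency-free Python sketch of an indexing pipeline. Every piece here is a toy stand-in written for illustration: the hardcoded documents, the character-window splitter, the word-hashing `embed` function, and the hypothetical `InMemoryVectorStore` class are assumptions of this sketch, not LangChain APIs. In the rest of the chapter, each stage is replaced with a real component (a LangChain document loader, a text splitter, an embedding model, and a vector database).

```python
import math

def load_documents():
    # Data Loading: a real pipeline would connect to a source and parse
    # files; hardcoded strings stand in for parsed documents here.
    return ["RAG combines retrieval with generation. "
            "The indexing pipeline builds the knowledge base."]

def split_text(text, chunk_size=60, overlap=10):
    # Text Splitting (chunking): fixed-size character windows with a
    # small overlap so no sentence is lost at a chunk boundary.
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

def embed(text, dim=16):
    # Embeddings: a toy stand-in that buckets words into a fixed-size,
    # L2-normalized vector. A real pipeline calls an embedding model.
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(word.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class InMemoryVectorStore:
    # Vector Storage: keeps (chunk, embedding) pairs and answers
    # similarity queries; real systems use a vector database.
    def __init__(self):
        self.entries = []

    def add(self, chunk):
        self.entries.append((chunk, embed(chunk)))

    def search(self, query, k=1):
        q = embed(query)
        scored = sorted(self.entries,
                        key=lambda e: -sum(a * b for a, b in zip(q, e[1])))
        return [chunk for chunk, _ in scored[:k]]

# Wire the four stages together: load, chunk, embed, store.
store = InMemoryVectorStore()
for doc in load_documents():
    for chunk in split_text(doc):
        store.add(chunk)
```

Once the store is populated, the retrieval side of a RAG application simply embeds the user's question with the same `embed` function and asks the store for the nearest chunks, e.g. `store.search("knowledge base", k=1)`.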