This chapter covers
- Data loading
- Text splitting or chunking
- Converting text to embeddings
- Storing embeddings in vector databases
- Examples in Python using LangChain
In chapter 2, we discussed the main components of retrieval-augmented generation (RAG) systems. You may recall that the indexing pipeline creates the knowledge base, or non-parametric memory, of RAG applications. An indexing pipeline needs to be set up before real-time user interaction with the large language model (LLM) can begin.
This chapter elaborates on the four components of the indexing pipeline. We begin with data loading, which involves connecting to the source, extracting files, and parsing text. At this stage, we introduce LangChain, a framework that has become increasingly popular in the LLM app developer community. Next, we explain why data splitting, or chunking, is necessary and discuss chunking strategies. Embeddings are an important design pattern in the world of AI and ML; we explore them in detail and explain their relevance in the RAG context. Finally, we look at a newer storage technique called vector storage and the databases that facilitate it.
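To make these four stages concrete before we examine each one, here is a minimal sketch of an indexing pipeline in LangChain. The file name knowledge_base.txt, the chunking parameters, and the choice of OpenAI embeddings with a FAISS vector store are illustrative assumptions rather than prescriptions, and the import paths assume a recent LangChain release (the package layout has changed across versions).

```python
# A minimal indexing pipeline sketch: load -> chunk -> embed -> store.
# Assumes: pip install langchain-community langchain-text-splitters
# langchain-openai faiss-cpu, and an OPENAI_API_KEY in the environment.
from langchain_community.document_loaders import TextLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS

# 1. Data loading: connect to the source and parse the text.
documents = TextLoader("knowledge_base.txt").load()

# 2. Chunking: split long documents into overlapping pieces.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
chunks = splitter.split_documents(documents)

# 3 & 4. Embeddings and vector storage: convert each chunk to a vector
# and index the vectors for similarity search.
vector_store = FAISS.from_documents(chunks, OpenAIEmbeddings())
```

Each line of this sketch corresponds to one of the four components, and the rest of the chapter unpacks them in order.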