chapter five

5 The Data Service: teaching AI what your organization knows

This chapter covers

  • Designing the Data Service to give teams searchable knowledge indexes without building their own parsing, chunking, and embedding pipelines
  • Organizing knowledge into isolated indexes so teams configure retrieval independently
  • Building an ingestion pipeline that detects file formats, extracts text, and chunks documents into searchable pieces
  • Generating embeddings through the Model Service to reuse provider abstraction, fallback logic, and cost tracking
  • Abstracting vector storage and search to support multiple backends, with a complete pgvector implementation
  • Supporting hybrid retrieval by extending the vector store interface with optional keyword search
  • Exposing the Data Service through the gRPC contract and platform SDK

An AI assistant that remembers your conversation but doesn't know your company's policies, products, or procedures is still going to make things up. It will hallucinate confidently about return windows, invent product features, and cite policies that don't exist. Conversational memory, which we built in Chapter 4, is only half the story. The other half is grounding: connecting AI applications to organizational knowledge so that responses reflect reality rather than plausible guesses.

5.1 From documents to searchable knowledge

5.2 Indexes: organizing knowledge

5.2.1 Why isolation matters

5.2.2 Index configuration

5.2.3 Index operations

5.3 Ingestion pipeline: from raw files to vectors

5.3.1 The challenge of diverse formats

5.3.2 Pipeline architecture

5.3.3 Format detection and text extraction

5.3.4 Metadata: the filtering foundation

5.3.5 Chunking: breaking text into retrievable pieces

5.3.6 Generating embeddings

5.3.7 Document lifecycle

5.3.8 Document management

5.3.9 Asynchronous ingestion

5.4 Vector storage and search

5.4.1 Vector store interface

5.4.2 Choosing a vector store backend

5.4.3 The pgvector implementation

5.4.4 Search orchestration

5.5 Hybrid search: combining vectors with keywords

5.5.1 Adding keyword search to the platform

5.5.2 PostgreSQL keyword search implementation

5.5.3 Merging results: Reciprocal Rank Fusion

5.5.4 Putting it together

5.6 Service contract and complete retrieval flow

5.7 Summary