
3 The Model Service: Your platform’s gateway to AI models


This chapter covers

  • Defining the model service contract
  • Building provider adapters to call multiple models from a single interface
  • Enabling multimodal inputs and structured outputs
  • Implementing streaming responses
  • Configuring fallback chains, retry strategies, and rate limiting for operational resilience
  • Applying routing patterns based on cost, load, and capability requirements
  • Caching responses and prompts

Every platform service we build supports a single goal: generating intelligent responses. The Session Service remembers conversations. The Data Service retrieves organizational knowledge. The Tool Service executes actions. But the Model Service is where these capabilities converge. It's the component that produces the AI's response. When Sarah's patient intake assistant answers a question, the Model Service orchestrates the entire interaction: assembling context from other services, selecting an appropriate provider, and transforming a user's message into a helpful reply.
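Before we define the contract in detail, here is a minimal sketch of that orchestration in Python. All names here (`ChatRequest`, `ProviderAdapter`, `ModelService`) are illustrative assumptions for this preview, not the chapter's actual API; the real contract is defined in section 3.1.

```python
from dataclasses import dataclass

# NOTE: every name below is an illustrative assumption, not the platform's real API.

@dataclass
class ChatRequest:
    model: str
    messages: list  # OpenAI-style [{"role": ..., "content": ...}] dicts

@dataclass
class ChatResponse:
    content: str
    provider: str

class ProviderAdapter:
    """Translates the platform's request into a provider-specific call.

    A real adapter would call OpenAI, Anthropic, a local model, etc.;
    this stub just echoes the last user message.
    """
    name = "stub"

    def generate(self, request: ChatRequest) -> ChatResponse:
        last = request.messages[-1]["content"]
        return ChatResponse(content=f"echo: {last}", provider=self.name)

class ModelService:
    """Routes each request to the adapter registered for the requested model."""

    def __init__(self) -> None:
        self._adapters: dict[str, ProviderAdapter] = {}

    def register(self, model: str, adapter: ProviderAdapter) -> None:
        self._adapters[model] = adapter

    def generate(self, request: ChatRequest) -> ChatResponse:
        adapter = self._adapters[request.model]  # provider selection
        return adapter.generate(request)

service = ModelService()
service.register("stub-model", ProviderAdapter())
resp = service.generate(ChatRequest(model="stub-model",
                                    messages=[{"role": "user", "content": "hello"}]))
print(resp.content)
```

The key idea the chapter builds on is visible even in this stub: callers talk to one `generate` interface, and everything provider-specific lives behind an adapter.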

3.1 The model service contract

3.1.1 Generating responses

3.1.2 Discovering available models

3.1.3 Managing system prompts

3.1.4 Registering custom models

3.1.5 The gRPC contract

3.1.6 Request and response structures

3.2 Provider abstraction

3.2.1 How providers differ

3.2.2 The unified provider interface

3.2.3 OpenAI message format as platform standard

3.2.4 What adapters do

3.2.5 The adapter implementation pattern

3.3 Streaming responses

3.3.1 The streaming architecture

3.3.2 The ChatChunk message

3.3.3 Streaming and error handling

3.4 Resilience: fallbacks and retries

3.4.1 Retry configuration

3.4.2 Fallback configuration

3.4.3 Configuration examples

3.5 Routing strategies

3.5.1 Routing configuration

3.5.2 Cost-aware routing

3.5.3 Load-based routing

3.5.4 Feature-based routing

3.5.5 Combining patterns

3.6 Rate limiting

3.7 Caching for cost and performance

3.7.1 Two levels of caching