chapter eleven

11 Deployment and serving

This chapter covers

  • SLM serving and inference with vLLM
  • SLM serving with FastAPI
  • SLM deployment and serving on devices with MLC LLM
  • Options for SLM deployment and inference on Android devices

It’s time to look at some of the most common environments and tools for deploying and serving small, customized language models. We won’t cover them all: the closer you get to local or edge deployments, the more hardware options and frameworks you’ll encounter. I’ve focused on the options that are currently most popular across operating systems and hardware combinations.

11.1 vLLM

11.1.1 Offline serving

11.1.2 Online serving

11.2 FastAPI

11.2.1 Benchmarking various models

11.2.2 Deploying the best-performing model with FastAPI

11.3 MLC LLM

11.4 Deployment and inference on Android devices

11.4.1 MLC LLM framework

11.4.2 MLLM framework

11.4.3 Hugging Face Transformers

Summary