11 Deployment and Serving

 

This chapter covers

  • Small Language Model (SLM) deployment and inference
  • vLLM
  • FastAPI
  • MLC LLM
  • Android devices

This chapter takes a deep dive into some of the most likely target environments and tools for deploying and serving small, customized language models. The list isn't meant to be comprehensive: the closer you move to local and edge deployment, the more hardware options you encounter, and the more frameworks and libraries there are to choose from. I have therefore selected those that are currently the most popular across multiple operating systems and hardware combinations.
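
To give you a first taste of what's ahead, the short sketch below shows the kind of offline inference workflow we'll build on with vLLM in section 11.1. It assumes vLLM is installed; the checkpoint name is just an example of a small instruct model, so swap in your own SLM.

from vllm import LLM, SamplingParams

# Load a small instruct model (example checkpoint; any small model works)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Generation settings
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

# Offline, in-process batch inference: no server required
outputs = llm.generate(["What is a Small Language Model?"], params)
print(outputs[0].outputs[0].text)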

11.1 vLLM

11.1.1 Offline serving

11.1.2 Online serving

11.2 FastAPI

11.2.1 Benchmarking various models

11.2.2 Deploying the most performant model with FastAPI

11.3 MLC LLM

11.4 Deployment and inference on Android devices

11.4.1 MLC LLM Framework

11.4.2 MLLM Framework

11.4.3 Hugging Face Transformers

11.5 Summary