11 Deployment and Serving

 

This chapter covers

  • Small Language Model (SLM) deployment and inference
  • vLLM
  • FastAPI
  • MLC LLM
  • Android devices

This chapter takes a deep dive into some of the most likely target environments and tools for deploying and serving small, customized language models. The list isn't meant to be comprehensive: the closer you move to local and edge deployment, the more hardware options you encounter, and the more frameworks and libraries there are to choose from. I have therefore selected those that are currently the most popular across multiple operating systems and hardware combinations.
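
To give you a first taste of what's ahead, the short sketch below shows the kind of offline inference workflow we'll build on with vLLM in section 11.1. It assumes vLLM is installed; the checkpoint name is just an example of a small instruct model, so swap in your own SLM.

from vllm import LLM, SamplingParams

# Load a small instruct model (example checkpoint; any small model works)
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Generation settings
params = SamplingParams(temperature=0.7, top_p=0.9, max_tokens=128)

# Offline, in-process batch inference: no server required
outputs = llm.generate(["What is a Small Language Model?"], params)
print(outputs[0].outputs[0].text)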

11.1 vLLM

11.1.1 Offline serving

11.1.2 Online serving

11.2 FastAPI

11.2.1 Benchmarking various models

11.2.2 Deploying the most performant model with FastAPI

11.3 MLC LLM

11.4 Deployment and inference on Android devices

11.4.1 MLC LLM Framework

11.4.2 MLLM Framework

11.4.3 Hugging Face Transformers

11.5 Summary