10 Deployment and Serving


This chapter covers

  • Small Language Model (SLM) deployment and inference
  • Offline and online serving with vLLM
  • Benchmarking and serving models with FastAPI
  • Deploying models with MLC LLM
  • Deployment and inference on Android devices

This chapter takes a deep dive into some of the most likely target environments and tools for deploying and serving small, customized language models. The list isn't meant to be comprehensive: the closer you move to local and edge deployments, the more hardware options you encounter, and the more frameworks and libraries there are to choose from. I have therefore selected those that are currently the most popular across multiple operating systems and hardware combinations.
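To give a taste of what's ahead, here is a minimal offline-inference sketch using vLLM's Python API. The model name is only an illustrative assumption: any small model hosted on the Hugging Face Hub should work the same way, subject to your hardware.

from vllm import LLM, SamplingParams

# Illustrative small model (an assumption); any SLM from the
# Hugging Face Hub can be substituted here.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")

# Basic sampling configuration for short completions.
params = SamplingParams(temperature=0.7, max_tokens=128)

# Offline (batch) generation: prompts in, completions out,
# with no server process involved.
outputs = llm.generate(["What is the capital of Italy?"], params)
print(outputs[0].outputs[0].text)

Section 10.1 builds on this pattern and contrasts it with online serving, where vLLM exposes the model behind an OpenAI-compatible HTTP server instead.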

10.1 vLLM

10.1.1 Offline serving

10.1.2 Online serving

10.2 FastAPI

10.2.1 Benchmarking various models

10.2.2 Deploying the most performant model with FastAPI

10.3 MLC LLM

10.4 Deployment and inference on Android devices

10.4.1 MLC LLM framework

10.4.2 MLLM framework

10.4.3 Hugging Face’s Transformers

10.5 Summary