Appendix E. Further inference engines
E.1 Additional inference engines
This appendix introduces two additional inference engines beyond those covered in chapter 10. While they are less distinctive than the ones discussed earlier, they are still worth exploring.
LocalAI acts as a router, connecting an OpenAI-compatible REST API to various backend engines, such as llama.cpp and vLLM.
GPT4All is similar to LM Studio. It offers an intuitive graphical user interface and is especially approachable for developers with limited experience.
E.1.1 LocalAI
LocalAI is a free inference engine that exposes an OpenAI-compatible REST API for running LLMs on consumer hardware, including CPU-only systems and low-end GPUs. It supports a range of quantized open-source LLMs and can also handle audio-to-text, text-to-audio, and multimodal models. Here, the focus is on its text LLM capabilities.
Written in Go, LocalAI serves as a higher-level inference engine that routes OpenAI-compatible REST API calls to backend engines such as llama.cpp or vLLM. Figure E.1, adapted from the LocalAI documentation, illustrates this architecture.
Figure E.1 LocalAI architecture: LocalAI routes OpenAI-compatible REST API calls to backend engines such as llama.cpp, vLLM, or other custom backends.
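
Because LocalAI exposes the OpenAI REST API, existing OpenAI client code can target it simply by changing the base URL. The following is a minimal sketch using the official openai Python package; the port (LocalAI's default, 8080) and the model name are assumptions that depend on how your server is configured:

```python
from openai import OpenAI

# Point the standard OpenAI client at the local server instead of api.openai.com.
# LocalAI listens on port 8080 by default; no real API key is required.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="llama-3.2-1b-instruct",  # hypothetical name; use a model your server has loaded
    messages=[
        {"role": "user", "content": "Summarize what LocalAI does in one sentence."}
    ],
)
print(response.choices[0].message.content)
```

The same pattern applies to the other OpenAI-style endpoints LocalAI implements: only the base URL changes, so code written against the hosted API can be pointed at the local server with minimal modification.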

Server mode
LocalAI is distributed primarily as container images, which can be deployed with Docker, Podman, or Kubernetes. Popular models are downloaded automatically when the container starts.
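
Once the container is running, you can verify that the server is reachable and see which models it has registered through the OpenAI-compatible /v1/models endpoint. A minimal check, again assuming the default port 8080 is mapped to the host:

```python
import requests

BASE_URL = "http://localhost:8080"  # assumes the container's default port mapping

# /v1/models is part of the OpenAI-compatible surface and lists the
# models the server currently knows about.
resp = requests.get(f"{BASE_URL}/v1/models", timeout=10)
resp.raise_for_status()
for model in resp.json().get("data", []):
    print(model["id"])
```

If the request succeeds and prints one or more model IDs, the server is up and those names can be used in the model field of chat or completion requests.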