Appendix E. Open-source LLMs


This appendix covers

  • Advantages of open-source LLMs: flexibility, transparency, and control.
  • Performance benchmarks and key features of leading open-source LLMs.
  • Challenges of local deployment and strategies to address them.
  • Selecting an optimal inference engine for your use case.

In earlier chapters, you worked with OpenAI's public REST API. It's a straightforward way to build LLM applications because you don't need to set up a local LLM host: after signing up with OpenAI and generating an API key, you can send requests to OpenAI's endpoints and access the models' capabilities. This quick setup lets you work right away with state-of-the-art models such as GPT-4o, GPT-4o mini, or the o1 and o3 reasoning models. The main drawback is cost: running examples like summarization can cost a few cents or even a few dollars. If you're working on projects for your company, privacy may also be a concern. Some employers block OpenAI entirely to avoid the risk of leaking sensitive or proprietary data.
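For reference, a hosted-API call looks roughly like the following minimal sketch. It assumes the official openai Python package (version 1.x) and an OPENAI_API_KEY environment variable; the model name and prompt are placeholders.

import os
from openai import OpenAI

# The client sends every request to OpenAI's hosted endpoints,
# so usage is billed per token.
client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

response = client.chat.completions.create(
    model="gpt-4o-mini",   # hosted model; swap in any model your account can access
    messages=[
        {"role": "user",
         "content": "Summarize the benefits of a REST API in one sentence."},
    ],
)
print(response.choices[0].message.content)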

This appendix introduces open-source LLMs, a practical way to reduce costs and address privacy concerns. These models are especially appealing to individuals and organizations that prioritize data confidentiality or are new to AI. I'll guide you through the most popular open-source LLM families, their features, and the advantages they offer. The focus is on running these models locally, with options ranging from high-performance, advanced setups to user-friendly tools ideal for learning and experimentation.
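To make the trade-off concrete, here is a minimal sketch of the pattern this appendix works toward: pointing the same OpenAI client at a locally hosted, OpenAI-compatible server instead of OpenAI's endpoints. The base URL and model name below are assumptions (Ollama, for example, serves an OpenAI-compatible API at http://localhost:11434/v1); adjust them to whatever local inference engine and model you run.

from openai import OpenAI

# Point the standard OpenAI client at a local OpenAI-compatible server.
# Requests never leave your machine, and there is no per-token bill.
client = OpenAI(
    base_url="http://localhost:11434/v1",  # assumed local server address (Ollama default)
    api_key="not-needed",                  # local servers typically ignore the key, but the client requires one
)

response = client.chat.completions.create(
    model="llama3.1",   # whichever open-source model your local server has loaded
    messages=[{"role": "user", "content": "Hello from a locally hosted LLM!"}],
)
print(response.choices[0].message.content)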

E.1 Benefits of open-source LLMs

E.1.1 Transparency

E.1.2 Privacy

E.1.3 Community-driven

E.1.4 Cost savings

E.2 Popular open-source LLMs

E.3 Considerations for running open-source LLMs locally

E.3.1 Limitations of consumer hardware

E.3.2 Quantization

E.3.3 OpenAI REST API compatibility

E.4 Local inference engines

E.4.1 Llama.cpp

E.4.2 Ollama

E.4.3 vLLM

E.4.4 llamafile

E.4.5 LM Studio

E.4.6 LocalAI

E.4.7 GPT4All

E.4.8 Comparing local inference engines

E.4.9 Choosing a local inference engine

E.5 Inference via the Hugging Face Transformers library

E.5.1 Hugging Face Transformers library

E.5.2 LangChain’s HuggingFace Pipeline

E.6 Building a local summarization engine

E.6.1 Choosing the inference engine

E.6.2 Starting up the OpenAI-compatible server

E.6.3 Modifying the original solution

E.6.4 Running the summarization engine through the local LLM

E.6.5 Comparison between OpenAI and the local LLM

E.7 Summary