11 Using open-source LLMs

 

This chapter covers

  • Advantages of open-source LLMs: flexibility, transparency, and control.
  • Performance benchmarks and key features of leading open-source LLMs.
  • Challenges of local deployment and strategies to address them.
  • Selecting an optimal inference engine for your use case.

In earlier chapters, you worked with OpenAI's public REST API. It's a straightforward way to build LLM applications because you don't need to host an LLM yourself: after signing up with OpenAI and generating an API key, you can send requests to OpenAI's endpoints and use the models' capabilities right away. This quick setup gives you access to state-of-the-art models such as GPT-4o, GPT-4o mini, and the o1 and o3 reasoning models. The main drawback is cost: running the examples in this book, such as the summarization engine, can cost anywhere from a few cents to a few dollars. If you're working on projects for your company, privacy may also be a concern; some employers block OpenAI entirely to avoid the risk of leaking sensitive or proprietary data.
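As a quick reminder of that workflow, here is a minimal sketch of the kind of request made in earlier chapters. It assumes the official openai Python package and an OPENAI_API_KEY environment variable; the model name and prompt are placeholders, not the exact values used before.

from openai import OpenAI

client = OpenAI()  # reads the OPENAI_API_KEY environment variable

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "user", "content": "Summarize the following text: ..."}
    ],
)
print(response.choices[0].message.content)

Keep this pattern in mind: later in this chapter (sections 11.3.3 and 11.6), you'll reuse essentially the same client code against a locally hosted, OpenAI-compatible server by changing little more than the base URL and the model name.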

This chapter introduces open-source LLMs, a practical way to reduce costs and address privacy concerns. These models are especially appealing to individuals and organizations that prioritize data confidentiality, as well as to those just getting started with AI. I'll guide you through the most popular open-source LLM families, their features, and the advantages they offer. The focus will be on running these models yourself, with options ranging from high-performance, advanced setups to user-friendly tools that are ideal for learning and experimentation.

11.1 Benefits of open-source LLMs

11.1.1 Transparency

11.1.2 Privacy

11.1.3 Community driven

11.1.4 Cost savings

11.2 Popular open-source LLMs

11.3 Considerations on running open-source LLMs locally

11.3.1 Limitations of consumer hardware

11.3.2 Quantization

11.3.3 OpenAI REST API compatibility

11.4 Local inference engines

11.4.1 Llama.cpp

11.4.2 Ollama

11.4.3 vLLM

11.4.4 llamafile

11.4.5 LM Studio

11.4.6 Further inference engines

11.4.7 Comparing local inference engines

11.4.8 Choosing a local inference engine

11.5 Inference via the Hugging Face Transformers library

11.5.1 Hugging Face Transformers library

11.5.2 LangChain’s HuggingFacePipeline

11.6 Building a local summarization engine

11.6.1 Choosing the inference engine

11.6.2 Starting up the OpenAI-compatible server

11.6.3 Modifying the original solution

11.6.4 Running the summarization engine through the local LLM

11.6.5 Comparison between OpenAI and local LLM

11.7 Summary