4 Running inference
This chapter covers
- Generating different types of content
- Calculating the cost of inference with a large model
- Identifying areas for performance improvements and cost savings
Chapters 2 and 3 presented examples of data preparation and SLM tuning. Now I’ll introduce SLM inference and offer tips for estimating GPU costs and finding performance and cost improvements.
4.1 How to generate content
In this chapter, we’ll use a single open source language model as the reference for content generation. Along the way, we’ll cover the key factors that drive execution speed, accuracy, and compute cost. You’ll need that background when applying the quantization methods I’ll explain in chapters 5, 6, and 9.
Our reference open source model is GPT-Neo large (2.7 billion trainable parameters, https://huggingface.co/EleutherAI/gpt-neo-2.7B) from the EleutherAI research group (https://www.eleuther.ai/). Although GPT-Neo is no longer actively developed and has been replaced by newer models, it remains a useful example for explaining the topics in this chapter because it is a Transformer model that replicates the GPT-3 architecture. The topics in this chapter are not limited to GPT-Neo; they apply to any LLM available through the Hugging Face Transformers and Accelerate (https://github.com/huggingface/accelerate) libraries (both used in this example) and to domain-specific models trained on your own data. The code examples in this chapter are also provided in the two companion Colab notebooks.
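To get a feel for why the parameter count matters for compute cost, here is a rough back-of-the-envelope sketch of how much memory the weights alone would need at different numeric precisions. It is not taken from the companion notebooks. It assumes the nominal figure of 2.7 billion parameters quoted above and counts 1 GB as 10^9 bytes. The function names and the precision table are illustrative choices, and the estimate ignores activations, the attention cache, and framework overhead, so real GPU usage will be higher.

```python
# Rough estimate of weight memory for a model at several precisions.
# Assumption: the nominal 2.7 billion parameter count; the exact count differs slightly.
NUM_PARAMS = 2_700_000_000

# Bytes needed to store a single weight at each precision (illustrative table).
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(num_params: int, bytes_per_param: float) -> float:
    """Return the memory needed for the weights alone, in GB (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


def footprint_table(num_params: int = NUM_PARAMS) -> dict:
    """Map each precision name to its estimated weight memory in GB."""
    return {
        dtype: weight_memory_gb(num_params, size)
        for dtype, size in BYTES_PER_PARAM.items()
    }


if __name__ == "__main__":
    for dtype, gigabytes in footprint_table().items():
        print(f"{dtype:>8}: {gigabytes:.2f} GB")
```

Halving the bytes per weight halves this estimate, which is the basic reason lower-precision formats can let the same model fit on a smaller, cheaper GPU.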