chapter four

4 Running inference

This chapter covers

  • Generating different types of content
  • Calculating the cost of inference with a large model
  • Identifying areas for performance improvements and cost savings

Chapters 2 and 3 presented examples of data preparation and SLM tuning. Now I’ll introduce SLM inference and offer tips for estimating GPU costs and finding performance and cost improvements.

4.1 How to generate content

In this chapter, we’ll use a language model as a reference for content generation. We’ll cover the key factors that drive execution speed, accuracy, and compute cost. You’ll need this background when applying the quantization methods I’ll explain in chapters 5, 6, and 9.

Our reference open-source model is GPT-Neo large (2.7 billion trainable parameters, https://huggingface.co/EleutherAI/gpt-neo-2.7B) from the EleutherAI research group (https://www.eleuther.ai/). Although GPT-Neo is no longer actively developed and has been superseded by newer models, it remains a useful example for this chapter because it is a Transformer model that replicates the GPT-3 architecture. The topics covered here are not limited to GPT-Neo. They apply to any LLM available through the Hugging Face Transformers and Accelerate (https://github.com/huggingface/accelerate) libraries, both of which are used in this example, and to domain-specific models trained on your own data. The code examples in this chapter are also provided in the two companion Colab notebooks.
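To give a feel for why model size and numeric precision matter for compute cost, here is a minimal back-of-the-envelope sketch, not taken from the companion notebooks. It estimates the memory needed just to hold the weights of a model with a nominal 2.7 billion parameters at several precisions. The constant and function names are illustrative assumptions, the parameter count is the rounded figure quoted above rather than an exact count, and the estimate ignores activations, the attention cache, and framework overhead.

```python
# Rough estimate of the memory needed to store model weights only.
# Assumption: a nominal 2.7 billion parameters (rounded figure, not an exact count).
NOMINAL_PARAMS = 2_700_000_000

# Storage size of one parameter at common precisions, in bytes.
BYTES_PER_PARAM = {
    "float32": 4.0,
    "float16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Return the weight storage in GiB for num_params parameters of the given dtype."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


if __name__ == "__main__":
    for dtype in BYTES_PER_PARAM:
        gib = weight_memory_gib(NOMINAL_PARAMS, dtype)
        print(f"{dtype:>8}: {gib:5.1f} GiB")
```

Under these assumptions, the weights alone take roughly 10 GiB in 32-bit floating point, and each halving of the precision halves that figure. Actual GPU memory use during inference will be higher.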

4.1.1 Text completion

4.1.2 Few-shot learning

4.1.3 Code generation

4.1.4 Evaluating the generated content

4.2 Calculating inference cost

4.3 Areas for improvement (cost savings and performance)

4.3.1 Getting the most from your GPU

4.3.2 Batching

4.3.3 Estimating the generation time

4.3.4 Optimizing GPU use with DeepSpeed

Summary