3 Running Inference

This chapter explores the inference process for LLMs, focusing in particular on the following:

  • Generating different types of content.
  • Calculating the cost of running inference with a large model.
  • Identifying areas for performance improvement and cost savings.

3.1 How to generate content

3.1.1 Text completion

3.1.2 Few-shot learning

3.1.3 Code generation

3.1.4 Evaluating the generated content

3.2 Inference cost calculation

3.3 Areas for improvement (cost savings and performance)

3.3.1 Getting the most from your GPU

3.3.2 Batching

3.3.3 Estimating the generation time

3.3.4 Optimizing GPU usage with DeepSpeed

3.4 Summary