chapter four

4 Running Inference

This chapter covers

Generating different types of content.
Calculating the cost of running inference with a large model.
Where to look for performance improvements and cost savings.

Chapters 2 and 3 provided examples of data preparation and tuning for SLMs. This chapter introduces SLM inference, offers tips for estimating how much GPU power you will need to pay for, and shows where to look for potential performance and cost improvements.

4.1 How to generate content

4.1.1 Text completion

4.1.2 Few-shot learning

4.1.3 Code generation

4.1.4 Evaluating the generated content

4.2 Inference cost calculation

4.3 Areas for improvement (cost savings and performance)

4.3.1 Get the most from your GPU

4.3.2 Batching

4.3.3 Estimating the generation time

4.3.4 Optimizing GPU usage with DeepSpeed

4.4 Summary