chapter ten

10 Profiling Insights

This chapter covers

Profiling ONNX ported models.
Transforming raw ONNX profiling data into insights.
Optimizing ONNX graphs for LLMs.

Chapter 5 covered the ONNX format and the ONNX Runtime capabilities, first in general and then with specific reference to LLMs, while chapter 6 detailed, among other methods, how to perform 8-bit quantization through the ONNX API. This chapter explores other ONNX capabilities, such as profiling LLMs that have been ported to the ONNX format, and utilities for extracting useful insights from the raw profiling data.

10.1 Profiling ONNX ported LLMs

In chapters 5 and 6 we learned that the ONNX Runtime provides high performance for running machine learning and deep learning models on a wide range of hardware. However, additional model optimization techniques and runtime configurations may be needed to improve performance for specific use cases, models, and hardware devices, depending on the target KPIs for latency, throughput, and memory utilization.

The ONNX Runtime (ORT) supports in-code performance profiling. Profiling is disabled by default, but it can be enabled at debugging time through the session options, as follows:

import onnxruntime as rt

# Profiling is off by default; switch it on in the session options
sess_options = rt.SessionOptions()
sess_options.enable_profiling = True
