chapter ten

10 Profiling insights

 

This chapter covers

  • Profiling ONNX-ported models
  • Transforming raw ONNX profiling data into insights
  • Optimizing ONNX graphs for LLMs

Chapter 5 introduced the ONNX format and ONNX Runtime capabilities, first in general and then for LLMs. Chapter 6 covered several approaches, including 8-bit quantization through the ONNX API. This chapter explores two further ONNX features: profiling the performance of LLMs ported to the ONNX format, and the tools you can use to extract useful insights from the profiling data.

10.1 Profiling ONNX-ported LLMs

The ONNX Runtime (ORT) delivers high performance for running machine learning (ML) and deep learning (DL) models across a wide range of hardware. Still, to meet specific key performance indicators (KPIs) for latency, throughput, and memory use, you may need additional model optimizations and runtime configuration for a given use case, model, and device.

ORT supports in-code performance profiling. It’s disabled by default, but you can enable it during debugging by setting the enable_profiling option to True on the SessionOptions object that you pass to an ORT inference session:

import onnxruntime as rt

sess_options = rt.SessionOptions()
sess_options.enable_profiling = True
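The options object takes effect only once it is passed to a session, and the trace is written out when profiling is ended. The following is a minimal sketch of that flow, not the book's own listing: the helper name profile_once, the "ort_profile" prefix, and the choice of the CPU execution provider are assumptions made for illustration, and the model path and input feeds are supplied by the caller.

```python
def profile_once(model_path, feeds, prefix="ort_profile"):
    """Run a single inference with profiling enabled; return the trace file name.

    model_path: path to an .onnx file (supplied by the caller)
    feeds:      dict mapping model input names to NumPy arrays
    prefix:     hypothetical file-name prefix for the generated trace
    """
    # Imported here so that loading this helper does not itself require ORT.
    import onnxruntime as rt

    sess_options = rt.SessionOptions()
    sess_options.enable_profiling = True
    # Optional: ORT appends a timestamp and ".json" to this prefix.
    sess_options.profile_file_prefix = prefix

    sess = rt.InferenceSession(
        model_path,
        sess_options,
        providers=["CPUExecutionProvider"],  # assumption: CPU-only run
    )
    sess.run(None, feeds)        # None requests all model outputs
    return sess.end_profiling()  # stops profiling and returns the JSON file name
```

Calling end_profiling() is what flushes the collected events to disk, so the returned file name is the one to open when inspecting the results.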

At runtime, ORT generates a JSON file with performance data such as threading and per-operator latency.
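To make "per-operator latency" concrete, here is a rough sketch of how such a file could be summarized with the standard library alone. It assumes the trace is a JSON array of Chrome-trace-style events in which operator executions carry cat set to "Node", a name ending in "_kernel_time", a duration in the dur field (microseconds), and the operator type under args.op_name. Field names can differ between ORT versions, so check them against your own trace. The SAMPLE_TRACE records are hand-written for illustration and are not output from a real run.

```python
import json
from collections import defaultdict

# Hand-written illustrative events (NOT real profiler output).
SAMPLE_TRACE = json.loads("""
[
  {"cat": "Session", "name": "model_loading_uri", "dur": 1200, "args": {}},
  {"cat": "Node", "name": "MatMul_0_kernel_time", "dur": 40,
   "args": {"op_name": "MatMul"}},
  {"cat": "Node", "name": "Add_1_kernel_time", "dur": 10,
   "args": {"op_name": "Add"}},
  {"cat": "Node", "name": "MatMul_2_kernel_time", "dur": 60,
   "args": {"op_name": "MatMul"}}
]
""")


def load_trace(path):
    """Read a profiling file from disk into a list of event dicts."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def latency_by_operator(events):
    """Sum kernel durations per operator type, slowest first."""
    totals = defaultdict(int)
    for event in events:
        # Assumption: operator executions are "Node" events whose
        # name ends with "_kernel_time".
        if event.get("cat") != "Node":
            continue
        if not event.get("name", "").endswith("_kernel_time"):
            continue
        op_name = event.get("args", {}).get("op_name")
        if op_name is None:
            continue
        totals[op_name] += event.get("dur", 0)
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


print(latency_by_operator(SAMPLE_TRACE))  # → {'MatMul': 100, 'Add': 10}
```

On a real run, you would pass load_trace(path) to latency_by_operator, where path is the file name that ORT reports when profiling ends. Sorting by total duration puts the operators most worth optimizing at the top.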

Let’s take a look at what’s in this profiling file, starting with a simple model and then moving on to LLMs. We’ll reuse the linear regression example from section 5.2:

10.2 Transforming raw ONNX profiling data into insights

10.3 Optimization of ONNX graphs for LLMs

Summary