5 Quantizing for Your Production Environment

 

This chapter covers

  • The precision formats used to train and run LLMs.
  • Quantizing LLMs to different low-precision formats.
  • How to quantize LLMs using different techniques and libraries.

Chapter 7 introduced you to the main ONNX framework concepts and capabilities. Among these, the possibility of performing model quantization was only touched on briefly: because of the important role it plays in boosting LLM inference performance, it deserves a dedicated chapter. That is the core topic here.

5.1 Transformers precision formats

Some numeric precision formats have already been mentioned in previous chapters of this book in the context of LLM training and inference. Let's now look at them in more depth and extend the discussion to other precision formats specific to model quantization. In conventional scientific computing, 64-bit floating point (also known as double precision) has typically been the standard, thanks to its ability to represent a wide range of numbers with high accuracy. When training deep neural networks on GPUs, however, a lower precision, 32-bit floating point, is used: 64-bit floating-point operations are considered unnecessary and computationally expensive, and GPU hardware is not optimized for 64-bit precision. 32-bit floating-point operations (also known as single precision) have therefore become the standard for DL training.
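As a quick illustration of this trade-off, the following minimal PyTorch sketch (assuming PyTorch is installed; any numerical framework would work similarly) prints the per-element memory footprint and the rounding error introduced when the same value is stored in double, single, half, and bfloat16 precision. The specific value and formats chosen here are just illustrative.

import torch

# Store an irrational-looking value in double precision as a reference.
x = torch.tensor([1 / 3], dtype=torch.float64)

# Cast it to progressively narrower floating-point formats and compare
# the memory cost per element and the stored value.
for dtype in (torch.float64, torch.float32, torch.float16, torch.bfloat16):
    y = x.to(dtype)
    print(f"{str(dtype):16} bytes/element={y.element_size()}  value={y.item():.12f}")

Running this shows that each halving of the storage size (8, 4, 2, and 2 bytes per element) comes with fewer significant digits preserved, which is exactly the accuracy-versus-cost balance that quantization pushes even further with 8-bit and 4-bit integer formats later in this chapter.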

5.2 8-bit quantization

5.2.1 Hands-on 8-bit quantization

5.2.2 LLM.int8() and quantization

5.3 8-bit quantization with ONNX

5.4 4-bit quantization

5.4.1 4-bit quantization with GPTQ

5.4.2 4-bit quantization with ggml

5.5 Summary