chapter five

5 Exploring ONNX

This chapter covers

  • The ONNX standard format
  • The ONNX Runtime
  • How ONNX can be useful for LLMs, with or without hardware acceleration

This chapter introduces the ONNX framework, which plays an important role in model optimization, quantization, and portability across frameworks and hardware vendors. If you’re new to ONNX, take the time to absorb the concepts in this chapter, because later chapters rely on them heavily.

5.1 The ONNX format

ONNX (Open Neural Network Exchange, https://onnx.ai/) is an open standard for machine learning (ML) interoperability. First released in 2017, it is now a graduate project of the Linux Foundation for Artificial Intelligence (LFAI, https://lfaidata.foundation/). ONNX has two aims: to let models move between ML and deep learning (DL) frameworks (including Keras, TensorFlow, PyTorch, scikit-learn, and XGBoost) and to maximize performance across hardware accelerators and their toolchains (not just NVIDIA, but also Intel OpenVINO, Habana, Qualcomm, Apache TVM, Hugging Face Optimum, and more). Figure 5.1 gives a high-level view of the ONNX ecosystem, including the framework- and platform-agnostic model format and the optional ONNX Runtime.

Figure 5.1 ONNX overview

As ML and DL frameworks evolve, portability becomes critical—what we use today may not be what we’ll use tomorrow. ONNX is a robust open standard that reduces lock-in to specific frameworks and hardware accelerators, helping ensure an organization’s models remain usable over time.
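
To make the idea of a framework-neutral format concrete, the following is a toy sketch that uses only the Python standard library. It is not the ONNX API, and real ONNX models are serialized with Protocol Buffers rather than JSON. The dictionary keys (graph, node, op_type, input, output, initializer) loosely mirror ONNX field names, while TOY_OPS and run are hypothetical names invented for this illustration. The point is the separation shown in figure 5.1: the model is plain data, and any runtime that understands the operator names can execute it.

```python
import json
import operator

# Toy operator set (an assumption for illustration): maps an op_type name
# to a Python implementation. Real ONNX defines a far larger, versioned set.
TOY_OPS = {
    "Add": operator.add,
    "Mul": operator.mul,
}

# The model is plain data: named inputs and outputs, constants
# ("initializers"), and a list of nodes in execution order.
# Nothing here depends on the framework that produced it.
model = {
    "graph": {
        "input": ["x"],
        "output": ["y"],
        "initializer": {"w": 3.0, "b": 1.0},
        "node": [
            {"op_type": "Mul", "input": ["x", "w"], "output": ["xw"]},
            {"op_type": "Add", "input": ["xw", "b"], "output": ["y"]},
        ],
    }
}


def run(model, feeds):
    """Evaluate the toy graph; stands in for a runtime that consumes the format."""
    graph = model["graph"]
    values = dict(graph["initializer"])  # start from the stored constants
    values.update(feeds)                 # add the caller-supplied inputs
    for node in graph["node"]:
        fn = TOY_OPS[node["op_type"]]
        args = [values[name] for name in node["input"]]
        values[node["output"][0]] = fn(*args)
    return {name: values[name] for name in graph["output"]}


# Round-trip through a serialized form, as an exporter and a runtime would.
serialized = json.dumps(model)
restored = json.loads(serialized)
result = run(restored, {"x": 2.0})
print(result)  # {'y': 7.0}
```

Because the graph computes y = x * w + b with w = 3.0 and b = 1.0, feeding x = 2.0 yields 7.0. Replacing TOY_OPS with a different implementation of the same operator names would leave the serialized model untouched, which is the kind of portability ONNX aims for at a much larger scale.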

5.2 ONNX operators and types

5.3 The ONNX Runtime

5.4 ONNX Runtime providers

5.5 ONNX for LLMs on CPU

5.6 ONNX for LLMs on GPU

5.6.1 ONNX for GPT on GPU

5.6.2 I/O binding

Summary