
Appendix A. Setup, Code, and Reference

A.1 Getting started: installation and setup

This appendix provides everything you need to set up your environment and run the code examples from this book. Whether you’re working with a local GPU, a cloud instance, or a multi-node cluster, we’ll get you up and running.

A.1.1 Prerequisites

Before installing anything, confirm your machine meets the hardware bar and that your OS and toolchain are supported.

Hardware requirements

Minimum requirements (for basic chapters):

NVIDIA GPU with CUDA support (compute capability 7.0+)
8GB+ GPU memory (VRAM)
Examples: RTX 2070, RTX 2080, RTX 3060, or newer

Recommended for all chapters:

NVIDIA GPU with Ampere-or-newer Tensor Cores (compute capability 8.0+)
16GB+ GPU memory
Examples: RTX 3090, RTX 4090, A100, H100
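
To see which tier a machine falls into, we can ask the driver and compare against the thresholds listed above. The following is a minimal sketch, not part of the book's code: it assumes nvidia-smi is on the PATH and that the installed driver supports the compute_cap query field (older drivers may not). The names classify_gpu, parse_gpu_rows, and query_gpus are hypothetical, and the 5% slack on memory is our own allowance for drivers that report slightly less than the marketed capacity.

```python
import subprocess

# Thresholds from the lists above: (minimum compute capability, VRAM in GiB).
MINIMUM = ((7, 0), 8)
RECOMMENDED = ((8, 0), 16)
SLACK = 0.95  # assumption: tolerate drivers reporting a bit under nominal VRAM


def classify_gpu(compute_cap, vram_mib):
    """Return 'recommended', 'minimum', or 'below minimum' for one GPU."""
    major, minor = (int(part) for part in compute_cap.split("."))
    for label, (cap, gib) in (("recommended", RECOMMENDED), ("minimum", MINIMUM)):
        if (major, minor) >= cap and vram_mib >= gib * 1024 * SLACK:
            return label
    return "below minimum"


def parse_gpu_rows(csv_text):
    """Parse 'name, compute_cap, memory.total' rows (csv,noheader,nounits)."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split from the right so a comma inside the GPU name is harmless.
        name, cap, mem = (field.strip() for field in line.rsplit(",", 2))
        rows.append((name, cap, int(mem)))
    return rows


def query_gpus():
    """Ask nvidia-smi for every visible GPU; raises if the tool is missing."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=name,compute_cap,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    return parse_gpu_rows(result.stdout)


if __name__ == "__main__":
    try:
        gpus = query_gpus()
    except (FileNotFoundError, subprocess.CalledProcessError) as err:
        print(f"could not query nvidia-smi: {err}")
        gpus = []
    for name, cap, mem in gpus:
        print(f"{name}: compute capability {cap}, {mem} MiB -> {classify_gpu(cap, mem)}")
```

Keeping the parsing and classification separate from the nvidia-smi call means the logic can be exercised on a machine without a GPU, for example by passing classify_gpu a capability string and a memory size by hand.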

Chapter-specific requirements:

Chapters 1-6: Any CUDA-capable GPU
Chapter 7 (WMMA Tensor Cores): Volta or later (V100, RTX 20+ series)
Chapter 7 (WGMMA Tensor Cores): Hopper architecture (H100)
Chapter 10 (Distributed, Single Node): 8× GPU server (H100 recommended)
Chapter 10 (Distributed, Multi-Node): 2+ servers with 8× GPUs each, connected via InfiniBand or 100+ GbE
Chapter 11 (CUTLASS): Hopper (H100) for advanced features
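
The chapter-specific list can also be expressed as a small lookup. This is only a sketch under stated assumptions: chapter_support is a hypothetical helper, Volta is mapped to compute capability 7.0 and Hopper to major version 9 following NVIDIA's usual numbering, and the Hopper rows are treated as "major version equals 9" because the list names that architecture specifically. The multi-node row is left out, since it depends on the interconnect rather than on anything a single GPU reports.

```python
def chapter_support(major, minor, gpu_count=1):
    """Map each row of the chapter list to True/False for one machine.

    major, minor: compute capability of the GPU (e.g. 9, 0 for H100).
    gpu_count:    number of GPUs in this server.
    """
    cap = (major, minor)
    return {
        "Chapters 1-6": True,                      # any CUDA-capable GPU
        "Chapter 7 (WMMA)": cap >= (7, 0),         # Volta or later
        "Chapter 7 (WGMMA)": major == 9,           # assumption: Hopper only
        "Chapter 10 (single node)": gpu_count >= 8,
        "Chapter 11 (CUTLASS advanced features)": major == 9,  # assumption: Hopper only
    }


if __name__ == "__main__":
    # Example: a single RTX 4090 (compute capability 8.9).
    for chapter, ok in chapter_support(8, 9).items():
        print(f"{chapter}: {'ok' if ok else 'needs different hardware'}")
```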

Software prerequisites

You will need the following on the host before installing CUDA:

Operating System: Linux (Ubuntu 22.04 LTS recommended) or WSL2 on Windows
Python: 3.8 or later
C/C++ Compiler: GCC 9+ or Clang
Build Tools: GNU Make, CMake (optional)
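
A quick way to confirm these host prerequisites is a short script that looks for each tool. The sketch below uses only the Python standard library; check_host and gcc_major are hypothetical names, and the check is deliberately shallow: it confirms that the tools are on the PATH and that GCC reports major version 9 or later, without verifying that they are correctly configured.

```python
import platform
import shutil
import subprocess
import sys


def gcc_major():
    """Return GCC's major version as an int, or None if it cannot be determined."""
    if shutil.which("gcc") is None:
        return None
    out = subprocess.run(["gcc", "-dumpversion"], capture_output=True, text=True).stdout
    try:
        return int(out.strip().split(".")[0])
    except ValueError:
        return None


def check_host():
    """Return a dict of pass/fail flags for the prerequisites listed above."""
    major = gcc_major()
    compiler_ok = (major is not None and major >= 9) or shutil.which("clang") is not None
    return {
        "os_is_linux": platform.system() == "Linux",  # WSL2 also reports Linux
        "python_3_8_plus": sys.version_info >= (3, 8),
        "gcc_9_plus_or_clang": compiler_ok,
        "make": shutil.which("make") is not None,
        "cmake_optional": shutil.which("cmake") is not None,
    }


if __name__ == "__main__":
    for item, ok in check_host().items():
        print(f"{item}: {'found' if ok else 'missing'}")
```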

Knowledge prerequisites:
