Appendix A. Setup, code, and reference

 

A.1 Getting started: installation and setup

This appendix provides everything you need to set up your environment and run the code examples from this book. Whether you’re working with a local GPU, a cloud instance, or a multi-node cluster, we’ll get you up and running.

A.1.1 Prerequisites

Before installing anything, confirm your machine meets the hardware bar and that your OS and toolchain are supported.

Hardware requirements

Minimum Requirements (for basic chapters):

  • NVIDIA GPU with CUDA support (compute capability 7.0+)
  • 8GB+ GPU memory (VRAM)
  • Examples: RTX 2070, RTX 2080, RTX 3060, or newer

Recommended for All Chapters:

  • NVIDIA GPU with Ampere-generation or newer Tensor Cores (compute capability 8.0+)
  • 16GB+ GPU memory
  • Examples: RTX 3090, RTX 4090, A100, H100

Chapter-Specific Requirements:

  • Chapters 1-6: Any CUDA-capable GPU
  • Chapter 7 (WMMA Tensor Cores): Volta or later (V100, RTX 20+ series)
  • Chapter 7 (WGMMA Tensor Cores): Hopper architecture (H100)
  • Chapter 10 (Distributed, Single Node): 8× GPU server (H100 recommended)
  • Chapter 10 (Distributed, Multi-Node): 2+ servers with 8× GPUs each, connected via InfiniBand or 100+ GbE
  • Chapter 11 (CUTLASS): Hopper (H100) for advanced features
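
As a rough way to see which tier a machine falls into, the following Python sketch queries nvidia-smi and compares each GPU against the minimum and recommended thresholds listed above. It assumes a driver recent enough that nvidia-smi accepts the compute_cap query field. The helper names (parse_gpu_line, classify, query_gpus) and the 10% memory tolerance are illustrative choices, not part of the book's code.

```python
import shutil
import subprocess

# Tiers from the lists above: (label, minimum compute capability, minimum VRAM in GiB).
# Checked from most to least demanding.
TIERS = [
    ("recommended", (8, 0), 16),
    ("minimum", (7, 0), 8),
]

# Assumption: drivers often report slightly less than the nominal capacity
# (for example, a "16GB" card may show about 16160 MiB), so allow 10% slack.
MEMORY_TOLERANCE = 0.9


def parse_gpu_line(line):
    """Parse one CSV line such as 'NVIDIA A100-SXM4-40GB, 8.0, 40960'."""
    name, cap, mem = [field.strip() for field in line.rsplit(",", 2)]
    major, minor = (int(part) for part in cap.split("."))
    return name, (major, minor), int(mem)


def classify(cap, mem_mib):
    """Return the first tier whose capability and memory thresholds are both met."""
    for label, min_cap, min_gib in TIERS:
        if cap >= min_cap and mem_mib >= min_gib * 1024 * MEMORY_TOLERANCE:
            return label
    return "below minimum"


def query_gpus():
    """Return (name, capability, memory_mib) tuples, or [] if nvidia-smi is unusable."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi",
         "--query-gpu=name,compute_cap,memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=False,
    )
    if result.returncode != 0:
        return []
    gpus = []
    for line in result.stdout.splitlines():
        if not line.strip():
            continue
        try:
            gpus.append(parse_gpu_line(line))
        except ValueError:
            # Skip lines that are not in the expected three-field format.
            continue
    return gpus


if __name__ == "__main__":
    found = query_gpus()
    if not found:
        print("No GPU information available (is the NVIDIA driver installed?)")
    for name, cap, mem_mib in found:
        print(f"{name}: compute capability {cap[0]}.{cap[1]}, "
              f"{mem_mib} MiB -> {classify(cap, mem_mib)}")
```

The script only covers the general tiers. The chapter-specific requirements (Hopper for WGMMA, multi-GPU servers for Chapter 10) still need to be checked against the list above.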

Software prerequisites

  • Operating System: Linux (Ubuntu 22.04 LTS recommended) or WSL2 on Windows
  • Python: 3.8 or later
  • C/C++ Compiler: GCC 9+ or Clang
  • Build Tools: GNU Make, CMake (optional)
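
The software list can be checked in a similar way. The sketch below, using only the Python standard library, reports whether the interpreter is at least 3.8 and whether a compiler, make, and cmake are on the PATH. The function names (python_ok, find_tool, report) are hypothetical, and the script does not verify the GCC version, so the "GCC 9+" requirement still needs a manual check (for example, with gcc --version).

```python
import platform
import shutil
import sys

MIN_PYTHON = (3, 8)

# Any one executable in each group satisfies that requirement.
REQUIRED_TOOLS = {
    "C/C++ compiler": ["gcc", "clang"],
    "GNU Make": ["make"],
}
OPTIONAL_TOOLS = {
    "CMake": ["cmake"],
}


def python_ok(version_info=sys.version_info, minimum=MIN_PYTHON):
    """True if the (major, minor) version is at least the minimum."""
    return tuple(version_info[:2]) >= minimum


def find_tool(candidates, which=shutil.which):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = which(name)
        if path:
            return path
    return None


def report():
    """Build a list of human-readable status lines."""
    lines = []
    # WSL2 also reports "Linux" here, which matches the supported platforms.
    lines.append(f"OS: {platform.system()} "
                 f"({'ok' if platform.system() == 'Linux' else 'not Linux'})")
    lines.append(f"Python: {platform.python_version()} "
                 f"({'ok' if python_ok() else 'too old'})")
    for label, candidates in REQUIRED_TOOLS.items():
        path = find_tool(candidates)
        lines.append(f"{label}: {path or 'MISSING'}")
    for label, candidates in OPTIONAL_TOOLS.items():
        path = find_tool(candidates)
        lines.append(f"{label}: {path or 'not found (optional)'}")
    return lines


if __name__ == "__main__":
    print("\n".join(report()))
```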

Knowledge Prerequisites:

A.1.2 Core installation

A.1.3 Cloud GPU options

A.2 Running the code examples

A.2.1 Chapter 2: Vector addition (0_vecadd/)

A.2.2 Chapter 3: Naive neural network operations (1_naive/)