
3 Naive kernels

 

This chapter covers

  • Writing naive but correct CUDA kernels
  • Using dim3 to launch 2D and 3D grids of threads (see the sketch below)
  • Implementing naive matrix transpose, GEMM, and softmax kernels
  • Building sliding-window operations: convolution and pooling

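To make the dim3 bullet concrete, here is a minimal sketch of a 2D element-wise addition kernel and its host-side launch. The kernel and variable names (matrixAdd, d_a, d_b, d_c) are illustrative placeholders, not the exact listings we develop later in the chapter.

__global__ void matrixAdd(const float *a, const float *b, float *c,
                          int rows, int cols) {
    // Each thread owns exactly one (row, col) element of the output.
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < rows && col < cols) {
        int idx = row * cols + col;   // row-major flattened index
        c[idx] = a[idx] + b[idx];
    }
}

// Host side: a 16x16 block of threads, and enough blocks (via ceiling
// division) to cover matrices whose sizes are not multiples of 16.
// dim3 block(16, 16);
// dim3 grid((cols + block.x - 1) / block.x,
//           (rows + block.y - 1) / block.y);
// matrixAdd<<<grid, block>>>(d_a, d_b, d_c, rows, cols);

The same pattern extends to 3D by giving both dim3 values a z component and folding it into the index calculation.
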
In the last chapter, we crossed a major threshold. We went from being a passenger on the .to("cuda") express to getting into the driver’s seat.

This chapter puts that knowledge to work. We move from the classic "Hello, World" vector-addition kernel to the real substance of deep learning, getting serious practice by implementing some of the most important neural-network operations from scratch. As a road map, we will build naive versions of six key kernels:

  • Matrix Transpose: A fundamental data-reshaping operation.
  • GEMM (General Matrix Multiplication): The computational heart of every dense layer (previewed in the sketch after this list).
  • Softmax: The essential final activation function for classification.
  • 1D Convolution: The core of processing sequential data like time series.
  • 2D Convolution: The engine of modern computer vision.
  • 2D Max Pooling: A critical downsampling and feature-invariance operation.
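
As a preview of what "naive" means in practice, here is a rough sketch of the kind of GEMM kernel we work toward in section 3.2.1: one thread per output element, with a plain loop over the shared dimension. The names and exact signature are illustrative; the chapter builds the real version step by step.

// Naive GEMM sketch: C = A * B, with A of size MxK, B of size KxN,
// and C of size MxN, all stored row-major.
__global__ void naiveGemm(const float *A, const float *B, float *C,
                          int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int col = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (row < M && col < N) {
        float acc = 0.0f;
        // Dot product of row `row` of A with column `col` of B.
        for (int k = 0; k < K; ++k) {
            acc += A[row * K + k] * B[k * N + col];
        }
        C[row * N + col] = acc;
    }
}

The row-major flattening used here (row * K + k and friends) is the same indexing arithmetic that runs through all six kernels; what changes from kernel to kernel is how each thread computes its output element.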

3.1 From 1D to 3D Element-wise Addition

3.1.1 From Loops to Parallel Threads

3.1.2 Kernel 1: Matrix Transpose - When Data Needs to Change Shape

3.2 Core Neural Network Kernels

3.2.1 Kernel 2: General Matrix Multiplication (GEMM) - The Heart of Neural Networks

3.2.2 Kernel 3: Softmax - The Final Activation

3.3 Convolution and Pooling Toolkit

3.3.1 Kernel 4: 1D Convolution - Processing Signals and Sequences

3.3.2 Kernel 5: 2D Convolution - The Engine of Computer Vision

3.3.3 Kernel 6: 2D Max Pooling - Downsampling and Feature Detection

3.4 Summary