4 Training and verifying neural nets in raw CUDA
This chapter covers
- The challenge of building and verifying a neural network from scratch so that its results are reproducible.
- Establishing a "gold standard" implementation in PyTorch.
- Using NumPy and C to peel back layers of abstraction.
- Building a naive CUDA MLP from our C blueprint.
- Optimizing the CUDA implementation with the cuBLAS library.
- Techniques for debugging and profiling low-level GPU code.
4.1 Level 1: The PyTorch Approach
In our journey from high-level frameworks to bare-metal CUDA, we need a reliable point of reference: a "gold standard" against which we can verify every subsequent implementation. For this, we turn to PyTorch. Its combination of a high-level API, automatic differentiation, and highly optimized, battle-tested backend libraries makes it the perfect candidate for our trusted implementation.
The goal in this section is not just to build a PyTorch model, but to build it in a way that is reproducible and directly comparable to our future low-level versions. This PyTorch implementation will be our benchmark for both correctness (do we get the same loss?) and performance. The full code for this section is in the v1.py script.
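Reproducibility in PyTorch comes down to controlling every source of randomness before any weights are initialized or any data is shuffled. The snippet below is a minimal sketch of the kind of seeding boilerplate such a script might begin with; the helper name `set_seed` and the seed value are illustrative choices, not necessarily what v1.py uses.

```python
import os
import random

import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Seed every RNG that can influence training (hypothetical helper)."""
    random.seed(seed)        # Python's built-in RNG (e.g., data shuffling)
    np.random.seed(seed)     # NumPy RNG (preprocessing)
    torch.manual_seed(seed)  # PyTorch RNGs on CPU and all CUDA devices
    # Ask PyTorch for deterministic kernels; ops with no deterministic
    # implementation will raise an error rather than silently diverge.
    torch.use_deterministic_algorithms(True)
    # Some cuBLAS matmul paths need this workspace setting to be deterministic.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

set_seed(0)
```

With identical seeds and deterministic kernels, two runs of the script should produce bit-identical losses, which is exactly the property we need when comparing against the NumPy, C, and CUDA versions later in the chapter.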
Before we start our PyTorch implementation, let’s introduce the problem we’ll be solving. Our task is to classify handwritten digits using the famous MNIST dataset, often considered the "hello world" of computer vision, as shown in Figure 4.1.
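To make the task concrete before diving into the full script, here is a minimal sketch of the kind of MLP we will be reimplementing at every level: a single hidden layer mapping the 784 flattened pixels of a 28x28 MNIST image to 10 digit classes. The class name `MLP` and the hidden size of 128 are illustrative assumptions; the architecture in v1.py may differ.

```python
import torch
from torch import nn

class MLP(nn.Module):
    """One-hidden-layer perceptron for 28x28 MNIST digits (sketch)."""

    def __init__(self, hidden: int = 128):  # hidden size is an assumption
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(),               # (N, 1, 28, 28) -> (N, 784)
            nn.Linear(28 * 28, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 10),      # raw logits for the 10 digit classes
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

model = MLP()
loss_fn = nn.CrossEntropyLoss()  # expects raw logits, applies softmax internally
```

Keeping the model this small is deliberate: every weight matrix and activation here has an obvious counterpart in the NumPy, C, and CUDA implementations that follow.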