3 Classify images with a vision transformer (ViT)
This chapter covers
- Dividing an image into patches of tokens
- Training a transformer to predict the next image token
- Classifying CIFAR-10 images using a vision transformer (ViT)
- Visualizing how a ViT pays attention to different parts of an image
In transformer-based text-to-image generation, a pivotal step is converting an image into a sequence of tokens, much as we convert a sentence into a sequence of word tokens in natural language processing. This is where vision transformers (ViTs) come in. Introduced by Google researchers in the landmark 2020 paper “An Image Is Worth 16x16 Words,” ViTs brought the power of transformer architectures, originally designed for natural language, to computer vision.[1] Their key innovation is to treat an image as a sequence of patches, enabling transformers to excel at image classification and beyond.
This chapter guides you through the core ideas and practical implementation of ViTs. You’ll build a ViT from scratch and train it to classify images from the widely used CIFAR-10 dataset. The process begins by splitting each image into an 8×8 grid of patches; because each CIFAR-10 image is 32×32 pixels, this yields 64 patches of 4×4 pixels each. Each patch is treated as a token in the transformer’s input sequence. This adaptation lets us bring the strengths of transformers, such as self-attention and long-range dependency modeling, to computer vision tasks.
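Before building the full model, it helps to see this patch-splitting step in isolation. The following is a minimal sketch, not the chapter’s final implementation: it assumes a dummy 3×32×32 tensor standing in for a CIFAR-10 image and uses PyTorch’s unfold to carve it into an 8×8 grid of 4×4 patches, flattening each patch into a 48-value token.

```python
import torch

# Dummy image standing in for one CIFAR-10 sample: 3 channels, 32x32 pixels.
image = torch.randn(3, 32, 32)
patch_size = 4   # 32 / 8 = 4 pixels per side, giving an 8x8 grid of patches

# Carve the height and width dimensions into non-overlapping 4x4 blocks.
patches = image.unfold(1, patch_size, patch_size)    # shape (3, 8, 32, 4)
patches = patches.unfold(2, patch_size, patch_size)  # shape (3, 8, 8, 4, 4)

# Put the grid positions first, then flatten each patch into one token
# of length 4 * 4 * 3 = 48.
tokens = patches.permute(1, 2, 0, 3, 4).reshape(64, -1)

print(tokens.shape)  # torch.Size([64, 48]) -- 64 tokens, 48 values each
```

The resulting 64×48 tensor is exactly the kind of token sequence a transformer expects: 64 positions, each with a fixed-length vector that a linear layer can later project into the model’s embedding dimension.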