
3 Classify images with a vision transformer (ViT)


This chapter covers

  • Dividing an image into a sequence of patch tokens
  • Training a transformer to predict the next image token
  • Classifying CIFAR-10 images using a trained vision transformer (ViT)
  • Visualizing how a trained ViT pays attention to different parts of an image

In transformer-based text-to-image generation, a crucial step is converting an image into a sequence of tokens, much as text is tokenized into a sequence of words. This is done with a vision transformer (ViT), an architecture that brought transformer models to computer vision. The ViT was introduced in 2020 by a team of researchers at Google in their seminal paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.[1] As the title suggests, a ViT divides an image into fixed-size patches (16 × 16 pixels in the original paper) and treats each patch as one token. Their work demonstrated that transformers, originally designed for natural language processing (NLP), can achieve state-of-the-art performance on a wide range of computer vision tasks.
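To make the idea concrete, here is a minimal sketch (not the book's code) of turning images into patch tokens. It assumes PyTorch and a 4 × 4 patch size, which suits the 32 × 32 CIFAR-10 images used later in this chapter:

import torch

def image_to_patches(images: torch.Tensor, patch_size: int = 4) -> torch.Tensor:
    """Split a batch of images (B, C, H, W) into flattened patch tokens
    of shape (B, num_patches, C * patch_size * patch_size)."""
    b, c, h, w = images.shape
    # Unfold height and width into non-overlapping patch_size blocks:
    # (B, C, H, W) -> (B, C, H/p, W/p, p, p)
    patches = images.unfold(2, patch_size, patch_size)
    patches = patches.unfold(3, patch_size, patch_size)
    # Reorder so each patch becomes one flattened token:
    # (B, C, H/p, W/p, p, p) -> (B, H/p * W/p, C * p * p)
    patches = patches.permute(0, 2, 3, 1, 4, 5)
    return patches.reshape(b, -1, c * patch_size * patch_size)

x = torch.randn(8, 3, 32, 32)   # a batch of CIFAR-10-sized images
tokens = image_to_patches(x)
print(tokens.shape)             # torch.Size([8, 64, 48]): 64 tokens per image

With a 4 × 4 patch, each 32 × 32 image yields (32/4) × (32/4) = 64 tokens, and each token is a vector of 3 × 4 × 4 = 48 pixel values; the transformer then processes this sequence just as it would a sequence of word embeddings.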

3.1 The blueprint to train a vision transformer

3.1.1 How to convert images to sequences

3.1.2 How to train a vision transformer for classification

3.2 The CIFAR-10 dataset

3.2.1 Download and visualize CIFAR-10 images

3.2.2 Prepare datasets for training and testing

3.3 Build a vision transformer (ViT) from scratch

3.3.1 Divide images into patches

3.3.2 Model the positions of different patches in an image

3.3.3 The multi-head self-attention mechanism

3.3.4 Build an encoder-only transformer

3.3.5 Use the vision transformer to create a classifier

3.4 Train and use the vision transformer to classify images

3.4.1 Choose the optimizer and the loss function

3.4.2 Train the vision transformer for image classification

3.4.3 Classify images using the trained ViT

3.5 Summary