
3 Classify images with a vision transformer

 

This chapter covers

  • Dividing an image into a sequence of patch tokens
  • Building and training an encoder-only transformer on image patches
  • Classifying CIFAR-10 images using a vision transformer (ViT)
  • Visualizing how a ViT pays attention to different parts of an image

Building on the ideas from the previous chapter, where we explored how transformers handle sequential data in language, we can now extend this perspective to images. In transformer-based text-to-image generation, a pivotal step is converting an image into a sequence of tokens, much like the way a sentence is treated as a sequence of word tokens. This is where vision transformers (ViTs) come in. ViTs, introduced by Google researchers in their landmark 2020 paper “An Image Is Worth 16 × 16 Words” [1], brought the power of transformer architectures, originally designed for natural language, to computer vision. This innovation allows us to use attention-based mechanisms to connect text and images in a unified framework.
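To preview the first of these steps, the short sketch below shows how a single image tensor can be cut into non-overlapping patches and flattened into a sequence of tokens using PyTorch's unfold. This is only an illustration, not the chapter's implementation: the 32 × 32 image size matches CIFAR-10, while the 4 × 4 patch size is an assumed value chosen to keep the numbers small.

import torch

# A minimal sketch of treating "an image as a sequence of patch tokens."
# The 32 x 32 size matches CIFAR-10; the 4 x 4 patch size is an assumption
# chosen for illustration, not necessarily what the chapter will use.
image = torch.randn(3, 32, 32)              # (channels, height, width)
patch_size = 4

# Slide a non-overlapping patch_size x patch_size window over the height and
# width dimensions, producing an 8 x 8 grid of patches.
patches = image.unfold(1, patch_size, patch_size)    # (3, 8, 32, 4)
patches = patches.unfold(2, patch_size, patch_size)  # (3, 8, 8, 4, 4)

# Flatten each patch into a vector so the image becomes a sequence of tokens.
tokens = patches.permute(1, 2, 0, 3, 4).reshape(-1, 3 * patch_size * patch_size)
print(tokens.shape)  # torch.Size([64, 48]): 64 tokens, each with 48 values

Each of the 64 rows plays the role of one “visual word”; the rest of the chapter builds the patch embeddings, positional information, and encoder-only transformer that turn such a sequence into an image classifier.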

3.1 The blueprint to train a ViT

3.1.1 Converting images to sequences

3.1.2 Training a ViT for classification

3.2 The CIFAR-10 dataset

3.2.1 Downloading and visualizing CIFAR-10 images

3.2.2 Preparing datasets for training and testing

3.3 Building a ViT from scratch

3.3.1 Dividing images into patches

3.3.2 Modeling the positions of different patches in an image

3.3.3 Using the multi-head self-attention mechanism

3.3.4 Building an encoder-only transformer

3.3.5 Using the ViT to create a classifier

3.4 Training and using the ViT to classify images

3.4.1 Choosing the optimizer and the loss function

3.4.2 Training the ViT for image classification

3.4.3 Classifying images using the trained ViT

Summary