Chapter 3
This chapter covers
- Dividing an image into patches of tokens
- Training a transformer to predict the next image token
- Classifying CIFAR-10 images using a vision transformer (ViT)
- Visualizing how a ViT pays attention to different parts of an image
Building on the ideas from the previous chapter, where we explored how transformers handle sequential data in language, we can now extend this perspective to images. In transformer-based text-to-image generation, a pivotal step is converting an image into a sequence of tokens, much as we tokenize the words in a sentence. This is where vision transformers (ViTs) come in. Introduced by Google researchers in the landmark 2020 paper “An Image Is Worth 16 × 16 Words” [1], ViTs brought the power of transformer architectures, originally designed for natural language, to the world of computer vision. This innovation lets us use attention-based mechanisms to connect text and images in a unified framework.
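To make this tokenization step concrete before we build it out in the chapter, here is a minimal PyTorch sketch of splitting an image into non-overlapping patches. The function name `patchify` and the parameter values are illustrative choices, not part of any library API; the sketch assumes images come as tensors of shape (batch, channels, height, width) with dimensions divisible by the patch size.

```python
import torch

def patchify(images, patch_size=16):
    """Split a batch of images into flattened, non-overlapping patches.

    images: tensor of shape (batch, channels, height, width), where
    height and width are divisible by patch_size.
    Returns a tensor of shape (batch, num_patches, channels * patch_size**2).
    """
    # unfold extracts sliding blocks; with stride equal to the kernel
    # size, the blocks are non-overlapping patches of the image,
    # giving shape (batch, channels * patch_size**2, num_patches)
    patches = torch.nn.functional.unfold(
        images, kernel_size=patch_size, stride=patch_size
    )
    # Move the patch dimension forward so each patch is one "token":
    # (batch, num_patches, channels * patch_size**2)
    return patches.transpose(1, 2)

# A 32 x 32 CIFAR-10 image cut into 4 x 4 patches yields (32/4)**2 = 64 tokens
x = torch.randn(1, 3, 32, 32)
tokens = patchify(x, patch_size=4)
print(tokens.shape)  # torch.Size([1, 64, 48])
```

In a full ViT, each flattened patch is then passed through a learned linear projection to the model's embedding dimension, just as word tokens are mapped to word embeddings before entering a language transformer.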