3 Classify images with a vision transformer (ViT)
This chapter covers
- Dividing an image into patches and treating the patches as tokens
- Training a transformer to predict the next image token
- Classifying CIFAR-10 images using a trained vision transformer (ViT)
- Visualizing how a trained ViT pays attention to different parts of an image
In transformer-based text-to-image generation, a crucial step is converting an image into a sequence of tokens, much as text is converted into a sequence of word tokens. This conversion is performed by a vision transformer (ViT), an approach that has revolutionized the application of transformer models to computer vision tasks. ViT was introduced in 2020 by a team of researchers at Google in their seminal paper, An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.[1] Their work demonstrated that transformers, originally designed for natural language processing (NLP), can achieve state-of-the-art performance across a wide range of computer vision challenges.
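To make the idea of turning an image into a sequence of tokens concrete, here is a minimal NumPy sketch of the patch-splitting step the paper's title alludes to: an image is cut into non-overlapping 16x16 patches, and each patch is flattened into one vector, so the image becomes a sequence of patch vectors. The function name and array shapes here are illustrative assumptions, not code from this book.

```python
import numpy as np

def image_to_patches(images, patch_size=16):
    # images: (batch, height, width, channels); height and width
    # are assumed to be divisible by patch_size
    b, h, w, c = images.shape
    p = patch_size
    # split height and width into (num_patches, patch_size) blocks
    x = images.reshape(b, h // p, p, w // p, p, c)
    # group the two patch-grid axes together: (b, h//p, w//p, p, p, c)
    x = x.transpose(0, 1, 3, 2, 4, 5)
    # flatten each p x p x c patch into a single token vector
    return x.reshape(b, (h // p) * (w // p), p * p * c)

imgs = np.random.rand(2, 224, 224, 3)   # two dummy 224x224 RGB images
tokens = image_to_patches(imgs)
print(tokens.shape)  # (2, 196, 768): 14x14 = 196 patches, each 16*16*3 = 768 values
```

A 224x224 image thus yields a sequence of 196 "words" of 768 numbers each, which a standard transformer can process exactly as it would a sentence of token embeddings.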