
4 Add captions to images


This chapter covers

  • Similarities between image-to-text generation and text-to-image generation
  • Building a transformer from scratch to add captions to images
  • Training an image-to-text transformer with image–caption pairs
  • Adding captions to images with the trained image-to-text transformer

In the previous chapter, we took our first steps in connecting vision and language by learning how models can align these two very different modalities. Now we’ll build on that foundation. While our ultimate goal is text-to-image generation, we’ll first master the reverse process, image-to-text captioning, because both directions rely on the same underlying principle of learning deep cross-modal relationships.

Training a model to add captions to images is conceptually and practically more accessible than generating a new image from a text prompt. Captioning forces the model to understand and encode visual features and then map them coherently to linguistic tokens. This not only builds intuition for handling multimodal data but also lays the groundwork for the more ambitious task of producing realistic images from text prompts, which reverses this mapping and adds considerable complexity.
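To make the encoder–decoder idea concrete before we dive in, here is a minimal sketch of the kind of captioning model this chapter builds: a ViT-style image encoder supplies visual features, and a causal transformer decoder maps them to caption tokens. The sketch assumes PyTorch and uses its built-in transformer layers for brevity, whereas the chapter builds the components from scratch; every name and dimension in it (ImageCaptioner, d_model, the 768-dimensional patch features, and so on) is illustrative rather than the book's actual implementation.

```python
# Minimal sketch (illustrative only): a ViT-style encoder's patch features
# are cross-attended by a causal text decoder that predicts caption tokens.
import torch
import torch.nn as nn

class ImageCaptioner(nn.Module):
    def __init__(self, vocab_size=5000, d_model=256, n_heads=8, n_layers=4):
        super().__init__()
        # Project precomputed ViT patch features into the decoder's space.
        self.patch_proj = nn.Linear(768, d_model)
        # Token embeddings for the caption (positional embeddings omitted
        # here for brevity).
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patch_feats, caption_ids):
        # patch_feats: (batch, num_patches, 768) visual features from a ViT
        # caption_ids: (batch, seq_len) token ids of the shifted caption
        memory = self.patch_proj(patch_feats)
        tgt = self.tok_emb(caption_ids)
        # Causal mask so each caption token attends only to earlier tokens.
        seq_len = caption_ids.size(1)
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)  # logits over the caption vocabulary

# Toy usage with random tensors, just to show the shapes involved.
model = ImageCaptioner()
feats = torch.randn(2, 196, 768)            # 2 images, 196 patches each
ids = torch.randint(0, 5000, (2, 12))       # 2 captions, 12 tokens each
logits = model(feats, ids)                  # (2, 12, 5000)
```

The same three ingredients, an image encoder, a text decoder with cross-attention, and a causal attention mask, are exactly what sections 4.1 through 4.4 construct step by step.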

4.1 Training and using a transformer to add captions

4.1.1 Preparing data and the causal attention mask

4.1.2 Creating and training a transformer

4.2 Preparing the training dataset

4.2.1 Downloading and visualizing Flickr 8k images

4.2.2 Building a vocabulary of tokens

4.2.3 Preparing the training dataset

4.3 Creating a multimodal transformer to add captions

4.3.1 Defining a ViT as the image encoder

4.3.2 Creating the decoder to generate text

4.4 Training and using the image-to-text transformer

4.4.1 Training the encoder–decoder transformer

4.4.2 Adding captions to images with the trained model

Summary