
4 Add captions to images


This chapter covers

  • Similarities between image-to-text generation and text-to-image generation
  • Building a transformer from scratch to add captions to images
  • Training an image-to-text transformer with image-caption pairs
  • Adding captions to images with the trained image-to-text transformer

Training a multi-modal transformer for image-to-text generation (i.e., adding captions to images) and training one for text-to-image generation (i.e., generating images from textual descriptions) share several similarities, primarily because both tasks involve learning complex mappings between the textual and visual modalities. Understanding and training a multi-modal transformer for image-to-text generation first provides valuable insights and foundational knowledge that make the more complex task of text-to-image generation more manageable. For this reason, you'll build and train an image-to-text transformer from scratch in this chapter to lay a solid foundation for the text-to-image generation skills you'll develop in later chapters.

Specifically, you'll go through all the steps needed to build a multi-modal transformer that adds captions to images. You'll use a dataset of image-caption pairs as the training data. Once the model is trained, you can feed it an image and obtain a coherent caption describing what's in the image, as sketched in the example below.
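Before diving into the details, here is a minimal, self-contained PyTorch sketch of the encoder-decoder flow this chapter builds: a patch-based image encoder, a causal text decoder, and a greedy decoding loop. Every name in it (TinyCaptioner, greedy_caption, the toy vocabulary) is an illustrative placeholder, not the chapter's actual code.

import torch
import torch.nn as nn

# Toy vocabulary standing in for the real one built from Flickr 8k captions.
VOCAB = ["<pad>", "<start>", "<end>", "a", "dog", "runs", "on", "grass"]
stoi = {t: i for i, t in enumerate(VOCAB)}

class TinyCaptioner(nn.Module):
    def __init__(self, d_model=64, patch=16):
        super().__init__()
        # Image encoder: split the image into patches and embed each one.
        self.patchify = nn.Conv2d(3, d_model, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        # Text decoder: attends to the encoded patches while generating tokens
        # (positional embeddings omitted for brevity).
        self.tok_emb = nn.Embedding(len(VOCAB), d_model)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, len(VOCAB))

    def forward(self, images, captions):
        patches = self.patchify(images).flatten(2).transpose(1, 2)  # (B, N, d)
        memory = self.encoder(patches)
        tgt = self.tok_emb(captions)
        # Causal mask so each position attends only to earlier tokens.
        T = captions.size(1)
        mask = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)                                    # (B, T, V)

@torch.no_grad()
def greedy_caption(model, image, max_len=10):
    tokens = [stoi["<start>"]]
    for _ in range(max_len):
        inp = torch.tensor(tokens).unsqueeze(0)
        logits = model(image, inp)
        next_tok = logits[0, -1].argmax().item()   # most likely next token
        if next_tok == stoi["<end>"]:
            break
        tokens.append(next_tok)
    return " ".join(VOCAB[t] for t in tokens[1:])

model = TinyCaptioner()
print(greedy_caption(model, torch.randn(1, 3, 64, 64)))  # untrained: random tokens

Running this untrained sketch produces an essentially random sequence of tokens; the rest of the chapter replaces each placeholder with a real component and trains the full model on image-caption pairs so the generated captions actually describe the input image.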

4.1 How to train and use a transformer to add captions

4.1.1 Data preparation and the causal attention mask

4.1.2 Create and train a transformer

4.2 Prepare the training dataset

4.2.1 Download and visualize Flickr 8k images

4.2.2 Build a vocabulary of tokens

4.2.3 Prepare the training dataset

4.3 Create a multi-modal transformer to add captions

4.3.1 Define a vision transformer as the image encoder

4.3.2 The decoder to generate text

4.4 Train and use the image-to-text transformer

4.4.1 Train the encoder-decoder transformer

4.4.2 Add captions to images with the trained model

4.5 Summary