This chapter covers
- Why image-to-text captioning is a stepping stone toward text-to-image generation
- How a captioning model encodes visual features and maps them to linguistic tokens
In the previous chapter, we took our first steps in connecting vision and language by learning how models can align these two very different modalities. Now we’ll build on that foundation. While our ultimate goal is text-to-image generation, we’ll first master the reverse process, image-to-text captioning, because both directions rely on the same underlying principle of learning deep cross-modal relationships.
Training a model to caption images is conceptually and practically more accessible than generating a new image from a text prompt. Captioning forces the model to understand and encode visual features and then map them coherently to linguistic tokens. This not only builds intuition for handling multimodal data but also lays the groundwork for more ambitious tasks, such as producing realistic images from text prompts, which require the model to run this mapping in reverse, with added complexity.
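To make the encode-then-map idea concrete, here is a minimal, hypothetical PyTorch sketch (not the chapter's actual model): flattened image patches are projected into visual features, and a small transformer decoder attends to those features while predicting the next caption token. All names and dimensions (TinyCaptioner, d_model, the patch size) are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TinyCaptioner(nn.Module):
    """Hypothetical minimal captioner: encode image patches, then decode tokens."""

    def __init__(self, vocab_size=10000, d_model=256, patch_dim=16 * 16 * 3):
        super().__init__()
        # Vision side: project flattened image patches into feature vectors.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        # Language side: embed previously generated caption tokens.
        self.token_embed = nn.Embedding(vocab_size, d_model)
        # The decoder attends to the visual features while predicting the next token.
        layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=2)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, patches, tokens):
        memory = self.patch_embed(patches)   # (batch, n_patches, d_model)
        tgt = self.token_embed(tokens)       # (batch, seq_len, d_model)
        # Causal mask so each position only sees earlier caption tokens.
        seq_len = tokens.size(1)
        mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)             # next-token logits over the vocabulary

# Toy forward pass with random data, just to show the shapes involved.
model = TinyCaptioner()
patches = torch.randn(2, 49, 16 * 16 * 3)       # 2 images, 49 flattened patches each
tokens = torch.randint(0, 10000, (2, 12))       # 2 partial captions, 12 tokens each
logits = model(patches, tokens)
print(logits.shape)                             # torch.Size([2, 12, 10000])
```

A real captioning model would typically start from a pretrained vision encoder (such as the aligned encoder from the previous chapter) and train the decoder on image-caption pairs, but the overall shape of the computation, visual features in, caption tokens out, is the same.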