4 Add captions to images
This chapter covers
- Similarities between image-to-text generation and text-to-image generation
- Building a transformer from scratch to add captions to images
- Training an image-to-text transformer with image-caption pairs
- Adding captions to images with the trained image-to-text transformer
Training a multi-modal transformer for image-to-text generation (i.e., adding captions to images) and training one for text-to-image generation (i.e., generating images from textual descriptions) have much in common: both tasks involve learning complex mappings between the textual and visual modalities. Tackling image-to-text generation first provides valuable insights and foundational knowledge that make the more complex task of text-to-image generation more manageable. For this reason, you'll build and train an image-to-text transformer from scratch in this chapter, laying a solid foundation for the text-to-image generation skills you'll develop in later chapters.
Specifically, you'll go through all the steps needed to build a multi-modal transformer that adds captions to images. You'll train the model on a dataset of image-caption pairs. Once the model is trained, you can feed it an image and obtain a coherent caption describing what's in the image.
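To make the overall architecture concrete before we build it step by step, here is a minimal sketch of an image-to-text transformer in PyTorch. It is not the chapter's final implementation: the class name `ImageCaptioner`, the patch-based image encoder, and all hyperparameter values are illustrative assumptions. The key idea it demonstrates is the one the chapter develops: an encoder turns the image into a sequence of visual tokens, and a transformer decoder cross-attends to those tokens while generating the caption one token at a time.

```python
# A minimal sketch, assuming PyTorch. Names and hyperparameters are
# illustrative, not the chapter's exact model.
import torch
import torch.nn as nn

class ImageCaptioner(nn.Module):
    def __init__(self, vocab_size, d_model=256, nhead=8,
                 num_layers=3, patch_size=16, img_size=224):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # Encoder side: split the image into patches and embed each one,
        # producing a sequence of visual tokens for the decoder to attend to.
        self.patch_embed = nn.Conv2d(3, d_model,
                                     kernel_size=patch_size,
                                     stride=patch_size)
        self.img_pos = nn.Parameter(torch.zeros(1, num_patches, d_model))
        # Decoder side: embed caption tokens and cross-attend to the image.
        self.tok_embed = nn.Embedding(vocab_size, d_model)
        self.txt_pos = nn.Parameter(torch.zeros(1, 512, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, images, captions):
        # images: (batch, 3, H, W); captions: (batch, seq_len) token ids
        x = self.patch_embed(images)      # (batch, d_model, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)  # (batch, num_patches, d_model)
        memory = x + self.img_pos
        seq_len = captions.size(1)
        tgt = self.tok_embed(captions) + self.txt_pos[:, :seq_len]
        # Causal mask so each position attends only to earlier tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len)
        out = self.decoder(tgt, memory, tgt_mask=mask)
        return self.lm_head(out)          # next-token logits

# Usage: teacher forcing on an image-caption pair, with dummy data.
model = ImageCaptioner(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 20))
logits = model(images, captions[:, :-1])  # predict token t+1 from tokens up to t
loss = nn.functional.cross_entropy(
    logits.reshape(-1, 10000), captions[:, 1:].reshape(-1))
```

During training, the decoder sees the ground-truth caption shifted by one position (teacher forcing); at inference time, you instead feed the model an image plus a start token and generate the caption autoregressively, one token at a time.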