13 New developments and challenges in text-to-image generation

This chapter covers

  • How state-of-the-art text-to-image generators work
  • Text-to-image model challenges and concerns
  • Creating a model to distinguish real images from deepfakes
  • Preparing a large-scale dataset of real and fake images for fine-tuning
  • Testing fine-tuned models on unseen images

By now, we’ve explored two approaches to text-to-image generation: building models from scratch and tapping into the creative power of modern pretrained AI. From early transformer-based generators to cutting-edge diffusion models, we’ve seen how machines can turn simple text prompts into breathtaking images.

Yet with these advances come profound new challenges. As the quality and realism of generated images have skyrocketed, so too have the risks. Deepfakes, AI-generated images and videos designed to deceive, are increasingly indistinguishable from real photographs. This raises not only technical hurdles but also ethical, legal, and societal dilemmas. How do we ensure these powerful tools are used responsibly? Can we reliably detect AI-generated images, and what are the broader consequences when detection fails?

13.1 State-of-the-art text-to-image generators

13.1.1 DALL-E series

13.1.2 Google’s Imagen

13.1.3 Latent diffusion models: Stable Diffusion and Midjourney

13.2 Challenges and concerns

13.3 A blueprint to fine-tune ResNet50

13.3.1 The history and architecture of ResNet50

13.3.2 A plan to fine-tune ResNet50 for classification

13.3.3 Using ResNet50 to classify images

13.4 Fine-tuning ResNet50 to detect fake images

13.4.1 Downloading and preprocessing real and fake face images

13.4.2 Fine-tuning ResNet50

13.4.3 Detecting deepfakes using the fine-tuned ResNet50

Summary