13 New developments and challenges in text-to-image generation


This chapter covers

  • How state-of-the-art text-to-image generators work
  • Challenges and concerns faced by text-to-image models
  • Creating a model to distinguish real images from deepfakes
  • Preparing a large-scale dataset of real and fake images for fine-tuning
  • Testing the fine-tuned model on unseen images

By now, we have explored two approaches to text-to-image generation: building models from scratch and unlocking the creative potential of modern AI. From early transformer-based generators to cutting-edge diffusion models, we've seen how machines can now turn simple text prompts into breathtaking images.

Yet, with these advances come profound new challenges. As the quality and realism of generated images have skyrocketed, so too have the risks. Deepfakes, AI-generated images and videos designed to deceive, are now increasingly indistinguishable from real photographs. This presents not only technical hurdles, but also ethical, legal, and societal dilemmas. How do we ensure these powerful tools are used responsibly? Can we reliably detect AI-generated images, and what are the broader consequences when detection fails?

13.1 State-of-the-art text-to-image generators

13.1.1 DALL-E series

13.1.2 Google's Imagen

13.1.3 Latent diffusion models: Stable Diffusion and Midjourney

13.2 Challenges and concerns

13.3 A blueprint to fine-tune ResNet-50

13.3.1 The history and architecture of ResNet-50

13.3.2 A plan to fine-tune ResNet-50 for classification

13.3.3 Use ResNet-50 to classify images

13.4 Fine-tune ResNet-50 to detect fake images

13.4.1 Download and preprocess real and fake face images

13.4.2 Fine-tune ResNet-50

13.4.3 Detect deepfakes using the fine-tuned ResNet-50

13.5 Summary