In this chapter
- Understanding the intuition of generative image models
- Cleaning and preparing image data for training
- Understanding and implementing the steps to train a diffusion model
- Implementing the major components of a diffusion model, including a U-Net
Generative image models are a family of algorithms and artificial neural network (ANN) architectures that specialize in generating novel images from human-language input. Imagine a sculptor who has spent his life observing people in deep thought. For years, he walked around town and studied every person he saw thinking deeply, noting their posture, expressions, and subtle details. Over time, he internalized what someone deep in thought looks like. He might have seen hundreds of thousands of people pass through the town over the years.
We blindfold the sculptor, give him a random block of marble, and say, “Make me a sculpture of a person thinking.” The sculptor can’t add new material to the marble; instead, he feels the block and chips away a small piece that he’s confident does not look like a person thinking. He repeats this process thousands of times, each time feeling the edges of the marble and chipping away a little more noise. Slowly and methodically, a coherent figure of a person thinking emerges from the random block of stone (figure 12.1).
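The sculptor's process maps directly onto how a diffusion model samples: start from pure noise and repeatedly subtract a small amount of predicted noise. The toy sketch below illustrates only that iterative loop; the "noise predictor" here cheats by peeking at a known target image, whereas a real diffusion model learns this prediction with a neural network (the function names and step sizes are illustrative assumptions, not part of any library):

```python
import numpy as np

def generate(target, steps=500, step_size=0.02, seed=0):
    """Toy denoising loop: 'chip away' noise from a random start.

    `target` stands in for what a trained model would steer toward;
    a real model predicts the noise without ever seeing the target.
    """
    rng = np.random.default_rng(seed)
    image = rng.normal(size=target.shape)  # the random block of marble
    for _ in range(steps):
        predicted_noise = image - target           # toy noise estimate (cheats)
        image = image - step_size * predicted_noise  # chip away a little noise
    return image

# A tiny 4x4 "image" standing in for the person thinking
target = np.linspace(0.0, 1.0, 16).reshape(4, 4)
result = generate(target)
print(np.abs(result - target).max())  # residual shrinks with each small step
```

Each pass removes only a small fraction of the estimated noise, just as the sculptor removes one chip at a time; taking many small steps rather than one large one is what makes the process stable.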