12 Generative Image Models

This chapter covers

  • Understanding the intuition behind generative image models
  • Cleaning and preparing image data for training
  • Walking through the steps to train a diffusion model
  • Implementing the major components of a diffusion model and U-Net

12.1 What are generative image models?

Generative image models are a family of algorithms and artificial neural network architectures that specialize in generating images that faithfully match a description given in human language.

Imagine a sculptor who has spent his life observing people in deep thought. For years, he walks around his town and studies every person he sees thinking deeply, carefully noting their posture, expressions, and subtle details. Over time, after watching hundreds of thousands of people pass through the town, he internalizes what it means to look like someone thinking.

Now we blindfold the sculptor, hand him a random block of marble, and ask, “Make me a sculpture of a person thinking.” The sculptor can’t add new material to the marble; instead, he feels the block and chips away a small piece that he’s confident does not look like a person thinking. He repeats this thousands of times, each time feeling the edges of the marble and chipping away a little more “noise.” Slowly and methodically, a coherent image of a person thinking emerges from the random block of stone (figure 12.1).
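If you already think in code, the sculptor’s process maps directly onto the loop at the heart of a diffusion model. The following sketch is purely illustrative: predict_noise is a hypothetical stand-in for the trained U-Net we build later in this chapter, and the update rule is deliberately simplified, omitting the exact noise schedule we cover in section 12.9.2.

    import numpy as np

    def generate(predict_noise, steps=1000, shape=(64, 64, 3)):
        # Start from the "random block of marble": pure Gaussian noise.
        x = np.random.randn(*shape)
        # Work backward from the noisiest step to the cleanest.
        for t in reversed(range(steps)):
            # Ask the model which part of x still looks like noise...
            noise_estimate = predict_noise(x, t)
            # ...and chip a small piece of that noise away.
            x = x - noise_estimate / steps
        return x  # a coherent image emerges from the noise

    # Toy stand-in for a trained model, just to show the call shape:
    image = generate(lambda x, t: 0.1 * x, steps=100)

The key idea to hold onto is the shape of the loop: the model never draws an image directly; it only ever estimates noise, and the image appears as a byproduct of removing that noise a little at a time.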

12.2 The intuition behind image generation

12.2.1 A generative image model training workflow

12.3 Preparing image training data

12.3.1 Selecting and collecting image data

12.3.2 Cleaning and preprocessing image data

12.4 Embedding: From images to numbers

12.5 Designing the architecture (and why U-Nets)

12.5.1 Convolutional neural networks (CNNs)

12.5.2 The U-Net (a specialized CNN)

12.6 Denoising: From numbers to an image

12.6.1 Encoder: Down-sampling layers

12.6.2 Bridge (also known as the bottleneck)

12.6.3 Decoder: Up-sampling layers

12.7 Learning: Calculating loss and backpropagation

12.7.1 Calculating loss

12.7.2 Backpropagation

12.8 Generating an image

12.8.1 Starting with a blank canvas (of pure noise)

12.8.2 Denoising the data

12.9 Controlling the diffusion model

12.9.1 Training data composition and diversity

12.9.2 Timesteps and noise schedule

12.9.3 Attention layers and cross-attention injection

12.9.4 Training epochs

12.10 Inpainting and outpainting

12.11 LoRA (Low-Rank Adaptation)