12 Generative Image Models

This chapter covers

  • Understanding the intuition behind generative image models
  • Cleaning and preparing image data for training
  • Walking through the steps to train a diffusion model
  • Implementing the major components of a diffusion model and U-Net

12.1 What are generative image models?

Generative image models are a family of algorithms and artificial neural network architectures that specialize in generating images that faithfully match a description given in human language.

Imagine a sculptor who has spent his life observing people in deep thought. For years, he walks around his town and studies every person he sees thinking deeply, carefully noting their posture, expressions, and subtle details. Over time, after watching hundreds of thousands of people pass through the town, he internalizes what it means to look like someone thinking.

Now we blindfold the sculptor, hand him a random block of marble, and ask, “Make me a sculpture of a person thinking.” The sculptor can’t add new material to the marble; instead, he feels the block and chips away a small piece that he’s confident does not look like a person thinking. He repeats this thousands of times, each time feeling the edges of the marble and chipping away a little more “noise.” Slowly and methodically, a coherent image of a person thinking emerges from the random block of stone (figure 12.1).
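If you already think in code, the sculptor’s process maps directly onto the loop at the heart of a diffusion model. The following sketch is purely illustrative: predict_noise is a hypothetical stand-in for the trained U-Net we build later in this chapter, and the update rule is deliberately simplified, omitting the exact noise schedule we cover in section 12.9.2.

    import numpy as np

    def generate(predict_noise, steps=1000, shape=(64, 64, 3)):
        # Start from the "random block of marble": pure Gaussian noise.
        x = np.random.randn(*shape)
        # Work backward from the noisiest step to the cleanest.
        for t in reversed(range(steps)):
            # Ask the model which part of x still looks like noise...
            noise_estimate = predict_noise(x, t)
            # ...and chip a small piece of that noise away.
            x = x - noise_estimate / steps
        return x  # a coherent image emerges from the noise

    # Toy stand-in for a trained model, just to show the call shape:
    image = generate(lambda x, t: 0.1 * x, steps=100)

The key idea to hold onto is the shape of the loop: the model never draws an image directly; it only ever estimates noise, and the image appears as a byproduct of removing that noise a little at a time.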

12.2 The intuition behind image generation

12.2.1 A generative image model training workflow

12.3 Preparing image training data

12.3.1 Selecting and collecting image data

12.3.2 Cleaning and preprocessing image data

12.4 Embedding: From images to numbers

12.5 Designing the architecture (and why U-Nets)

12.5.1 Convolutional neural networks (CNNs)

12.5.2 The U-Net (a specialized CNN)

12.6 Denoising: From numbers to an image

12.6.1 Encoder: Down-sampling layers

12.6.2 Bridge (also known as the bottleneck)

12.6.3 Decoder: Up-sampling layers

12.7 Learning: Calculating loss and backpropagation

12.7.1 Calculating loss

12.7.2 Backpropagation

12.8 Generating an image

12.8.1 Starting with a blank canvas (of pure noise)

12.8.2 Denoising the data

12.9 Controlling the diffusion model

12.9.1 Training data composition and diversity

12.9.2 Timesteps and noise schedule

12.9.3 Attention layers and cross-attention injection

12.9.4 Training epochs

12.10 Inpainting and outpainting

12.11 LoRA (Low-Rank Adaptation)