4 Understanding How Stable Diffusion Works

 
“I suffer from an incurable need to understand.”

- Professor Sogol, Mount Analogue

This chapter covers

  • What we mean by AI, Machine Learning and Neural Networks
  • A high-level look at what Stable Diffusion is doing
  • Understanding how the Variational Autoencoder works to transform images
  • Learning about CLIP text encoding and tokenizing
  • A discussion of how Stable Diffusion learns to create images from noise

Now that we’ve got a good feel for what Stable Diffusion can do, it will be helpful to understand a bit more about how it works under the hood. This will pay off throughout the rest of the book: the more we want to customize Stable Diffusion’s behavior, the deeper our understanding of its inner workings will need to be.

When talking about Stable Diffusion specifically, or diffusion models in general (such as Midjourney or DALL-E), you’ve likely heard terms like “artificial intelligence (AI),” “machine learning,” and “neural networks.” Even for experts in these areas, the precise definitions of these terms, especially AI, can be hard to pin down and are often subject to marketing hype. We’ll start by clearly defining these terms so we can discuss them without getting lost in the hype surrounding them.

4.1 Artificial Intelligence, Machine Learning and Neural Networks

Before diving too deep into how Stable Diffusion works, it will be extremely helpful to define three common terms:

4.1.1 Artificial Intelligence

4.1.2 Machine Learning

4.1.3 Neural Networks

4.2 Overview of How Stable Diffusion Works

4.2.1 A High-Level Overview of How Diffusion Models Work

4.3 The Main Components of Stable Diffusion

4.3.1 Compressing Images with the Variational Autoencoder

4.3.2 Transforming Text with the CLIP Encoder

4.3.3 Estimating Noise with the U-Net

4.3.4 Putting It All Together with the Scheduler/Sampler

4.4 There and Back Again: Stable Diffusion in Code

4.5 Summary