Preface
This book begins with my curiosity about how machines could create images from nothing more than words. When I first encountered DALL-E and Stable Diffusion, the results seemed magical: type a prompt, and out came a lifelike image that matched the description perfectly. But behind the magic were mathematics, code, and a long line of ideas in machine learning. I wanted to demystify those ideas, not just for myself, but for anyone who learns best by building things from scratch.
Generative AI is advancing at a pace few of us could have predicted, reshaping not only the way we work but also how we create, design, and communicate. Text-to-image models in particular are among the most visible and transformative of these technologies. They embody the leap from unimodal to multimodal AI: systems that reason across different types of data. While the headlines focused on their impressive outputs, I found myself drawn to a different question: How do they really work? The only satisfying answer, I decided, was to build one myself.