10 Generative Models for De Novo Design

This chapter covers

  • Challenges of navigating chemical space and limitations of traditional methods.
  • How generative models, particularly autoencoders, can learn a compressed "latent space" representation of molecules.
  • The architectural components of autoencoders, including tokenization, embedding layers, and encoder-decoder structures.
  • Why standard autoencoders fail at generating novel molecules and how Variational Autoencoders (VAEs) solve this with a probabilistic approach.
  • Advanced techniques, such as recurrent neural networks with gated recurrent units (GRUs), cyclical annealing, and byte-pair-encoding tokenization, that combine into powerful generative models for chemistry.

The journey of discovering a new drug is often likened to finding a needle in a colossal haystack: a process fraught with challenges, immense costs, and high attrition rates. At its heart, drug discovery is a molecular design problem, identifying or creating a molecule with the precise set of properties needed to safely and effectively treat a disease. This chapter delves into computational techniques that aim to make this quest more efficient and targeted by generating novel molecules with the desired characteristics from scratch, a process known as de novo design.
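
To give a concrete first taste of the building blocks this chapter assembles, the short sketch below tokenizes a SMILES string character by character and passes the resulting token ids through an embedding layer. This is a minimal, illustrative sketch only: it assumes PyTorch, and the variable names and the 8-dimensional embedding size are arbitrary choices, not the tokenizer or model configuration used in the chapter's experiments. Later sections replace these toy pieces with a proper byte-pair-encoding tokenizer, an encoder-decoder, and a variational autoencoder.

```python
# Minimal sketch (assumes PyTorch): character-level tokenization of a SMILES
# string followed by an embedding lookup. Purely illustrative.
import torch
import torch.nn as nn

smiles = "CC(=O)Oc1ccccc1C(=O)O"  # aspirin, written as a SMILES string

# 1. Tokenize: the simplest scheme treats each character as its own token.
tokens = list(smiles)

# 2. Build a vocabulary mapping each distinct token to an integer id.
vocab = {tok: idx for idx, tok in enumerate(sorted(set(tokens)))}
token_ids = torch.tensor([vocab[tok] for tok in tokens])

# 3. An embedding layer maps each id to a dense vector the encoder can work
#    with (8 dimensions here is an arbitrary, illustrative choice).
embedding = nn.Embedding(num_embeddings=len(vocab), embedding_dim=8)
vectors = embedding(token_ids)

print(token_ids[:6].tolist())
print(vectors.shape)  # torch.Size([21, 8]): one 8-dimensional vector per token
```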

10.1 The Quest for Designer Molecules

10.1.1 The Challenge of Chemical Space

10.1.2 Generative Models: A New Paradigm for Molecular Design

10.1.3 Reinforcement Learning for Targeted Generation

10.2 Building the World: Generative Models for Molecules

10.2.1 Essential Properties of a Good Molecular Latent Space

10.2.2 Learning to Compress and Recreate: The Autoencoder

10.2.3 The Autoencoder Architecture

10.2.4 Experiment on the MOSES Benchmark

10.3 Creating a Continuous Chemical Universe: Variational Autoencoders

10.3.1 The Variational Autoencoder

10.3.2 Posterior Collapse and Cyclic VAE

10.3.3 Monitoring Metrics

10.3.4 Training and Evaluating VAE-CYC

10.4 Understanding Sequential Molecular Structure: Recurrent Neural Networks

10.4.1 How RNNs Process Sequences

10.4.2 Resolving Vanishing Gradients with Gated Recurrent Units

10.4.3 Sequence-to-Sequence Architecture: Encoding and Decoding Molecules

10.4.4 Revisiting Tokenization: Byte-Pair Encoding for Molecules

10.4.5 Putting It All Together: VAE-CYC

10.5 Summary

10.6 References