
Appendix C. Knowledge Distillation: Shrinking Models for Efficient, Hierarchical Molecular Generation

 

This appendix covers

  • The Hierarchical Variational Autoencoder (HierVAE) for generating molecules by assembling chemically valid substructures.
  • Core concepts of knowledge distillation, showing how a compact "student" model can learn from a larger "teacher" model (a minimal loss sketch follows this list).
  • How to apply knowledge distillation to compress a large, pre-trained HierVAE model into a smaller, faster version.
  • A complete implementation pipeline, including student model design, a multi-component loss function, and training strategies.
  • Key metrics, including generation speed, model size, validity, and uniqueness, for analyzing the resulting trade-offs.
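
As a preview of the student-teacher idea named above (and developed in sections C.2.2 and C.2.3), here is a minimal sketch of a temperature-scaled soft-label distillation loss in PyTorch. The function name, signature, and default temperature are illustrative assumptions for this sketch, not the appendix's actual implementation.

import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Soft-label distillation (Hinton et al.): KL divergence between the
    temperature-softened teacher and student output distributions.
    The temperature value here is an illustrative default."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale by T^2 so the gradient magnitude stays comparable to a
    # hard-label cross-entropy term when the two are combined.
    return F.kl_div(log_student, soft_targets, reduction="batchmean") * temperature ** 2

In practice this soft-target term is only one component of the multi-component loss discussed in section C.3.1, where it is combined with the student's own reconstruction and KL-regularization objectives.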
“Given a pre-existing model, we can rebuild it. We have the technology. We can make it smaller than it ever was. Smaller, cheaper, faster!”

--- The Six Million Dollar Man (paraphrased)

C.1 Generative Chemistry as a Motivating Use Case

C.1.1 The Evolution of Molecular Generation

C.1.2 Hierarchical Molecular Generation

C.1.3 HierVAE Architecture for Hierarchical Molecular Generation

C.1.4 Bridge to Knowledge Distillation

C.2 Core Knowledge Distillation Concepts

C.2.1 The Knowledge Distillation Paradigm

C.2.2 Tapping into Dark Knowledge

C.2.3 Controlling Information with Temperature

C.2.4 Online vs. Offline Distillation

C.2.5 Expanding the Dataset with Pseudo-Labeling

C.3 Assembly: Putting It All Together

C.3.1 Multi-component Distillation Loss

C.3.2 Training Strategy: KL Annealing and the Dual Forward Pass

C.3.3 Student Model Design: Balancing Compression and Capability

C.3.4 End-to-end Knowledge Distillation

C.3.5 Future Directions

C.4 Summary

C.5 References