chapter sixteen

16 Synthetic data and distillation

Reinforcement learning from human feedback is deeply rooted in the idea of keeping a human influence on the models we are building. When the first models were trained successfully with RLHF, human data was the only viable way to improve the models in this way.

Humans were the only way to create high enough quality responses to questions to train on them. Humans were the only way to collect reliable and specific feedback data to train reward models.

As AI models got better, this assumption rapidly broke down. The possibility of synthetic data, which is far cheaper and easier to iterate on, enabled the proliferation from RLHF being the center of attention to the idea of a broader "post-training" shaping the models.

Many reports have been made on how synthetic data causes "model collapse" or other issues in models [1], but this has been emphatically rebuked in leading language models [2] [3]. Synthetic data can cause models to have performance issues, but this is caused by using repetitive data or solely data outputted by the model being trained (narrowing its potential distribution) rather than well-rounded data sources.