10 Rejection Sampling
Rejection Sampling (RS) is a popular and simple baseline for performing preference fine-tuning. Rejection sampling operates by curating new candidate completions, filtering them based on a trained reward model, and then fine-tuning the original model only on the top completions.
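The loop below is a minimal sketch of that process under stated assumptions: `generate_completions`, `reward_model`, and `finetune` are hypothetical stand-ins for the current model's sampling routine, a trained reward model, and a standard supervised fine-tuning step, and the number of samples per prompt is an arbitrary choice rather than a recommendation.

```python
def rejection_sampling_finetune(prompts, generate_completions, reward_model, finetune,
                                n_samples_per_prompt=8):
    """Sketch of rejection sampling fine-tuning: generate, score, filter, fine-tune."""
    selected = []
    for prompt in prompts:
        # 1. Curate candidate completions by sampling from the current model.
        candidates = generate_completions(prompt, n=n_samples_per_prompt)
        # 2. Score every candidate with the trained reward model.
        scores = [reward_model(prompt, c) for c in candidates]
        # 3. Keep only the top-scoring completion for this prompt.
        best = candidates[scores.index(max(scores))]
        selected.append((prompt, best))
    # 4. Fine-tune the original model on the filtered (prompt, completion) pairs.
    return finetune(selected)
```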
The name originates from computational statistics [1], where one wishes to sample from a complex distribution but does not have a direct method to do so. To alleviate this, one samples from a simpler distribution that is easy to draw from and uses an accept/reject check to decide whether each sample is permissible. With language models, the target distribution is high-quality completions to prompts, the filter is a reward model, and the sampling distribution is the current model.
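As an illustration of the statistical technique (a sketch, not from the text): suppose we want samples from an unnormalized bimodal target density \(p(x)\) that we can evaluate but not sample from directly. We draw from a wide Gaussian proposal \(q(x)\) and accept a draw \(x\) with probability \(p(x) / (M q(x))\), where \(M\) upper-bounds the ratio \(p(x)/q(x)\); accepted draws follow the target distribution.

```python
import numpy as np

def target_pdf(x):
    # Unnormalized target: a bimodal mixture of two Gaussians we cannot sample directly.
    return np.exp(-0.5 * (x - 2) ** 2) + np.exp(-0.5 * (x + 2) ** 2)

def proposal_pdf(x):
    # Simpler proposal we *can* sample from: a zero-mean Gaussian with scale 3.
    return np.exp(-0.5 * (x / 3.0) ** 2) / (3.0 * np.sqrt(2 * np.pi))

def rejection_sample(n_samples, M=12.0, seed=0):
    # M is chosen so that target_pdf(x) <= M * proposal_pdf(x) for all x.
    rng = np.random.default_rng(seed)
    samples = []
    while len(samples) < n_samples:
        x = rng.normal(loc=0.0, scale=3.0)   # draw from the proposal
        u = rng.uniform()
        # Accept x if it falls under the scaled target curve; otherwise reject it.
        if u < target_pdf(x) / (M * proposal_pdf(x)):
            samples.append(x)
    return np.array(samples)

samples = rejection_sample(1000)
```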
Many prominent RLHF and preference fine-tuning papers have used rejection sampling as a baseline, but no canonical implementation or documentation exists.
WebGPT [2], Anthropic’s Helpful and Harmless agent [3], OpenAI’s popular paper on process reward models [4], Llama 2 Chat models [5], and other seminal works all use this baseline.
10.1 Training Process
A visual overview of the rejection sampling process is included below in Figure 10.1.
Figure 10.1 Rejection sampling overview.
10.1.1 Generating Completions
Let’s define a set of \(M\) prompts as a vector:
\[X = [x_1, x_2, ..., x_M]\]
These prompts can come from many sources, but most commonly they come from the instruction tuning training set.
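For concreteness, here is a sketch of sampling several candidate completions for each prompt in \(X\) with the Hugging Face `transformers` library; the model name, prompt contents, and decoding parameters are illustrative assumptions rather than choices made in this chapter.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"  # hypothetical choice of current policy model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

X = ["Write a haiku about rain.", "Explain overfitting in one sentence."]  # M = 2 prompts
N = 4  # candidate completions sampled per prompt

completions = []
for prompt in X:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,            # sample stochastically rather than greedy decode
        temperature=0.8,
        max_new_tokens=128,
        num_return_sequences=N,    # N candidate completions per prompt
        pad_token_id=tokenizer.eos_token_id,
    )
    # Decoded sequences include the prompt text as a prefix.
    completions.append(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```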