
10 Rejection Sampling

Rejection Sampling (RS) is a popular and simple baseline for performing preference fine-tuning. Rejection sampling operates by curating new candidate completions, filtering them based on a trained reward model, and then fine-tuning the original model only on the top completions.

The name originates from computational statistics [1], where one wishes to sample from a complex distribution but does not have a direct method to do so. To alleviate this, one samples from a simpler distribution that is easy to draw from and uses a heuristic to check whether each sample is permissible. With language models, the target distribution is high-quality completions to prompts, the filter is a reward model, and the sampling distribution is the current model.
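
For intuition, here is a minimal sketch of rejection sampling in the classical statistical sense; the target and proposal densities are illustrative choices made for this example only:

```python
import math
import random

# Target: an unnormalized truncated Gaussian on [-3, 3].
def p(x: float) -> float:
    return math.exp(-0.5 * x * x)

# Proposal: uniform on [-3, 3], which is trivial to sample from.
def sample_q() -> float:
    return random.uniform(-3.0, 3.0)

Q_DENSITY = 1.0 / 6.0  # density of the uniform proposal on [-3, 3]
M = 6.0                # envelope constant so that p(x) <= M * q(x) everywhere

def rejection_sample() -> float:
    while True:
        x = sample_q()
        u = random.uniform(0.0, 1.0)
        # Accept with probability p(x) / (M * q(x)); otherwise reject and retry.
        if u < p(x) / (M * Q_DENSITY):
            return x

samples = [rejection_sample() for _ in range(10_000)]
```

In the language-model analogy, the accept/reject test is played by the reward model, and rather than resampling until acceptance, one keeps only the top-scoring candidates.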

Many prominent RLHF and preference fine-tuning papers have used rejection sampling as a baseline, but no canonical implementation or documentation exists.

WebGPT [2], Anthropic’s Helpful and Harmless agent [3], OpenAI’s popular paper on process reward models [4], Llama 2 Chat models [5], and other seminal works all use this baseline.

10.1 Training Process

A visual overview of the rejection sampling process is included below in Figure 10.1.

Figure 10.1 Rejection sampling overview.

10.1.1 Generating Completions

Let’s define a set of \(M\) prompts as a vector:

\[X = [x_1, x_2, ..., x_M]\]

These prompts can come from many sources, but most commonly they are drawn from the instruction training set.
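
As a concrete sketch of the generation step, the snippet below samples N candidate completions per prompt from the current model using Hugging Face transformers; the checkpoint name, prompts, and sampling parameters are placeholders rather than recommended settings:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "path/to/sft-checkpoint"  # placeholder: the current policy model
N = 4                                  # candidate completions sampled per prompt

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token  # some checkpoints ship without a pad token

prompts = [
    "Explain rejection sampling in one paragraph.",
    "Write a haiku about reward models.",
]  # stands in for X = [x_1, ..., x_M]

completions: dict[str, list[str]] = {}
for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(
        **inputs,
        do_sample=True,          # sample (not greedy decode) to get diverse candidates
        temperature=0.7,
        top_p=0.9,
        max_new_tokens=256,
        num_return_sequences=N,
    )
    prompt_len = inputs["input_ids"].shape[1]
    # Keep only the newly generated tokens for each of the N candidates.
    completions[prompt] = [
        tokenizer.decode(seq[prompt_len:], skip_special_tokens=True) for seq in outputs
    ]
```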

10.1.2 Selecting Top-N Completions
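
The overview at the start of the chapter describes filtering candidates with a trained reward model and keeping only the top completions. A minimal sketch of per-prompt selection is below; `reward_model` is a hypothetical callable that returns a scalar score for a (prompt, completion) pair:

```python
from typing import Callable, Dict, List

def select_top_completions(
    completions: Dict[str, List[str]],
    reward_model: Callable[[str, str], float],  # hypothetical: (prompt, completion) -> scalar reward
    top_n: int = 1,
) -> Dict[str, List[str]]:
    """Keep the top_n highest-reward completions for each prompt."""
    selected = {}
    for prompt, candidates in completions.items():
        ranked = sorted(candidates, key=lambda c: reward_model(prompt, c), reverse=True)
        selected[prompt] = ranked[:top_n]
    return selected
```

Whether "top" means the single best completion per prompt or a pool of the highest-scoring completions across all prompts is an implementation choice; the sketch above takes the per-prompt view.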

10.1.3 Fine-tuning

10.1.4 Details