
9 Rejection Sampling

 

This chapter covers

  • What rejection sampling is
  • Why rejection sampling may be used instead of RL

Rejection Sampling (RS) is a simple and popular baseline for preference fine-tuning. It is one of a handful of methods applied after a first round of instruction tuning to further align the model with human preferences. Rejection sampling works by generating new candidate completions, filtering them with a trained reward model, and then instruction fine-tuning the original model on only the top-scoring completions (using the same loss function as a dedicated instruction-tuning stage).
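
Below is a minimal sketch of this recipe, not a reference implementation. The `generate` and `reward` functions are hypothetical placeholders standing in for the current language model and a trained reward model, and keeping the top-k scored completions per prompt is one common filtering choice rather than a fixed standard.

```python
import random

def generate(prompt: str, n: int) -> list[str]:
    # Placeholder: a real system would sample n completions from the current model.
    return [f"{prompt} -> completion #{i} ({random.random():.3f})" for i in range(n)]

def reward(prompt: str, completion: str) -> float:
    # Placeholder: a real system would score (prompt, completion) with a trained reward model.
    return random.random()

def build_sft_dataset(prompts: list[str], n_samples: int = 8, top_k: int = 1) -> list[tuple[str, str]]:
    """For each prompt, sample candidates, score them, and keep only the top_k."""
    dataset = []
    for prompt in prompts:
        candidates = generate(prompt, n_samples)
        scored = sorted(candidates, key=lambda c: reward(prompt, c), reverse=True)
        dataset.extend((prompt, c) for c in scored[:top_k])
    return dataset

if __name__ == "__main__":
    sft_data = build_sft_dataset(["Explain rejection sampling."])
    for prompt, completion in sft_data:
        print(prompt, "|", completion)
```

The resulting (prompt, completion) pairs would then be used for standard instruction fine-tuning of the original model, mirroring the generate / score / fine-tune steps detailed in section 9.1.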

The name originates from computational statistics [1], where one wishes to sample from a complex distribution but has no direct method to do so. Instead, one draws samples from a simpler distribution that is easy to sample from and applies an accept/reject rule to decide whether each draw is kept. With language models, the target distribution is high-quality completions to prompts, the filter is a reward model, and the sampling distribution is the current model.
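
To make the analogy concrete, here is a small sketch of the classical statistical technique. The target density (Beta(2, 2)), the uniform proposal, the bound M = 1.5, and the function names are all choices made for this illustration, not anything prescribed by the chapter.

```python
import random

def target_density(x: float) -> float:
    # Beta(2, 2) density f(x) = 6x(1 - x) on [0, 1]; its maximum is 1.5 at x = 0.5.
    return 6.0 * x * (1.0 - x)

def rejection_sample(n: int, m_bound: float = 1.5) -> list[float]:
    samples: list[float] = []
    while len(samples) < n:
        x = random.random()  # propose from Uniform(0, 1), i.e. q(x) = 1
        u = random.random()  # uniform threshold for the accept/reject step
        if u <= target_density(x) / m_bound:
            samples.append(x)  # accepted draws follow the target distribution
    return samples

if __name__ == "__main__":
    draws = rejection_sample(10_000)
    print(f"sample mean: {sum(draws) / len(draws):.3f} (target mean is 0.5)")
```

Accepted draws are distributed according to the target, at the cost of discarding some proposals; with language models the "proposals" are sampled completions and the reward model plays the role of the acceptance filter.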

Many prominent RLHF and preference fine-tuning papers have used rejection sampling as a baseline, but no canonical implementation or documentation exists.

9.1 Training Process

9.1.1 Generating Completions

9.1.2 Scoring Completions

9.1.3 Fine-tuning

9.2 Implementation Details