chapter thirteen

13 Constitutional AI and AI feedback

RL from AI Feedback (RLAIF) is a larger set of techniques for using AI to augment or generate feedback data, including pairwise preferences [1] [2] [3]. There are many motivations for using RLAIF to either entirely replace human feedback or augment it. AI models are far cheaper than humans: a single piece of human preference data costs on the order of $1 or higher (or even above $10 per prompt), while AI feedback from a frontier AI model such as GPT-4o costs less than $0.01. This cost difference opens experimentation with RLHF methods to an entire population of people previously priced out. Other than price, AI feedback introduces different performance tradeoffs than human feedback, which are still being investigated. The peak performance of AI feedback is at least in the same ballpark as human data on skill-based evaluations, but it has not yet been studied whether human data allows finer control of models in real-world product settings or for newer training methods such as character training.
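To make the cost gap concrete, the following is a minimal back-of-the-envelope sketch using the per-label figures quoted above. The dataset size of 10,000 comparisons and the helper name `labeling_cost` are assumptions chosen only for illustration. The human price is treated as a lower bound and the AI price as an upper bound, so the resulting ratio is a rough floor rather than a measured value.

```python
# Back-of-the-envelope comparison of labeling budgets.
# Per-label prices come from the text; the dataset size is a hypothetical.

def labeling_cost(num_comparisons: int, cost_per_label_usd: float) -> float:
    """Total cost in USD of labeling `num_comparisons` preference pairs."""
    return num_comparisons * cost_per_label_usd

NUM_COMPARISONS = 10_000        # assumed dataset size, for illustration only
HUMAN_COST_PER_LABEL = 1.00     # "on the order of $1 or higher" (lower bound)
AI_COST_PER_LABEL = 0.01        # "less than $0.01" (upper bound)

human_total = labeling_cost(NUM_COMPARISONS, HUMAN_COST_PER_LABEL)
ai_total = labeling_cost(NUM_COMPARISONS, AI_COST_PER_LABEL)
ratio = human_total / ai_total

print(f"Human labels: ${human_total:,.0f}")
print(f"AI labels:    ${ai_total:,.0f}")
print(f"Human feedback is roughly {ratio:.0f}x more expensive")
```

Under these assumptions, a dataset that would require a five-figure budget with human annotators fits within a budget of about one hundred dollars with AI feedback, which is the sense in which people previously priced out can now experiment.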

13.1 Constitutional AI

13.2 Specific LLMs for Judgement

13.3 Further Reading