
1 Overview


This chapter covers

  • Why RLHF is important and how it became so
  • An intuition for how RLHF changes models
  • An overview of RLHF and this book

Reinforcement learning from human feedback (RLHF) is a technique for incorporating human information into AI systems. RLHF emerged primarily as a method for solving problems that are hard to specify. With systems designed to be used directly by humans, such problems arise all the time because an individual’s preferences are often difficult to express. This spans every domain of content and interaction with a digital system. RLHF’s early applications were often in control problems and other traditional domains for reinforcement learning (RL), where the goal is to optimize a specific behavior to solve a task. The core question that launched the field of RLHF was: can we solve hard problems with only basic preference signals guiding the optimization process? RLHF became most widely known through the release of ChatGPT and the subsequent rapid development of large language models (LLMs) and other foundation models.

1.1 What Does RLHF Do?

1.2 An Intuition for Post-Training

1.3 How We Got Here

1.4 Future of RLHF