1 Introduction
Reinforcement learning from human feedback (RLHF) is a technique used to incorporate human information into AI systems. RLHF emerged primarily as a method for solving hard-to-specify problems. Its early applications were often in control problems and other traditional domains for reinforcement learning (RL). RLHF became most widely known through the release of ChatGPT and the subsequent rapid development of large language models (LLMs) and other foundation models.
The basic pipeline for RLHF involves three steps. First, a language model that can follow user instructions must be trained (see Chapter 9). Second, human preference data must be collected to train a reward model of human preferences (see Chapter 7). Finally, the language model can be optimized with an RL optimizer of choice, by sampling generations and scoring them with the reward model (see Chapters 3 and 11). This book details key decisions and basic implementation examples for each step in this process.
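To make the three steps concrete, below is a minimal Python sketch of this loop. The model, reward model, and update functions are hypothetical stubs standing in for the real implementations covered in later chapters; they only illustrate how the pieces fit together.

```python
# Minimal sketch of the three-step RLHF pipeline described above.
# All function names here are hypothetical placeholders, not a real API.

from typing import Callable, List

# Step 1: start from an instruction-following model obtained via supervised
# fine-tuning (SFT). Stubbed as a function mapping a prompt to a response.
def sft_model(prompt: str) -> str:
    return f"response to: {prompt}"

# Step 2: a reward model trained on human preference data. It scores how well
# a response satisfies the prompt (higher is better). Stubbed as string length.
def reward_model(prompt: str, response: str) -> float:
    return float(len(response))

# Step 3: optimize the policy (the language model) against the reward model
# with an RL algorithm of choice (e.g., PPO). The update itself is stubbed.
def rl_update(policy: Callable[[str], str], rewards: List[float]) -> None:
    print(f"policy update with mean reward {sum(rewards) / len(rewards):.2f}")

prompts = ["Summarize this article.", "Write a short poem about RLHF."]
for step in range(3):  # a few RL optimization steps
    # Sample generations from the current policy, then score them with the
    # reward model and update the policy toward higher-reward behavior.
    responses = [sft_model(p) for p in prompts]
    rewards = [reward_model(p, r) for p, r in zip(prompts, responses)]
    rl_update(sft_model, rewards)
```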
RLHF has been successfully applied to many domains, with complexity increasing as the techniques have matured. Early breakthrough experiments with RLHF were applied to deep reinforcement learning [1], summarization [2], following instructions [3], parsing web information for question answering [4], and "alignment". A summary of the early RLHF recipes is shown below in Figure 1.1.
Figure 1.1 A rendition of the early, three-stage RLHF process with SFT, a reward model, and then optimization.