2 A Tiny History of RLHF
This chapter covers

  • The three eras of RLHF’s recent history
  • Seminal models and papers that shaped RLHF

RLHF and its related methods are very new. We highlight this history to show how recently these techniques were formalized, and how much of their documentation still lives in the academic literature rather than in textbooks. Some details covered here will change, but the core practices are stable. The papers listed here also show why the RLHF pipeline looks the way it does: some of the seminal work targeted applications entirely distinct from modern language models.

In this chapter we detail the key papers and projects that brought the RLHF field to where it is today. This is not intended to be a comprehensive review of RLHF and its related fields, but rather a starting point, intentionally focused on the recent work that led to ChatGPT. There is substantial further work in the RL literature on learning from preferences [1]; for a more exhaustive account, consult a survey paper [2], [3].

Figure 2.1 Timeline of key developments in RLHF discussed in this chapter, from early work on RL from preferences through the adoption of RLHF in large language models.

2.1 Origins to 2018: RL on Preferences

2.2 2019 to 2022: RL from Human Preferences on Language Models

2.3 2023 to Present: ChatGPT Era

Summary