4 Aligning language models with reinforcement learning

 

This chapter covers

  • Reinforcement learning and its usefulness in fine-tuning large language models
  • Training a model using proximal policy optimization with a custom reward function
  • Using reinforcement learning to fine-tune a language model from real-time user feedback

Large language models (LLMs) have recently gained popularity due to their ability to generate human-like responses to text prompts. LLMs have unlocked new capabilities in applications such as web search, coding tasks, and text understanding. Research and innovation in this space have been advancing at a rapid pace, with multiple companies competing to create general-purpose foundation models, such as ChatGPT, as well as domain-specific models trained for specialized tasks.

A growing area of interest for large language models is their ability to adapt to specific use cases and domains. This ability is an advantage for organizations that wish to train and run their own LLMs in-house, both to avoid exposing personally identifiable information or sensitive data and to retain centralized control over how the model is used. A related risk is that a deployed LLM becomes stale and generates outdated information because it lacks access to data created since it was last trained. Since information in the real world changes rapidly, these models need to be adaptive.

4.1 Introduction to reinforcement learning

4.1.1 Understanding reinforcement learning

4.1.2 Proximal Policy Optimization (PPO)

4.1.3 Group Relative Policy Optimization (GRPO)

4.1.4 Direct Preference Optimization (DPO)

4.2 Tuning a language model with reinforcement learning

4.2.1 Defining a reinforcement learning problem

4.2.2 Training the model

4.3 Learning from human feedback in real time

4.3.1 Setting up the publisher and subscriber

4.3.2 Creating the chatbot application to manage user interactions

4.3.3 Using reinforcement learning to generate responses to user prompts

4.3.4 Training the model in real time based on user ratings

4.3.5 Running the application

4.4 Summary