4 Training overview
4.1 Problem Formulation
The optimization of reinforcement learning from human feedback (RLHF) builds on top of the standard RL setup. In RL, an agent takes actions, \(a\), sampled from a policy, \(\pi\), conditioned on the state of the environment, \(s\), in order to maximize reward, \(r\) [1]. Traditionally, the environment evolves according to a transition or dynamics function \(p(s_{t+1}|s_t, a_t)\). Hence, over a trajectory \(\tau\) sampled from the policy, the goal of an RL agent is to solve the following optimization:
\[J(\pi) = \mathbb{E}_{\tau \sim \pi} \left[ \sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \right],\]
where \(\gamma \in [0, 1)\) is a discount factor that balances the desirability of near-term versus future rewards. Multiple methods for optimizing this expression are discussed in Chapter 11.
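To make the objective concrete, below is a minimal sketch of estimating \(J(\pi)\) by Monte Carlo: sample trajectories by rolling out the policy in the environment and average their discounted returns. The `env` and `policy` interfaces here (a simplified `reset`/`step` pair and a callable policy) are illustrative assumptions, not the API of any particular RL library.

```python
import numpy as np


def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode's reward sequence."""
    return sum(gamma**t * r for t, r in enumerate(rewards))


def estimate_objective(env, policy, num_episodes=16, gamma=0.99):
    """Monte Carlo estimate of J(pi): average discounted return over sampled trajectories.

    Assumes a simplified environment interface where env.reset() returns a state
    and env.step(action) returns (next_state, reward, done).
    """
    returns = []
    for _ in range(num_episodes):
        state = env.reset()
        rewards, done = [], False
        while not done:
            action = policy(state)                    # a ~ pi(. | s)
            state, reward, done = env.step(action)    # s' ~ p(. | s, a)
            rewards.append(reward)
        returns.append(discounted_return(rewards, gamma))
    return float(np.mean(returns))
```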
Figure 4.1 Standard RL loop
4.1.1 Manipulating the Standard RL Setup
There are multiple core changes from the standard RL setup to that of RLHF: