3 Definitions & background
This chapter presents the definitions, symbols, and operations used frequently throughout the RLHF process, along with an overview of language models, the most common optimization target of this book.
3.1 Language Modeling Overview
The majority of modern language models are trained to learn the joint probability distribution of sequences of tokens (words, subwords, or characters) in an autoregressive manner. Autoregression means that each prediction depends on the previous tokens in the sequence. Given a sequence of tokens \(x = (x_1, x_2, \ldots, x_T)\), the model factorizes the probability of the entire sequence into a product of conditional distributions:
\[P_{\theta}(x) = \prod_{t=1}^{T} P_{\theta}(x_t \mid x_{1}, \ldots, x_{t-1}).\] To fit a model to this distribution, the standard objective is to maximize the likelihood of the training data under the current model. Equivalently, we can minimize a negative log-likelihood (NLL) loss:
\[\mathcal{L}_{\text{LM}}(\theta)=-\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T}\log P_{\theta}\left(x_t \mid x_{<t}\right)\right].\]
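To make this concrete, below is a minimal sketch of the NLL computation for a single sequence, written in PyTorch (a framework choice assumed here, not prescribed by the book). The function name `sequence_nll` and the tensor shapes are hypothetical, chosen only for illustration.

```python
import torch
import torch.nn.functional as F

def sequence_nll(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Summed negative log-likelihood of one token sequence.

    logits: (T, V) tensor of next-token logits, where logits[t]
            predicts the distribution of tokens[t + 1].
    tokens: (T,) tensor of ground-truth token ids.
    """
    # Score position t's logits against the true token at position t + 1.
    log_probs = F.log_softmax(logits[:-1], dim=-1)  # (T-1, V)
    targets = tokens[1:].unsqueeze(-1)              # (T-1, 1)
    # Gather log P(x_t | x_<t) for each realized next token.
    token_log_probs = log_probs.gather(-1, targets).squeeze(-1)
    return -token_log_probs.sum()
```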
In practice, this is implemented as a cross-entropy loss over each next-token prediction, comparing the model's predicted distribution against the true next token in the sequence; with a one-hot target, the cross-entropy reduces exactly to the NLL above.
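Deep learning frameworks provide this cross-entropy computation directly. Continuing the hypothetical sketch above, the built-in call matches the manual NLL:

```python
# Toy inputs standing in for a model's output on one sequence.
T, V = 8, 50  # hypothetical sequence length and vocabulary size
logits = torch.randn(T, V)
tokens = torch.randint(V, (T,))

# Cross-entropy over the shifted next-token predictions is the same NLL;
# reduction="sum" matches the summed loss in sequence_nll above.
nll_via_ce = F.cross_entropy(logits[:-1], tokens[1:], reduction="sum")
assert torch.allclose(nll_via_ce, sequence_nll(logits, tokens))
```

In training code the same loss is typically averaged over tokens and batches rather than summed, which only rescales the gradient.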