Appendix A. Definitions
This appendix includes all the definitions, symbols, and operations frequently used in the RLHF process, along with a quick overview of language modeling, which is the guiding application of this book.
A.1 Language Modeling Overview
The majority of modern language models are trained to learn the joint probability distribution of sequences of tokens (words, subwords, or characters) in an autoregressive manner. Autoregression simply means that each prediction depends on the previous tokens in the sequence. Given a sequence of tokens \(x = (x_1, x_2, \ldots, x_T)\), the model factorizes the probability of the entire sequence into a product of conditional distributions:
Equation A.1
\[\label{eq:llming} P_{\theta}(x) = \prod_{t=1}^{T} P_{\theta}(x_{t} \mid x_{1}, \ldots, x_{t-1}).\]

For example, for a three-token sequence this reads \(P_{\theta}(x) = P_{\theta}(x_1)\,P_{\theta}(x_2 \mid x_1)\,P_{\theta}(x_3 \mid x_1, x_2)\). To fit a model that accurately predicts sequences under this factorization, the goal is to maximize the likelihood of the training data under the current model. To do so, we can minimize a negative log-likelihood (NLL) loss:
Equation A.2
\[\label{eq:nll} \mathcal{L}_{\text{LM}}(\theta) = -\mathbb{E}_{x \sim \mathcal{D}}\left[\sum_{t=1}^{T} \log P_{\theta}\left(x_t \mid x_{<t}\right)\right].\]

In practice, this objective is implemented as a cross-entropy loss over each next-token prediction, comparing the model's predicted distribution at each position to the true next token in the sequence.
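To make the correspondence between Equation A.2 and the cross-entropy implementation concrete, below is a minimal sketch in PyTorch. The `model` and `tokens` names are hypothetical placeholders, not from this book; any autoregressive model that maps token ids to per-position logits over the vocabulary fits this pattern.

```python
import torch
import torch.nn.functional as F

def language_modeling_loss(model, tokens: torch.Tensor) -> torch.Tensor:
    # `model` is a hypothetical autoregressive LM mapping token ids of
    # shape (batch, seq_len) to logits of shape (batch, seq_len, vocab).
    logits = model(tokens)

    # Each position t predicts token t+1, so shift predictions and targets:
    # logits at positions 1..T-1 predict the true tokens at positions 2..T.
    logits = logits[:, :-1, :]   # predictions for x_2, ..., x_T
    targets = tokens[:, 1:]      # true next tokens x_2, ..., x_T

    # Cross-entropy over the vocabulary equals the negative log-likelihood
    # of the true next token, here averaged over all predicted positions.
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
    )
```

Minimizing this averaged cross-entropy is equivalent to minimizing the summed NLL in Equation A.2, up to a constant factor of the number of predicted tokens.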