9 Instruction finetuning
Early language models were trained only to predict the next token in a sequence and were not adapted to any specific task. Around the release of GPT-3 [1], language models were still used primarily via in-context learning, where examples of a task are included in the prompt and the model is then asked to complete a similar instance.
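To make this concrete, here is a minimal sketch of a few-shot in-context learning prompt for sentiment classification; the reviews, labels, and template are illustrative, not drawn from any specific paper:

```python
# A few-shot in-context learning prompt: task examples are concatenated
# into the prompt and the model is asked to continue the pattern.
# No weights are updated; the "learning" happens purely in context.

examples = [
    ("The movie was a complete waste of time.", "negative"),
    ("An absolute triumph of filmmaking.", "positive"),
]
query = "I could not stop smiling the whole way through."

prompt = ""
for text, label in examples:
    prompt += f"Review: {text}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # fed to the model as-is; it is expected to emit "positive"
```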
This was the combination of two trends. Historically in the natural language processing (NLP) literature, models were trained for a single, specific task. Here, with GPT-3 as one example of bigger models generalizing better, multiple results showed that standardizing the format of task data can dramatically improve downstream performance. Prominent examples of unifying the framework for tasks include Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (T5 models) [2], Finetuned Language Models Are Zero-Shot Learners (FLAN dataset) [3], Multitask Prompted Training Enables Zero-Shot Task Generalization (T0 models) [4], and Cross-Task Generalization via Natural Language Crowdsourcing Instructions (Natural Instructions dataset) [5]. These insights led to the era of finetuning language models. Until RLHF and related methods emerged, all finetuning was instruction finetuning (IFT), also known as supervised finetuning (SFT).
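The core idea behind this unification can be sketched in a few lines; the templates and examples below are hypothetical stand-ins for the kinds of templates used by T5, FLAN, and T0, not the actual ones:

```python
# A minimal sketch of task unification: heterogeneous NLP tasks are
# rewritten with natural-language templates into a single
# (instruction, response) text-to-text format.

def to_instruction_pair(task: str, example: dict) -> tuple[str, str]:
    """Map a raw task example to an (instruction, response) text pair."""
    if task == "sentiment":
        return (
            "Classify the sentiment of this review as positive or negative:\n"
            f"{example['text']}",
            example["label"],
        )
    if task == "translation":
        return (
            f"Translate the following sentence to French:\n{example['source']}",
            example["target"],
        )
    raise ValueError(f"no template for task {task!r}")

mixed_dataset = [
    ("sentiment", {"text": "A joy from start to finish.", "label": "positive"}),
    ("translation", {"source": "The cat sleeps.", "target": "Le chat dort."}),
]

# Every task now looks the same to the model: predict the response given
# the instruction. Applying standard next-token cross-entropy to the
# response tokens is the instruction finetuning (supervised finetuning) loss.
for task, ex in mixed_dataset:
    instruction, response = to_instruction_pair(task, ex)
    print(instruction, "->", response)
```

Because every task is reduced to the same text-to-text shape, a single model and a single training objective can cover the whole mixture, which is what enabled the zero-shot generalization results reported for FLAN and T0.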