2 Tuning for a specific domain
This chapter covers
- Preparing data for LLM customization
- The basics of retrieval-augmented generation
- Fine-tuning an LLM
- Alternatives to fine-tuning
Now that you understand the fundamentals of domain-specific LLMs, we’ll look at how you can customize popular open-source foundation models using your own data. Only this chapter and chapters 13 and 15 cover tuning; most of the book focuses on inference.
2.1 Data preparation
Fine-tuning a Transformer model for a given task starts with formatting your dataset for training. In this section, we’ll look at two PyTorch examples using Hugging Face Transformers (https://github.com/huggingface/transformers): an encoder-only model (BERT) and a decoder-only model (GPT-2). You’ll see that the workflow is largely the same, with only minor changes based on the model architecture and task. Finally, you’ll see how to prepare data for cases where retrieval-augmented generation (RAG) is a better choice than fine-tuning.
2.1.1 Data preparation for BERT fine-tuning
Let’s consider a classification task using the pretrained bert-base-uncased model (https://huggingface.co/bert-base-uncased). We first need to gather our data, making sure each text sample has a class label. Format the data as a list of (text, label) tuples like this:
dataset = [
    ("This movie is a masterpiece!", "Positive"),
    ("Not worth watching!", "Negative"),
    ("Terrific!", "Positive"),
    # More labeled samples go here
]
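Classification models are trained on integer class ids rather than label strings, so a common next step is to split the tuples into texts and labels and map each label to an id. The following is a minimal sketch of that step, not a definitive recipe; the names texts, labels, label2id, id2label, and label_ids are illustrative assumptions, and the three samples simply repeat the ones above.

```python
# Sample data repeated from the listing above so this sketch is self-contained.
dataset = [
    ("This movie is a masterpiece!", "Positive"),
    ("Not worth watching!", "Negative"),
    ("Terrific!", "Positive"),
]

# Separate the texts from their string labels.
texts = [text for text, _ in dataset]
labels = [label for _, label in dataset]

# Build a label-to-id mapping (hypothetical names). Sorting the unique
# labels keeps the ids reproducible from one run to the next.
label2id = {label: idx for idx, label in enumerate(sorted(set(labels)))}
id2label = {idx: label for label, idx in label2id.items()}

# One integer class id per text sample, in the same order as texts.
label_ids = [label2id[label] for label in labels]

print(label2id)
print(label_ids)
```

Keeping texts and label_ids as parallel lists makes it straightforward to pass the texts to a tokenizer later and pair the resulting encodings with their class ids.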