
4 Data Engineering for Large Language Models: Setting Up for Success


This chapter covers

  • Common foundation models used in the industry
  • How to evaluate and compare Large Language Models
  • Different data sources and how to prepare your own
  • Creating your own custom tokenizers and embeddings
  • Preparing a Slack dataset to be used in future chapters

Creating our own LLM is no different from any other ML project in that we start by preparing our assets, and there is no asset more valuable than our data. All successful AI and ML initiatives are built on a solid data engineering foundation. It's important, then, that we acquire, clean, prepare, and curate our data.

In addition, unlike with other ML models, you generally won't start from scratch when creating an LLM customized for your specific task. Even if you do start from scratch, you'll likely only do it once; from then on, it's best to tweak and polish that model to further refine it for your specific needs. Selecting the right base model can make or break your project. Figure 4.1 gives a high-level overview of the different pieces and assets you'll need to prepare before training or finetuning a new model.

Figure 4.1 The different elements of training an LLM. Combining earth, fire, water… wait, no, not those elements. To get started, you'll need to collect several assets, including a foundation model, training data, text encoders (e.g., a tokenizer), and evaluation data.
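
To make these assets concrete, here is a minimal sketch of gathering two of them, a base model and its matching tokenizer, assuming you build on an existing foundation model with the Hugging Face transformers library. The "gpt2" checkpoint is used purely as an illustration; swap in whichever base model you end up selecting.

# A minimal sketch: loading a foundation model and its tokenizer.
# Assumes the Hugging Face transformers library is installed and
# uses the public "gpt2" checkpoint purely as an illustration.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # illustration only; substitute your chosen base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Quick sanity check that the two assets load and work together
inputs = tokenizer("Data engineering is", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Loading the model and tokenizer together like this is a useful habit: the tokenizer must match the checkpoint it was trained with, which is exactly the kind of asset pairing this chapter helps you prepare.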

4.1 Models Are the Foundation

4.2 Evaluating LLMs

4.2.1 Metrics for Evaluating Text

4.2.2 Industry Benchmarks

4.2.3 Responsible AI Benchmarks

4.2.4 Develop Your Own Benchmark

4.2.5 Evaluating Code Generators

4.2.6 Evaluating Model Parameters

4.3 Data for LLMs

4.3.1 Datasets You Should Know

4.3.2 Data Cleaning and Preparation

4.4 Text Processors

4.4.1 Tokenization

4.4.2 Embeddings

4.5 Preparing a Slack Dataset

4.6 Summary