9 Creating an LLM project: Reimplementing Llama 3
This chapter covers
- Implementing Meta’s Llama 3 model
- Training a simple LLM
- Making improvements to it to prepare it for production
- Serving the model to a production endpoint you can share with your friends
I am only coming to Princeton to research, not to teach. There is too much education altogether, especially in American schools. The only rational way of educating is to be an example.
—Albert Einstein
For the first major project in the book, we start from scratch. We’ve been showing you how to work with LLMs from end to end, and in this chapter we put it all together. The project includes pretraining a model, roughly following a research paper. We won’t dive too deeply into the actual research; in fact, we’ll take several shortcuts, as research isn’t the focus of this book. We will, however, show how to train the model, prepare it for serving with quantization, fine-tune it with low-rank adaptation (LoRA) for a specific purpose or task, and deploy it to a production environment you can show off to your friends.
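To preview one of the techniques on that list, here is a minimal sketch of the core LoRA idea in plain NumPy: instead of updating a large pretrained weight matrix, we train two small low-rank matrices whose product is added to it. The names, dimensions, and scaling here are illustrative toy values, not anything from the Llama 3 codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions; real transformer weight matrices are thousands wide.
d_in, d_out, rank = 8, 8, 2
W = rng.normal(size=(d_out, d_in))   # frozen pretrained weight (not trained)

# LoRA trains only A and B. B starts at zero, so before any fine-tuning
# the adapted layer behaves exactly like the pretrained one.
A = rng.normal(size=(rank, d_in))
B = np.zeros((d_out, rank))
alpha = 4.0                          # common scaling hyperparameter

def forward(x):
    # Adapted forward pass: W @ x + (alpha / rank) * B @ A @ x
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
assert np.allclose(forward(x), W @ x)  # identical output before training B
```

The payoff is in the parameter counts: `A` and `B` together hold `rank * (d_in + d_out)` trainable values versus `d_in * d_out` in `W`, which is why LoRA fine-tuning fits on much smaller hardware than full fine-tuning. We will revisit this properly later in the chapter.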
This chapter is dense, but by this point you should be more than prepared to meet the challenge: it is essentially a data scientist’s project taken all the way to production. We chose it so that you can bring together the lessons from throughout the book in one place and leave with hands-on, end-to-end experience.