3 Model families and architecture variants

 

This chapter covers

  • Typical use cases for decoder-only and encoder-only transformer architectures
  • How encoder-only and decoder-only model architectures work
  • Embedding models and their role in retrieval
  • Mixture of Experts architectures for scalable compute

The transformer architecture, in its original encoder-decoder form, has proven remarkably versatile, and many architectural variants and model families have evolved from that foundational design. These variations on the basic transformer are strategically selected and engineered for specific tasks such as efficient retrieval, large-scale generation, or scalable compute via expert routing.

We’ll distinguish between decoder-only and encoder-only models, analyzing how their internal configurations influence their suitability for tasks such as classification, language generation, and translation. Then we’ll look at some more advanced configurations, such as Mixture of Experts (MoE) models and embedding models.
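To make that contrast concrete before we dive in, here is a minimal sketch using the Hugging Face transformers library (an assumed toolchain for illustration, with gpt2 standing in for a decoder-only model and bert-base-uncased for an encoder-only one). The decoder-only pipeline continues a prompt token by token, while the encoder-only pipeline fills in a masked token, the pretraining objective we return to in section 3.3.1.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and two
# illustrative checkpoints: "gpt2" (decoder-only) and "bert-base-uncased"
# (encoder-only). Not the chapter's canonical example, just a quick contrast.
from transformers import pipeline

# Decoder-only: generates text left to right, one new token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("The transformer architecture", max_new_tokens=20)[0]["generated_text"])

# Encoder-only: reads the whole input at once and predicts the masked token,
# which is why this family suits classification and retrieval rather than
# open-ended generation.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The transformer is a type of neural [MASK].")[0]["token_str"])
```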

3.1 Decoder-only models

3.2 The decoder-only architecture

3.3 Encoder-only models

3.3.1 Masked language modeling as a pretraining strategy

3.4 Embedding models and RAG

3.5 Mixture of experts in large language models

3.6 How MoE works

3.7 Other variations

3.8 Summary