3 Model families and architecture variants
This chapter covers
- Encoder-only and decoder-only model architectures and their typical use cases
- Embedding models and their role in retrieval
- Mixture of experts architectures for scalable compute
The transformer, in its original encoder–decoder form, has proven remarkably versatile, and many architectural variants and model families have evolved from that foundational design. These variations on the basic transformer are engineered for specific goals such as efficient retrieval, large-scale generation, or scalable compute via expert routing.
We’ll distinguish between decoder-only and encoder-only models, analyzing how their internal configurations influence their suitability for tasks such as classification, language generation, and translation. Then we’ll examine more advanced configurations, such as mixture of experts (MoE) models and embedding models.
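As a quick preview of the distinction we'll develop, here is a minimal sketch (assuming the Hugging Face transformers library is installed; the model names are illustrative examples, not a prescribed choice) showing how the two families are typically instantiated and what each is used for:

```python
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Encoder-only: produces contextual embeddings for every input token,
# well suited to classification and retrieval (BERT-style models).
encoder = AutoModel.from_pretrained("bert-base-uncased")

# Decoder-only: predicts the next token autoregressively,
# well suited to language generation (GPT-style models).
decoder = AutoModelForCausalLM.from_pretrained("gpt2")

# Generate a short continuation with the decoder-only model.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
inputs = tokenizer("Transformers are", return_tensors="pt")
outputs = decoder.generate(**inputs, max_new_tokens=10)
print(tokenizer.decode(outputs[0]))
```

Note that the encoder returns hidden states rather than text, which is why encoder-only models pair naturally with a task-specific head or an embedding-based retrieval pipeline; we'll return to both patterns later in the chapter.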