3 Model families and architecture variants

 

This chapter covers

  • Typical use cases for decoder-only and encoder-only transformer architectures
  • How encoder-only and decoder-only model architectures work
  • Embedding models and their role in retrieval
  • Mixture of Experts architectures for scalable compute

The transformer architecture, in its original encoder-decoder form, has proven remarkably versatile, and many architectural variants and model families have evolved from that foundational design. These variations on the basic transformer are strategically selected and engineered for specific tasks such as efficient retrieval, large-scale generation, or scalable compute via expert routing.

We’ll distinguish between decoder-only and encoder-only models, analyzing how their internal configurations influence their suitability for tasks such as classification, language generation, and translation. Then we’ll look at some more advanced configurations, such as Mixture of Experts (MoE) models and embedding models.
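To make that contrast concrete before we dive in, here is a minimal sketch using the Hugging Face transformers library (an assumed toolchain for illustration, with gpt2 standing in for a decoder-only model and bert-base-uncased for an encoder-only one). The decoder-only pipeline continues a prompt token by token, while the encoder-only pipeline fills in a masked token, the pretraining objective we return to in section 3.3.1.

```python
# Minimal sketch, assuming the Hugging Face `transformers` library and two
# illustrative checkpoints: "gpt2" (decoder-only) and "bert-base-uncased"
# (encoder-only). Not the chapter's canonical example, just a quick contrast.
from transformers import pipeline

# Decoder-only: generates text left to right, one new token at a time.
generator = pipeline("text-generation", model="gpt2")
print(generator("The transformer architecture", max_new_tokens=20)[0]["generated_text"])

# Encoder-only: reads the whole input at once and predicts the masked token,
# which is why this family suits classification and retrieval rather than
# open-ended generation.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")
print(fill_mask("The transformer is a type of neural [MASK].")[0]["token_str"])
```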

3.1 Decoder-only models

3.2 The decoder-only architecture

3.3 Encoder-only models

3.3.1 Masked language modeling as a pretraining strategy

3.4 Embedding models and RAG

3.5 Mixture of experts in large language models

3.6 How MoE works

3.7 Other variations

3.8 Summary