2 Tokenizers: How large language models see the world

 

This chapter covers

  • Creating tokens from sentences
  • Controlling vocabulary size with normalization
  • Avoiding risks in tokenization
  • Tokenization strategies to remove ambiguity

As discussed in chapter 1, in the world of artificial intelligence, it is often helpful to draw analogies to human learning to explain how machines “learn.” How you read and understand sentences is a complex process that changes as you grow older and involves multiple sequential and concurrent cognitive processes [1]. Large language models (LLMs), however, rely on much simpler mechanisms than human cognition. They use algorithms based on neural networks to capture the relationships between words in large amounts of data and then draw on those relationships to interpret and generate sentences.

Our discussion of how these algorithms work will begin with their input: sentences of text. In this chapter, we explore how these sentences are processed into the numeric inputs the model actually receives. Just as language shapes how you think and process information, an LLM’s inputs play a crucial role in determining what concepts it can represent and what tasks it can perform.
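To make this concrete before we go further, the short sketch below shows one way a sentence can be turned into tokens and token IDs. It is an illustrative example only, assuming the Hugging Face transformers library and the publicly available GPT-2 tokenizer; the sentence and variable names are invented for the illustration and are not tied to any specific model discussed later.

from transformers import AutoTokenizer

# Load a widely used tokenizer (GPT-2's byte-pair-encoding tokenizer).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

sentence = "Tokenizers split text into pieces the model can count."

# Break the sentence into subword tokens (strings)...
tokens = tokenizer.tokenize(sentence)

# ...and map each token to its integer ID in the tokenizer's vocabulary.
token_ids = tokenizer.convert_tokens_to_ids(tokens)

print(tokens)     # subword strings; a word like "Tokenizers" may be split into several pieces
print(token_ids)  # one integer per token; these IDs are what the model actually sees

Different models ship with different tokenizers, so the same sentence can yield different tokens and different IDs. That choice has consequences for what a model handles well, a point we return to in section 2.3.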

2.1 Tokens as numeric representations

2.2 Language models see only tokens

2.2.1 The tokenization process

2.2.2 Controlling vocabulary size in tokenization

2.2.3 Tokenization in detail

2.2.4 The risks of tokenization

2.3 Tokenization and LLM capabilities

2.3.1 LLMs are bad at word games

2.3.2 LLMs are challenged by mathematics

2.3.3 LLMs and language equity

2.4 Check your understanding

2.5 Tokenization in context

Summary