Appendix B. Exercise solutions
The complete code examples for the exercise solutions can be found in the supplementary GitHub repository at https://github.com/rasbt/reasoning-from-scratch.
B.1 Chapter 2
Exercise 2.1
We can use a prompt such as "Hello, Ardwarklethyrx. Haus und Garten.", which contains a made-up word ("Ardwarklethyrx") and three German words ("Haus und Garten", meaning "house and garden"):
prompt = "Hello, Ardwarklethyrx. Haus und Garten."
input_token_ids_list = tokenizer.encode(prompt)
# Print each token ID alongside the text it decodes back to
for i in input_token_ids_list:
    print(f"{[i]} --> {tokenizer.decode([i])}")
The output is:
[9707] --> Hello
[11] --> ,
[1644] --> Ar
[29406] --> dw
[838] --> ark
[273] --> le
[339] --> th
[10920] --> yr
[87] --> x
[13] --> .
[47375] --> Haus
[2030] --> und
[93912] --> Garten
[13] --> .
As we can see, the unknown word is broken into smaller subword pieces or even single characters; this allows the tokenizer (and thus the LLM) to handle any input.
The German words, on the other hand, are not broken down into subwords or characters here, which suggests that the tokenizer encountered German text during training. This, in turn, suggests that the LLM was likely trained on German text as well and should be able to handle at least some non-English languages reasonably well.
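As a quick sanity check, we can also decode the full token-ID list back into text; with a subword tokenizer like the one used here, this should reconstruct the original prompt exactly, which is what makes the subword fallback for unknown words lossless. The following minimal sketch reuses the prompt, input_token_ids_list, and tokenizer objects from the listing above:
reconstructed_text = tokenizer.decode(input_token_ids_list)
# If the tokenizer round-trips losslessly, this prints True
print(reconstructed_text == prompt)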
Exercise 2.2
The updated generate_text_basic function, now called generate_text_basic_stream, looks as follows: