Appendix B. Exercise solutions

The complete code for the exercise solutions can be found in the supplementary GitHub repository at https://github.com/rasbt/reasoning-from-scratch.

B.1 Chapter 2

Exercise 2.1

You can use a prompt such as "Hello, Ardwarklethyrx. Haus und Garten.", which contains a made-up word ("Ardwarklethyrx") and three words in a non-English language, here German ("Haus und Garten"):

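# Assumes `tokenizer` is the tokenizer instance loaded in chapter 2
prompt = "Hello, Ardwarklethyrx. Haus und Garten."
input_token_ids_list = tokenizer.encode(prompt)
# Decode each token ID on its own to see how the prompt was split
for token_id in input_token_ids_list:
    print(f"{[token_id]} --> {tokenizer.decode([token_id])}")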

The output is:

[9707] --> Hello
[11] --> ,
[1644] -->  Ar
[29406] --> dw
[838] --> ark
[273] --> le
[339] --> th
[10920] --> yr
[87] --> x
[13] --> .
[47375] -->  Haus
[2030] -->  und
[93912] -->  Garten
[13] --> .

As we can see, unknown words are broken into smaller subword pieces or even single characters (here, "x"); this allows the tokenizer and the LLM to handle arbitrary input.

The German words are not broken down into subwords or characters here; each one maps to a single token. This suggests that the tokenizer saw German text during training. It also suggests that the LLM was likely trained on German text and should be able to handle at least some non-English languages reasonably well.
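
To make the fallback behavior more concrete, the following is a minimal toy sketch and not the algorithm used by the book's tokenizer. It assumes a tiny hand-picked vocabulary (toy_vocab) and a hypothetical helper (toy_tokenize) that greedily matches the longest known piece and falls back to a single character when nothing matches. The vocabulary was chosen so that the split mirrors the output above; a real byte-pair-encoding tokenizer learns its vocabulary and merge rules from data instead.

```python
# Toy illustration only: greedy longest-match splitting over a hand-picked
# vocabulary. This is NOT the BPE procedure of the tokenizer used in the book.

def toy_tokenize(text, vocab):
    tokens = []
    pos = 0
    while pos < len(text):
        # Try the longest candidate first, then shrink until something matches
        for end in range(len(text), pos, -1):
            piece = text[pos:end]
            if piece in vocab:
                break
        else:
            # Nothing in the vocabulary matched: fall back to one character
            piece = text[pos]
        tokens.append(piece)
        pos += len(piece)
    return tokens


# Hypothetical vocabulary: whole German words are present, the made-up word is not
toy_vocab = {
    "Hello", ",", ".", " Haus", " und", " Garten",
    " Ar", "dw", "ark", "le", "th", "yr",
}

prompt = "Hello, Ardwarklethyrx. Haus und Garten."
for piece in toy_tokenize(prompt, toy_vocab):
    print(repr(piece))
```

Because every character can serve as a last-resort token, the pieces always concatenate back to the original prompt, which is the property that lets a subword tokenizer accept any input string.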

Exercise 2.2

The updated generate_text_basic function looks as follows:

Exercise 2.3