Appendix C Exercise solutions
The complete code for the exercise solutions can be found in the supplementary GitHub repository at https://github.com/rasbt/LLMs-from-scratch.
Chapter 2
Exercise 2.1
You can obtain the individual token IDs by passing one substring at a time to the tokenizer's encode method:
print(tokenizer.encode("Ak"))
print(tokenizer.encode("w"))
# ...
This prints
[33901]
[86]
# ...
You can then use the following code to assemble the original string:
print(tokenizer.decode([33901, 86, 343, 86, 220, 959]))
This returns
'Akwirw ier'
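The round trip above can be sketched as a small helper that encodes each piece separately and then decodes the concatenated IDs. This is only an illustration: `ToyTokenizer` and `encode_pieces` are hypothetical names, the toy vocabulary stands in for the chapter's real tokenizer so the snippet runs on its own, and the pairing of the substrings "ir", " ", and "ier" with the IDs 343, 220, and 959 is an assumption, since the solution only prints the first two pieces.

```python
class ToyTokenizer:
    """Hypothetical stand-in with the same encode/decode shape as the chapter's tokenizer."""

    def __init__(self, vocab):
        self.str_to_id = dict(vocab)
        self.id_to_str = {i: s for s, i in self.str_to_id.items()}

    def encode(self, text):
        # In this toy vocabulary every piece maps to exactly one ID
        return [self.str_to_id[text]]

    def decode(self, ids):
        # Decoding simply concatenates the strings behind each ID
        return "".join(self.id_to_str[i] for i in ids)


def encode_pieces(tokenizer, pieces):
    """Encode each piece on its own and collect the IDs in order."""
    ids = []
    for piece in pieces:
        ids.extend(tokenizer.encode(piece))
    return ids


# IDs are taken from the solution above; the piece-to-ID pairing
# beyond "Ak" and "w" is an assumption made for this sketch
pieces = ["Ak", "w", "ir", "w", " ", "ier"]
toy = ToyTokenizer({"Ak": 33901, "w": 86, "ir": 343, " ": 220, "ier": 959})

ids = encode_pieces(toy, pieces)
text = toy.decode(ids)
print(ids)
print(repr(text))
```

To try this against the real vocabulary, pass the `tokenizer` object from chapter 2 to `encode_pieces` in place of `toy`.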
Exercise 2.2
The code for the data loader with max_length=2 and stride=2:
dataloader = create_dataloader(
    raw_text, batch_size=4, max_length=2, stride=2
)
It produces batches of the following format:
tensor([[ 40, 367],
[2885, 1464],
[1807, 3619],
[ 402, 271]])
The code for the second data loader, with max_length=8 and stride=2:
dataloader = create_dataloader(
    raw_text, batch_size=4, max_length=8, stride=2
)
An example batch looks like this:
tensor([[ 40, 367, 2885, 1464, 1807, 3619, 402, 271],
[ 2885, 1464, 1807, 3619, 402, 271, 10899, 2138],
[ 1807, 3619, 402, 271, 10899, 2138, 257, 7026],
[ 402, 271, 10899, 2138, 257, 7026, 15632, 438]])
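To see why the two settings produce these rows, the windowing can be sketched in plain Python without PyTorch. This is a simplified illustration, not the `create_dataloader` implementation: `sliding_windows` is a hypothetical helper that returns only input windows (no shifted targets), and the token list is read off the example batches above.

```python
def sliding_windows(token_ids, max_length, stride):
    """Return windows of max_length tokens, one starting every stride tokens."""
    return [
        token_ids[start:start + max_length]
        for start in range(0, len(token_ids) - max_length + 1, stride)
    ]


# First 14 token IDs, read off the example batches above
token_ids = [40, 367, 2885, 1464, 1807, 3619, 402, 271,
             10899, 2138, 257, 7026, 15632, 438]

# max_length=2, stride=2: stride equals window size, so rows do not overlap
no_overlap = sliding_windows(token_ids, max_length=2, stride=2)[:4]

# max_length=8, stride=2: consecutive rows share 8 - 2 = 6 tokens
overlap = sliding_windows(token_ids, max_length=8, stride=2)[:4]

for row in no_overlap:
    print(row)
for row in overlap:
    print(row)
```

With a stride smaller than max_length, each row repeats all but the first `stride` tokens of the previous row, which is the overlap visible in the second batch.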
Chapter 3
Exercise 3.1
The correct weight assignment is