appendix C Exercise solutions

 

The complete code examples for the exercises’ answers can be found in the supplementary GitHub repository at https://github.com/rasbt/LLMs-from-scratch.

Chapter 2

Exercise 2.1

You can obtain the individual token IDs by applying the encoder to one string at a time:

import tiktoken

tokenizer = tiktoken.get_encoding("gpt2")   # the BPE tokenizer used in chapter 2
print(tokenizer.encode("Ak"))
print(tokenizer.encode("w"))
# ...

This prints

[33901]
[86]
# ...

You can then use the following code to assemble the original string:

print(tokenizer.decode([33901, 86, 343, 86, 220, 959]))

This returns

'Akwirw ier'
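
If you want to see which substring each of these IDs represents, you can decode them one at a time, reusing the ID list from the decode call above (a small sketch):

ids = [33901, 86, 343, 86, 220, 959]
for token_id in ids:
    # Decode each ID individually to reveal the substring it stands for
    print(token_id, "->", repr(tokenizer.decode([token_id])))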

Exercise 2.2

The code for the data loader with max_length=2 and stride=2:

# create_dataloader and raw_text are defined in chapter 2
dataloader = create_dataloader(
    raw_text, batch_size=4, max_length=2, stride=2
)

It produces batches of the following format:

tensor([[  40,  367],
        [2885, 1464],
        [1807, 3619],
        [ 402,  271]])

The code for the second data loader with max_length=8 and stride=2:

dataloader = create_dataloader(
    raw_text, batch_size=4, max_length=8, stride=2
)

An example batch looks like this:

tensor([[   40,   367,  2885,  1464,  1807,  3619,   402,   271],
        [ 2885,  1464,  1807,  3619,   402,   271, 10899,  2138],
        [ 1807,  3619,   402,   271, 10899,  2138,   257,  7026],
        [  402,   271, 10899,  2138,   257,  7026, 15632,   438]])
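
As a small sketch (assuming, as in chapter 2, that the data loader yields pairs of input and target tensors), one such batch can be drawn like this:

data_iter = iter(dataloader)
inputs, targets = next(data_iter)   # each has shape (batch_size, max_length)
print(inputs)

Because stride=2 is smaller than max_length=8, consecutive rows of the batch overlap by six token IDs, as the example above shows.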

Chapter 3

Exercise 3.1

The correct weight assignment is as follows.
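
A minimal sketch of this assignment, assuming the sa_v1 (SelfAttention_v1) and sa_v2 (SelfAttention_v2) instances from chapter 3; because nn.Linear stores its weight matrix in transposed form, each matrix is transposed when it is copied over:

import torch

# Copy the trained weights from the nn.Linear layers in sa_v2 into the
# nn.Parameter attributes of sa_v1; .T transposes each matrix because
# nn.Linear stores its weights as (out_features, in_features)
sa_v1.W_query = torch.nn.Parameter(sa_v2.W_query.weight.T)
sa_v1.W_key = torch.nn.Parameter(sa_v2.W_key.weight.T)
sa_v1.W_value = torch.nn.Parameter(sa_v2.W_value.weight.T)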

Exercise 3.2