10 Training a Transformer to translate English to French
This chapter covers
- Tokenizing English and French phrases to subwords
- Understanding word embedding and positional encoding
- Training a Transformer from scratch to translate English to French
- Using the trained Transformer to translate an English phrase into French
In the last chapter, we built a Transformer from scratch that can translate between any two languages, based on the paper “Attention Is All You Need.”1 Specifically, we implemented the self-attention mechanism, using query, key, and value vectors to calculate scaled dot-product attention (SDPA).
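As a quick refresher before we put it to work, the SDPA computation described above can be sketched in a few lines. This is a minimal NumPy illustration of the formula softmax(QKᵀ/√d_k)V for a single sequence, not the exact code from the previous chapter (the function name and toy shapes are chosen here for illustration):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for one sequence."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # (seq_len, seq_len) similarity scores
    # Numerically stable softmax over each row
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                         # each output is a weighted mix of values

# Toy example: 3 tokens, head dimension d_k = 4
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(3, 4))
V = rng.normal(size=(3, 4))
out = scaled_dot_product_attention(Q, K, V)    # shape (3, 4)
```

Each row of the attention weights sums to 1, so every output vector is a convex combination of the value vectors, weighted by query-key similarity.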
To deepen our understanding of self-attention and Transformers, we’ll use English-to-French translation as our case study in this chapter. By working through the process of training a model to convert English sentences into French, you will gain a concrete understanding of the Transformer’s architecture and how the attention mechanism functions.