11 Deep learning for text


This chapter covers

  • Preprocessing text data for machine learning applications
  • Bag-of-words approaches and sequence-modeling approaches for text processing
  • The Transformer architecture
  • Sequence-to-sequence learning

11.1 Natural language processing: The bird’s eye view

In computer science, we refer to human languages, like English or Mandarin, as “natural” languages, to distinguish them from languages that were designed for machines, like Assembly, LISP, or XML. Every machine language was designed: its starting point was a human engineer writing down a set of formal rules to describe what statements you could make in that language and what they meant. Rules came first, and people only started using the language once the rule set was complete. With human language, it’s the reverse: usage comes first, rules arise later. Natural language was shaped by an evolution process, much like biological organisms—that’s what makes it “natural.” Its “rules,” like the grammar of English, were formalized after the fact and are often ignored or broken by its users. As a result, while machine-readable language is highly structured and rigorous, using precise syntactic rules to weave together exactly defined concepts from a fixed vocabulary, natural language is messy—ambiguous, chaotic, sprawling, and constantly in flux.

11.2 Preparing text data

11.2.1 Text standardization

11.2.2 Text splitting (tokenization)

11.2.3 Vocabulary indexing

11.2.4 Using the TextVectorization layer

11.3 Two approaches for representing groups of words: Sets and sequences

11.3.1 Preparing the IMDB movie reviews data

11.3.2 Processing words as a set: The bag-of-words approach

11.3.3 Processing words as a sequence: The sequence model approach

11.4 The Transformer architecture

11.4.1 Understanding self-attention

11.4.2 Multi-head attention

11.4.3 The Transformer encoder
