This chapter covers:
- Preprocessing text data for machine learning applications
- Bag-of-words approaches and sequence-modeling approaches for text processing
- The Transformer architecture
- Sequence-to-sequence learning
In computer science, we refer to human languages, like English or Mandarin, as "natural" languages, to distinguish them from languages that were designed for machines, like Assembly, LISP, or XML. Every machine language was designed: its starting point was a human engineer writing down a set of formal rules to describe what statements you could make in that language and what they meant. Rules came first, and people only started using the language once the rule set was complete. With human language, it's the reverse: usage comes first, and rules arise later. Natural language was shaped by an evolutionary process, much like biological organisms; that's what makes it "natural". Its "rules", like the grammar of English, were formalized after the fact and are often ignored or broken by its users. As a result, while machine-readable language is highly structured and rigorous, using precise syntactic rules to weave together exactly defined concepts from a fixed vocabulary, natural language is messy: ambiguous, chaotic, sprawling, and constantly in flux.