2 Tokens of thought: Natural language words


This chapter covers

  • Parsing your text into words and n-grams (tokens)
  • Tokenizing punctuation, emoticons, and even Chinese characters
  • Consolidating your vocabulary with stemming, lemmatization, and case folding
  • Building a structured numerical representation of natural language text
  • Scoring text for sentiment and prosocial intent
  • Using character frequency analysis to optimize your token vocabulary
  • Dealing with variable-length sequences of words and tokens

So you want to help save the world with the power of natural language processing (NLP)? No matter what task you want your NLP pipeline to perform, it will need to compute something about text. For that, you’ll need a way to represent text in a numerical data structure. The part of an NLP pipeline that breaks your text up into smaller units that can be used to represent it numerically is called a tokenizer. A tokenizer breaks unstructured data (natural language text) into chunks of information that can be counted as discrete elements. These counts of token occurrences in a document can be used directly as a vector representing that document. This immediately turns an unstructured string (a text document) into a numerical data structure suitable for machine learning.
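To make that idea concrete, here is a minimal sketch using only the Python standard library: split a string on whitespace (the crudest possible tokenizer) and count how often each token occurs. The example sentence is just an illustration, not a recipe from later in the chapter.

>>> from collections import Counter
>>> text = "The faster Harry got to the store, the faster Harry would get home."
>>> tokens = text.lower().split()  # simplest possible tokenizer: lowercase, then split on whitespace
>>> Counter(tokens)                # token counts: a crude vector representation of the document
Counter({'the': 3, 'faster': 2, 'harry': 2, 'got': 1, 'to': 1, 'store,': 1, 'would': 1, 'get': 1, 'home.': 1})

Notice that the punctuation stays glued to 'store,' and 'home.', and that 'The' and 'the' were only merged because of the lowercasing. Those wrinkles, punctuation, case folding, and smarter tokenizers, are exactly what the rest of this chapter works through.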

2.1 Tokens and tokenization

2.1.1 Your tokenizer toolbox

2.1.2 The simplest tokenizer

2.1.3 Rule-based tokenization

2.1.4 SpaCy

2.1.5 Finding the fastest word tokenizer

2.2 Beyond word tokens

2.2.1 WordPiece tokenizers

2.3 Improving your vocabulary

2.3.1 Extending your vocabulary with n-grams

2.3.2 Normalizing your vocabulary

2.4 Challenging tokens: Processing logographic languages

2.4.1 A complicated picture: Lemmatization and stemming in Chinese

2.5 Vectors of tokens

2.5.1 One-hot vectors

2.5.2 Bag-of-words vectors

2.5.3 Why not bag of characters?

2.6 Sentiment

2.6.1 VADER: A rule-based sentiment analyzer

2.6.2 Naive Bayes