2 Tokens of thought (natural language words)

This chapter covers

  • Parsing your text into words and n-grams (tokens)
  • Tokenizing punctuation, emoticons, and even Chinese characters
  • Consolidating your vocabulary with stemming, lemmatization, and case folding
  • Building a structured numerical representation of natural language text
  • Scoring text for sentiment and prosocial intent
  • Dealing with variable length sequences of words and tokens

So you want to help save the world with the power of natural language processing (NLP)? First, your NLP pipeline will need to compute something about text, and for that you’ll need a way to represent text in a numerical data structure. The part of an NLP pipeline that breaks up your text to create this structured numerical data is called a parser. For many NLP applications, converting your text to a sequence of words is enough to support searching and classifying text.
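
To see the idea in miniature, here is a sketch of the most naive possible parser, assuming nothing beyond Python's standard library (the example sentence is invented, and splitting on whitespace is only a crude stand-in for the tokenizers this chapter builds up to):

>>> text = "Natural language processing can help save the world."
>>> tokens = text.split()  # naive tokenizer: split on whitespace
>>> tokens
['Natural', 'language', 'processing', 'can', 'help', 'save', 'the', 'world.']

Notice that the trailing period stays glued to 'world.', so 'world.' and 'world' would count as different words. Handling punctuation, contractions, and other messy token boundaries is exactly the kind of challenge the tokenizers later in this chapter are designed to address.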

2.1 What is a token?

2.1.1 Alternative tokens

2.2 Challenges (a preview of stemming)

2.2.1 Tokenization

2.3 Your tokenizer toolbox

2.3.1 The simplest tokenizer

2.3.2 Rule-based tokenization

2.3.3 SpaCy

2.3.4 Tokenizer race

2.4 Wordpiece tokenizers

2.4.1 Clumping characters

2.5 Vectors of tokens

2.5.1 One-hot vectors

2.5.2 BOW (bag-of-words) vectors

2.5.3 Dot product

2.6 Challenging tokens

2.6.1 A complicated picture

2.6.2 Extending your vocabulary with n-grams

2.6.3 Normalizing your vocabulary

2.7 Sentiment

2.7.1 VADER—A rule-based sentiment analyzer

2.7.2 Closeness of vectors