2 Build your vocabulary (word tokenization)

This chapter covers

  • Tokenizing your text into words and n-grams (tokens)
  • Dealing with nonstandard punctuation and emoticons, such as those in social media posts
  • Compressing your token vocabulary with stemming and lemmatization
  • Building a vector representation of a statement
  • Building a sentiment analyzer from handcrafted token scores

So you’re ready to save the world with the power of natural language processing? Well, the first thing you need is a powerful vocabulary. This chapter will help you split a document, any string, into discrete tokens of meaning. Our tokens are limited to words, punctuation marks, and numbers, but the techniques we use are easily extended to any other units of meaning contained in a sequence of characters, such as ASCII emoticons, Unicode emojis, mathematical symbols, and so on.
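
To give you a feel for where we’re headed, here is a minimal sketch of a tokenizer built with nothing but Python’s standard re module. The function name simple_tokenize and its regular expression are illustrative assumptions, not the tokenizers developed later in this chapter; the pattern grabs runs of word characters and peels off punctuation one mark at a time.

>>> import re
>>> def simple_tokenize(text):
...     # \w+ matches runs of letters/digits/underscores;
...     # [^\w\s] matches single punctuation marks;
...     # whitespace is dropped entirely.
...     return re.findall(r"\w+|[^\w\s]", text)
...
>>> simple_tokenize("Hello, NLP world! It's 2 a.m. :)")
['Hello', ',', 'NLP', 'world', '!', 'It', "'", 's', '2', 'a', '.', 'm', '.', ':', ')']

Even this crude pattern separates words from punctuation, but it mangles contractions like "It's" and emoticons like ":)", which is exactly the kind of problem the sections below address.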

2.1 Challenges (a preview of stemming)

2.2 Building your vocabulary with a tokenizer

2.2.1 Dot product

2.2.2 Measuring bag-of-words overlap

2.2.3 A token improvement

2.2.4 Extending your vocabulary with n-grams

2.2.5 Normalizing your vocabulary

2.3 Sentiment

2.3.1 VADER—A rule-based sentiment analyzer

2.3.2 Naive Bayes

Summary
