2 Tokens of thought: Natural language words
This chapter covers
- Parsing your text into words and n-grams (tokens)
- Tokenizing punctuation, emoticons, and even Chinese characters
- Consolidating your vocabulary with stemming, lemmatization, and case folding
- Building a structured numerical representation of natural language text
- Scoring text for sentiment and prosocial intent
- Using character frequency analysis to optimize your token vocabulary
- Dealing with variable-length sequences of words and tokens
So you want to help save the world with the power of natural language processing (NLP)? No matter what task you want your NLP pipeline to perform, it will need to compute something about text. For that, you need a way to represent text in a numerical data structure. The part of an NLP pipeline that breaks your text into smaller units that can then be used to represent it numerically is called a tokenizer. A tokenizer breaks unstructured data (natural language text) into chunks of information that can be counted as discrete elements. The counts of token occurrences in a document can be used directly as a vector representing that document. This immediately turns an unstructured string (a text document) into a numerical data structure suitable for machine learning.
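To make that idea concrete, here is a minimal sketch, not the full pipeline developed later in this chapter, that tokenizes a short string with the simplest possible tokenizer (whitespace splitting) and counts the tokens to build a vector for the document. The example sentence and variable names are illustrative assumptions.

```python
from collections import Counter

# A tiny "document": one unstructured string of natural language text.
text = "Thomas Jefferson began building Monticello at the age of 26."

# The simplest possible tokenizer: split the string on whitespace.
tokens = text.split()

# Count how many times each token occurs in the document.
token_counts = Counter(tokens)

# The counts form a numerical vector representing the document,
# with one dimension for each unique token in the vocabulary.
vocabulary = sorted(token_counts)
vector = [token_counts[token] for token in vocabulary]

print(tokens)
print(vector)
```

Even this crude tokenizer gives you a numerical representation you can feed to a machine learning algorithm; the rest of the chapter refines each step, from handling punctuation to consolidating the vocabulary.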