2 Tokens of thought (natural language words)
This chapter covers
- Parsing your text into words and n-grams (tokens)
- Tokenizing punctuation, emoticons, and even Chinese characters
- Consolidating your vocabulary with stemming, lemmatization, and case folding
- Building a structured numerical representation of natural language text
- Scoring text for sentiment and prosocial intent
- Dealing with variable-length sequences of words and tokens
So you want to help save the world with the power of natural language processing (NLP)? First, your NLP pipeline will need to compute something about text, and for that you need a way to represent text in a numerical data structure. The part of an NLP pipeline that breaks up your text to create this structured numerical data is called a parser. For many NLP applications, such as search and text classification, converting your text into a sequence of words is all you need.
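To make that concrete, here is a minimal sketch of that first step in Python, using nothing but a regular expression and a token counter. The `tokenize` function, its regex, and the example sentence are illustrations for this paragraph only, not the tokenizer this chapter builds; later sections deal with punctuation, emoticons, and languages that don't separate words with spaces.

```python
import re
from collections import Counter

def tokenize(text):
    # Lowercase the text, then keep runs of letters, digits, and apostrophes.
    # A rough sketch only: it throws away punctuation and emoticons, and it
    # assumes words are separated by whitespace (so it won't handle Chinese).
    return re.findall(r"[a-z0-9']+", text.lower())

sentence = "Thomas Jefferson began building Monticello at the age of 26."
tokens = tokenize(sentence)
print(tokens)
# ['thomas', 'jefferson', 'began', 'building', 'monticello',
#  'at', 'the', 'age', 'of', '26']

# One simple structured numerical representation: a bag of words (token counts).
bag_of_words = Counter(tokens)
print(bag_of_words.most_common(3))
```

Even this crude word sequence is enough to power a keyword search or a naive text classifier; the rest of the chapter is about doing each of these steps more carefully.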