In this chapter and the next, you will build your own algorithm that can identify the profile, or even the precise identity, of an anonymous author based solely on their writing. As you will find out over these two chapters, this task brings together several useful NLP concepts and techniques that were introduced in the previous chapters. So far, you have learned that:
- Tokenizers can be applied to split text into individual words.
- Words may be meaningful, or they may simply express some function (e.g., linking other, meaningful words together). Words of the latter kind are called stopwords, and for certain NLP applications you will need to remove them.
- Words are further classified into nouns, verbs, adjectives, and so on, depending on their function. Each such class is assigned a part-of-speech (POS) tag, which can be identified automatically with a POS tagger.
- Words with different functions play different roles in a sentence, and these roles, along with the relations between the words, can be identified with a dependency parser.
- Words can be reduced to their lemmas (dictionary forms) and stems, and you can use lemmatizers and stemmers to detect those.
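
As a quick refresher, three of the steps above (tokenization, stopword removal, and stemming) can be sketched in plain Python. This is only a toy illustration and not the approach used in the previous chapters: the stopword list, the suffix rules, and the function names `tokenize`, `remove_stopwords`, `toy_stem`, and `preprocess` are all assumptions made up for this example. In practice you would rely on an NLP toolkit for these steps. POS tagging and dependency parsing are left out because they need trained models.

```python
import re

# Hypothetical, deliberately tiny stopword list; real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "in", "on", "to",
             "is", "are", "was", "were"}


def tokenize(text):
    """Split text into lowercase word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the toy stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def toy_stem(token):
    """Strip one common suffix; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words stay intact.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stopwords, then stem each remaining token."""
    return [toy_stem(t) for t in remove_stopwords(tokenize(text))]


if __name__ == "__main__":
    print(preprocess("The authors were writing letters in the evenings."))
    # prints ['author', 'writ', 'letter', 'evening']
```

Note that the toy stemmer turns "writing" into "writ", which is not a dictionary word. That is typical of stems, whereas a lemmatizer would be expected to return the dictionary form "write".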