5 Author profiling as a machine-learning task

 

This chapter covers

  • Implementing your user profiling algorithm
  • Exploring NLP techniques with NLTK and spaCy
  • Introducing scikit-learn
  • Applying the Decision Trees machine-learning classifier

In this chapter and the next, you will build your own algorithm that can identify the profile, or even the precise identity, of the anonymous author of a text based solely on their writing. As you will see over these two chapters, this task brings together several useful NLP concepts and techniques that were introduced in the previous chapters. You’ve learned that

  • Tokenizers can be applied to split text into individual words.
  • Words may be meaningful, or they may simply serve some function (e.g., linking other, meaningful words together). Words of the latter kind are called stopwords, and for certain NLP applications you will need to remove them.
  • Words are further classified into nouns, verbs, adjectives, and so on, depending on their function. Each such class is assigned a part-of-speech (POS) tag, which can be identified automatically with a POS tagger.
  • Words with different functions play different roles in a sentence, and these roles, along with the relations between the words, can be identified with a dependency parser.
  • Words can be reduced to base forms called lemmas and stems, and you can use lemmatizers and stemmers to detect those.
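As a quick refresher, the first two points and the last one can be sketched in a few lines of plain Python. This is a toy illustration, not the approach used in the book: the chapters rely on NLTK and spaCy, whereas the regular-expression tokenizer, the tiny `STOPWORDS` set, and the `crude_stem` helper below are hypothetical stand-ins that need no external libraries or data downloads.

```python
import re

# Hypothetical, hand-picked stopword list for illustration only;
# NLTK and spaCy ship much larger, curated lists.
STOPWORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "are", "to"}


def tokenize(text):
    """Split text into lowercase word tokens with a simple regular expression."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the illustrative stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip a few common English suffixes; a toy stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


text = "The authors are writing the chapters of a book"
tokens = tokenize(text)
content_words = remove_stopwords(tokens)
stems = [crude_stem(t) for t in content_words]

print(tokens)
print(content_words)
print(stems)
```

Note that the toy stemmer turns "writing" into "writ", which is not a dictionary word. Real stemmers also produce such truncated forms, and this is one of the differences between a stem and a lemma.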

5.1 Understanding the task

5.1.1 Case 1: Authorship attribution

5.1.2 Case 2: User profiling

5.2 Machine-learning pipeline at first glance

5.2.1 Original data

5.2.2 Testing generalization behavior

5.2.3 Setting up the benchmark

5.3 A closer look at the machine-learning pipeline

5.3.1 Decision Trees classifier basics

Summary