chapter five

5 Author Profiling as a Machine Learning Task

 

This chapter covers:

  • Implementation of your own author (user) profiling algorithm
  • Further useful NLP techniques with NLTK and spaCy
  • Introduction to sklearn
  • Application of a new machine learning classifier, Decision Trees

In this chapter and the next, you will build your own algorithm that can identify the profile, or even the precise identity, of the anonymous author of a text based solely on their writing. As you will find out over the course of these two chapters, this task brings together a number of useful NLP concepts and techniques that were introduced in the previous chapters. You’ve learned that:

  • tokenizers can be applied to split text into individual words;
  • words may carry meaning of their own, or they may simply serve a function, e.g. linking other, meaningful words together; words of the latter kind are called stopwords, and for certain NLP applications you will need to remove them;
  • depending on their function, words are further classified into nouns, verbs, adjectives and so on; each such class is assigned a part-of-speech tag, which can be identified automatically with a part-of-speech tagger;
  • words with different functions play different roles in a sentence, and these roles, together with the relations between the words, can be identified with a dependency parser;
  • each word has a lemma (its dictionary form) and a stem (its base form with affixes stripped), and you can use lemmatizers and stemmers to identify those.
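
As a quick refresher, the first, second and last of these steps can be sketched in a few lines of plain Python. This is only an illustrative toy under stated assumptions, not the NLTK or spaCy tooling used in the book: the names `STOPWORDS`, `tokenize`, `remove_stopwords` and `naive_stem` are hypothetical, the stopword list is deliberately tiny, and the suffix-stripping function is a crude stand-in for a real stemmer. Part-of-speech tagging and dependency parsing need trained models, so they are left out here.

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration only;
# real lists (such as the one shipped with NLTK) are much longer.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "on"}


def tokenize(text):
    """Split text into lowercase word tokens with a simple regular expression."""
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())


def remove_stopwords(tokens):
    """Keep only the tokens that are not in the stopword list."""
    return [token for token in tokens if token not in STOPWORDS]


def naive_stem(token):
    """Strip one common suffix, keeping at least three characters.

    This is a crude stand-in for a real stemmer, not the Porter algorithm.
    """
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


text = "The author is writing letters to the editors of a journal."

tokens = tokenize(text)              # all word tokens, lowercased
content = remove_stopwords(tokens)   # tokens that carry meaning
stems = [naive_stem(token) for token in content]

print(tokens)
print(content)
print(stems)
```

Note that this toy stemmer turns "writing" into "writ": a stem does not have to be a dictionary word, which is one way in which stems differ from lemmas.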

5.1       Understanding the task

5.2       Machine Learning pipeline at first glance

5.2.1   Original data

5.2.2   Testing generalization behavior

5.2.3   Setting up the benchmark

5.3       A closer look at the machine learning pipeline

5.3.1   Decision Trees classifier basics

5.3.2   Evaluating which tree is better using node impurity

5.3.3   Selection of the best split in Decision Trees

5.3.4   Decision Trees on language data

5.4       Summary