6 Linguistic feature engineering for author profiling

This chapter covers

  • Improving the implementation of your user profiling algorithm
  • Discovering strategies for linguistic feature engineering
  • Exploring other useful NLP techniques with NLTK and spaCy
  • Applying a Decision Tree classifier with sklearn
  • Evaluating a machine-learning classifier in application to an NLP task

The last chapter introduced the task of author (user) profiling and focused on authorship identification, which we described as a good example of how machine learning can be applied to build an NLP application. Machine learning suits this task for three reasons:

  • We can clearly define classes for this task. In particular, you were detecting which of the two authors, Jane Austen (class1) or William Shakespeare (class2), produced a piece of writing. This is a binary task, as there are two classes to distinguish between.
  • We can get good-quality data to work with. Chapter 5 showed how you could access literary texts using NLTK’s interface to Project Gutenberg. Literary works by famous writers are widely and often freely available, and we can rely on the author assignment in this data—there is no doubt as to who the author of Macbeth or Sense and Sensibility is.
  • We can define features. For instance, one of the strongest characteristics of individual writing style is the selection of words, as we all have our own favorite words that we tend to use more frequently than other people around us.
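To make these three ingredients concrete, here is a minimal sketch of how word choice might be turned into features and paired with class labels. It is an illustration under stated assumptions, not the implementation from chapter 5: the helper name `word_frequency_features`, the three-word vocabulary, and the two toy sentences are all invented for this example, and real input would come from the Project Gutenberg texts instead.

```python
from collections import Counter


def word_frequency_features(tokens, vocabulary):
    """Return the relative frequency of each vocabulary word in tokens.

    Hypothetical helper for illustration: tokens are lowercased and
    non-alphabetic tokens (punctuation, numbers) are ignored.
    """
    lowered = [t.lower() for t in tokens if t.isalpha()]
    counts = Counter(lowered)
    total = len(lowered)
    if total == 0:
        return {word: 0.0 for word in vocabulary}
    return {word: counts[word] / total for word in vocabulary}


# Toy token lists standing in for real texts (invented for illustration).
# With NLTK's Gutenberg data installed, they could instead come from
# nltk.corpus.gutenberg.words("austen-sense.txt") and
# nltk.corpus.gutenberg.words("shakespeare-macbeth.txt").
austen_like = "She could not but think that she had been very happy".split()
shakespeare_like = "Thou art not what thou wert and thou shalt not be".split()

# Assumed mini-vocabulary; a real feature set would be far larger.
vocabulary = ["she", "thou", "not"]

# Each example pairs a feature dictionary with its class label.
labelled_examples = [
    (word_frequency_features(austen_like, vocabulary), "class1"),
    (word_frequency_features(shakespeare_like, vocabulary), "class2"),
]
```

Each entry in `labelled_examples` is a (features, label) pair, which is the general shape a classifier needs for training. Relative frequencies are used rather than raw counts so that texts of different lengths stay comparable.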

6.1 Another close look at the machine-learning pipeline

6.1.1 Evaluating the performance of your classifier

6.1.2 Further evaluation measures

6.2 Feature engineering for authorship attribution

6.2.1 Word and sentence length statistics as features

6.2.2 Counts of stopwords and proportion of stopwords as features

6.2.3 Distributions of parts of speech as features

6.2.4 Distribution of word suffixes as features

6.2.5 Unique words as features