The last chapter introduced the task of author (user) profiling and focused on authorship identification. We said that it is a good example of how machine learning can be applied to build an NLP application. This works for three reasons:
- We can clearly define classes for this task. In particular, you were detecting which of the two authors, Jane Austen (class1) or William Shakespeare (class2), produced a piece of writing. This is a binary task, as there are two classes to distinguish between.
- We can get good-quality data to work with. Chapter 5 showed how you could access literary texts using NLTK’s interface to Project Gutenberg. Literary works by famous writers are widely and often freely available, and we can rely on the author assignment in this data—there is no doubt as to who the author of Macbeth or Sense and Sensibility is.
- We can define features. For instance, one of the strongest characteristics of individual writing style is the selection of words, as we all have our own favorite words that we tend to use more frequently than other people around us.
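
To make the last point concrete, here is a minimal sketch of word-choice features, assuming a simple lowercase regular-expression tokenizer rather than the NLTK pipeline from chapter 5. The function name `word_frequency_features`, the `VOCABULARY` list, and the two one-sentence samples are hypothetical and chosen only for illustration; a real experiment would use full texts and a much larger vocabulary.

```python
import re
from collections import Counter


def word_frequency_features(text, vocabulary):
    """Map each vocabulary word to its relative frequency in the text."""
    # Assumption: a crude tokenizer that keeps lowercase letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter(tokens)
    total = len(tokens) or 1  # avoid division by zero on empty input
    return {word: counts[word] / total for word in vocabulary}


# Hypothetical illustration data: one sentence per author.
AUSTEN_SAMPLE = (
    "It is a truth universally acknowledged, that a single man in "
    "possession of a good fortune, must be in want of a wife."
)
SHAKESPEARE_SAMPLE = "Thou art more lovely and more temperate"

# Hypothetical feature vocabulary: words whose usage may differ between authors.
VOCABULARY = ["thou", "art", "must", "a"]

if __name__ == "__main__":
    labelled_samples = [
        ("austen", AUSTEN_SAMPLE),
        ("shakespeare", SHAKESPEARE_SAMPLE),
    ]
    for label, text in labelled_samples:
        print(label, word_frequency_features(text, VOCABULARY))
```

Relative frequencies are used instead of raw counts so that texts of different lengths stay comparable. Each resulting dictionary can serve as one feature vector for a binary classifier that separates the two author classes.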