4 Textual similarity

 

This chapter covers

  • Representing data for authorship analysis with deep learning
  • Applying classifiers to authorship attribution
  • Understanding the merits of MLPs and CNNs for authorship attribution
  • Verifying authorship with Siamese networks

One of the most common tasks in natural language processing (NLP) is determining whether two texts are similar. Typical applications include

  • Document retrieval—Determining query-result similarity
  • Topic labeling—Assigning a topic to an unlabeled text based on similarity with a set of labeled texts
  • Authorship analysis—Determining whether a text is written by a certain author, based on texts attributed to that author
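To make the notion of text similarity concrete, here is a minimal sketch that scores two texts by the cosine similarity of their word-count vectors. This is only an illustrative baseline and not one of the deep learning methods developed in this chapter; the function names `bag_of_words` and `cosine_similarity` and the simple regular-expression tokenizer are assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its word tokens (naive tokenizer, assumed for illustration)."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between the word-count vectors of two texts."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    # Only words shared by both texts contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a text without tokens is treated as similar to nothing
    return dot / (norm_a * norm_b)
```

A score near 1 means the two texts use largely the same words in similar proportions, while a score of 0 means they share no words at all. Under this sketch, each application in the list above amounts to comparing such scores: a query against candidate documents, an unlabeled text against labeled ones, or a disputed text against an author's known writing.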

We will approach text similarity from the perspective of authorship analysis, which comprises two main tasks:

  • Authorship attribution—The problem of assigning a text to one of many authors
  • Authorship verification—The problem of deciding whether a certain text of unknown origin is written by the known author of another text

In this chapter, we will go through a few practical scenarios and investigate techniques for assessing authorship of documents.

4.1 The problem

We start with a general scenario: who, of many potential authors, wrote a particular document?

4.2 The data

4.2.1 Authorship attribution and verification data

4.3 Data representation

4.3.1 Segmenting documents

4.3.2 Word-level information

4.3.3 Subword-level information