chapter four

4 Textual similarity

This chapter covers

Representing data for authorship analysis with deep learning
Applying classifiers to authorship attribution
Understanding the merits of MLPs and CNNs for authorship attribution
Verifying authorship with Siamese networks

One of the most common tasks in natural language processing (NLP) is determining whether two texts are similar. Applications include

Document retrieval—Determining query-result similarity
Topic labeling—Assigning a topic to an unlabeled text based on similarity with a set of labeled texts
Authorship analysis—Determining whether a text is written by a certain author, based on texts attributed to that author

We will approach the topic of text similarity from the perspective of authorship analysis. There are two main topics in authorship analysis:

Authorship attribution—The problem of assigning a text to one of many authors
Authorship verification—The problem of deciding whether a certain text of unknown origin is written by the known author of another text
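To make the difference between the two problems concrete, here is a minimal sketch in Python. It is not the approach developed in this chapter: it assumes a crude bag-of-words representation and cosine similarity as the similarity measure, and the function names (`bag_of_words`, `cosine`, `attribute`, `verify`), the toy texts, and the threshold of 0.5 are all hypothetical choices made for illustration.

```python
from collections import Counter
from math import sqrt


def bag_of_words(text):
    """Lowercased, whitespace-tokenized word counts (a deliberately crude representation)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[word] for word, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(text, known_texts):
    """Attribution: choose the most similar author out of many candidates."""
    query = bag_of_words(text)
    return max(
        known_texts,
        key=lambda author: cosine(query, bag_of_words(known_texts[author])),
    )


def verify(text, reference_text, threshold=0.5):
    """Verification: yes/no decision against one known author's text.

    The threshold is an arbitrary assumption, not a recommended value.
    """
    return cosine(bag_of_words(text), bag_of_words(reference_text)) >= threshold


# Toy data, invented for illustration only.
known_texts = {
    "author_a": "the sea was calm and the ship sailed on",
    "author_b": "neural networks learn weights from labeled data",
}
unknown = "the ship sailed on the calm sea"

predicted_author = attribute(unknown, known_texts)
same_author = verify(unknown, known_texts["author_a"])
```

Attribution returns one label from a closed set of candidates, whereas verification returns a binary decision about a single pair of texts. That difference in output shape is what separates the two problems, whatever representation and similarity measure are used.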

In this chapter, we will go through a few practical scenarios and investigate techniques for assessing authorship of documents.

4.1 The problem

We start with a general scenario: who, of many potential authors, wrote a particular document?

4.2 The data

4.2.1 Authorship attribution and verification data

4.3 Data representation

4.3.1 Segmenting documents

4.3.2 Word-level information

4.3.3 Subword-level information
