One of the most common tasks in natural language processing (NLP) is determining whether two texts are similar. Applications include:
- Document retrieval—Determining query-result similarity
- Topic labeling—Assigning a topic to an unlabeled text based on similarity with a set of labeled texts
- Authorship analysis—Determining whether a text is written by a certain author, based on texts attributed to that author
We will approach text similarity from the perspective of authorship analysis, which has two main topics: