13 Measuring text similarities

 

This section covers

  • What is natural language processing?
  • Comparing texts based on word overlap
  • Comparing texts using one-dimensional arrays called vectors
  • Comparing texts using two-dimensional arrays called matrices
  • Efficient matrix computation using NumPy

Rapid text analysis can save lives. Consider a real-world incident in which US soldiers stormed a terrorist compound. In the compound, they discovered a computer containing terabytes of archived data. The data included documents, text messages, and emails pertaining to terrorist activities. The documents were too numerous for any single person to read. Fortunately, the soldiers were equipped with special software that could perform very fast text analysis. The software allowed them to process all of the text data without leaving the compound. The onsite analysis immediately revealed an active terrorist plot in a nearby neighborhood. The soldiers responded at once and prevented a terrorist attack.

This swift defensive response would not have been possible without natural language processing (NLP) techniques. NLP is a branch of data science that focuses on speedy text analysis. Typically, NLP is applied to very large text datasets. NLP use cases are numerous and diverse and include the following:

13.1 Simple text comparison

 
 

13.1.1 Exploring the Jaccard similarity

 
 

13.1.2 Replacing words with numeric values

 
 

13.2 Vectorizing texts using word counts

 
 
 

13.2.1 Using normalization to improve TF vector similarity

 

13.2.2 Using unit vector dot products to convert between relevance metrics

 
 
 
 

13.3 Matrix multiplication for efficient similarity calculation

 
 
 

13.3.1 Basic matrix operations

 
 
 

13.3.2 Computing all-by-all matrix similarities

 
 
 
 

13.4 Computational limits of matrix multiplication

 
 
 

Summary

 
 