chapter thirteen

13 Measuring Text Similarities

This section covers:

  • What is Natural Language Processing?
  • Comparing texts based on word overlap
  • Comparing texts using 1-dimensional arrays called vectors
  • Comparing texts using 2-dimensional arrays called matrices
  • Efficient matrix computation using NumPy

Rapid text analysis can save lives. Consider a real-world incident in which US soldiers stormed a terrorist compound. In the compound, they discovered a computer containing terabytes of archived data. The data included documents, text messages, and emails pertaining to terrorist activities. The documents were too numerous to be read by any single human being. Fortunately, the soldiers were equipped with special software for very fast text analysis. The software allowed the soldiers to process all of the text data without even having to leave the compound. The onsite analysis immediately revealed an active terrorist plot in a nearby neighborhood. The soldiers responded to the plot at once and prevented a terrorist attack.

This swift defensive response would not have been possible without NLP techniques. NLP stands for Natural Language Processing, a branch of data science that focuses on speedy text analysis. Typically, NLP is applied to very large text datasets. NLP use cases are numerous and diverse. They include:

13.1  Simple Text Comparison

13.1.1  Introduction to the Jaccard Similarity

13.1.2  Replacing Words with Numeric Values

13.2  Vectorizing Texts Using Word Counts

13.2.1  Using Normalization to Improve TF Vector Similarity

13.3  Matrix Multiplication for Efficient Similarity Calculation

13.3.1  Basic Matrix Operations

13.3.2  Computing All-By-All Matrix Similarities

13.4  Computational Limits of Matrix Multiplication

13.5  Summary