4 Finding meaning in word counts: Semantic analysis


This chapter covers

  • Analyzing semantics (meaning) to create topic vectors
  • Semantic search using the semantic similarity between topic vectors
  • Scalable semantic analysis and semantic search for large corpora
  • Using semantic components (topics) as features in your NLP pipeline
  • Navigating high-dimensional vector spaces

In the first few chapters, you learned quite a few natural language processing tricks, but this may be the first time you get to do a little bit of “magic.” It is the first time we will talk about a machine being able to understand the meanings of words.

The term frequency–inverse document frequency (TF–IDF) vectors you learned about in chapter 3 helped you estimate the importance of words in a chunk of text. You used TF–IDF vectors and matrices to tell you how important each word is to the overall meaning of a bit of text in a document collection. These TF–IDF “importance” scores work not only for individual words but also for short sequences of words, called n-grams. They are great for searching text when you know the exact words or n-grams you’re looking for, but they break down when a document expresses the same idea with different words. Often, you need a representation that captures not just the counts of words but also their meanings.
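
To see this limitation concretely, here is a minimal sketch using scikit-learn’s TfidfVectorizer (a hypothetical example, not the pipeline you built in chapter 3). A TF–IDF search scores a document only by the query tokens it shares, so a synonym such as “feline” earns a document no credit for a “cat” query.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The cat sat on the mat.",           # mentions "cat" explicitly
    "A feline rested on the rug.",       # same idea, different words
    "Stock prices fell sharply today.",  # unrelated topic
]
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # one TF-IDF row vector per document
query = vectorizer.transform(["cat"])   # TF-IDF vector for the query "cat"
print(cosine_similarity(query, tfidf))
# Only the first document gets a nonzero similarity score. The second
# document means nearly the same thing, but it shares no tokens with
# the query, so its TF-IDF cosine similarity is exactly 0.0.

The topic vectors you will build in this chapter are designed to fix exactly this: a document about a “feline” should score well for a “cat” query because the two words share a topic, even though they share no tokens.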

4.1 From word counts to topic scores

4.1.1 The limitations of TF–IDF vectors and lemmatization

4.1.2 Topic vectors

4.1.3 Thought experiment

4.1.4 Algorithms for scoring topics

4.2 The challenge: Detecting toxicity

4.2.1 Linear discriminant analysis classifier

4.2.2 Going beyond linear

4.3 Reducing dimensions

4.3.1 Enter principal component analysis

4.3.2 Singular value decomposition

4.4 Latent semantic analysis

4.4.1 Diving into semantic analysis

4.4.2 TruncatedSVD or PCA?

4.4.3 How well does LSA perform for toxicity detection?

4.4.4 Other ways to reduce dimensions

4.5 Latent Dirichlet allocation

4.5.1 The LDiA idea

Summary