4 Finding meaning in word counts (semantic analysis)

This chapter covers

  • Analyzing semantics (meaning) to create topic vectors
  • Semantic search using the similarity between topic vectors
  • Scalable semantic analysis and semantic search for large corpora
  • Using semantic components (topics) as features in your NLP pipeline
  • Navigating high-dimensional vector spaces

You’ve learned quite a few natural language processing tricks, but this may be the first time you get to do a little magic: it’s the first time we talk about a machine understanding the “meaning” of words.

The TF-IDF (term frequency–inverse document frequency) vectors and matrices from chapter 3 helped you estimate how important each word is to the overall meaning of a chunk of text within a document collection.
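
As a quick refresher, here is a minimal sketch of building those TF-IDF vectors with scikit-learn; the three-sentence corpus is made up for illustration and isn’t the corpus from chapter 3.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny made-up corpus stands in for a real document collection.
docs = [
    "The faster Harry got to the store, the faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]

vectorizer = TfidfVectorizer()          # defaults: lowercasing, word tokens
tfidf = vectorizer.fit_transform(docs)  # one TF-IDF vector per document

print(tfidf.shape)  # (3 documents, vocabulary size)
```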

These TF-IDF “importance” scores work not only for individual words but also for short sequences of words, called n-grams. Importance scores for n-grams are great for searching text when you know the exact words or n-grams you’re looking for.
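
One way to turn those scores into a keyword search, sketched below with scikit-learn and the same made-up corpus (the query is made up, too): vectorize the documents with n-grams included, vectorize the query the same way, and rank documents by cosine similarity.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "The faster Harry got to the store, the faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]

# ngram_range=(1, 2) scores unigrams and bigrams, so exact
# two-word phrases in the query can match as well.
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
tfidf = vectorizer.fit_transform(docs)

query = vectorizer.transform(["hairy and faster"])
scores = cosine_similarity(query, tfidf)[0]
print(docs[scores.argmax()])  # best-matching document
```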

NLP researchers found an algorithm that reveals the meaning of word combinations and computes vectors to represent that meaning: latent semantic analysis (LSA). With this tool you can represent not only the meaning of individual words as vectors, but the meaning of entire documents as well.
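
For a preview of where this chapter is headed, here is a minimal LSA sketch, assuming scikit-learn’s TruncatedSVD as the SVD implementation and reusing the made-up corpus from above.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "The faster Harry got to the store, the faster Harry would get home.",
    "Harry is hairy and faster than Jill.",
    "Jill is not as hairy as Harry.",
]

tfidf = TfidfVectorizer().fit_transform(docs)

# LSA: a truncated SVD of the TF-IDF matrix projects each document
# onto a small number of "topic" dimensions.
lsa = TruncatedSVD(n_components=2)
topic_vectors = lsa.fit_transform(tfidf)

print(topic_vectors.shape)  # (3 documents, 2 topics)
```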

4.1 From word counts to topic scores

4.1.1 TF-IDF vectors and lemmatization

4.1.2 Topic vectors

4.1.3 Thought experiment

4.1.4 An algorithm for scoring topics

4.1.5 An LDA classifier

4.2 Latent semantic analysis

4.2.1 Your thought experiment made real

4.3 Singular value decomposition

4.3.1 U—left singular vectors

4.3.2 S—singular values

4.3.3 Vᵀ—right singular vectors
