3 Math with words (TF-IDF vectors)


This chapter covers

  • Counting words and term frequencies to analyze meaning
  • Predicting word occurrence probabilities with Zipf’s Law
  • Representing words as vectors and how to start using them
  • Finding relevant documents from a corpus using inverse document frequencies
  • Estimating the similarity of pairs of documents with cosine similarity and Okapi BM25

Having collected and counted words (tokens), and bucketed them into stems or lemmas, you can now do something interesting with them. Detecting words is useful for simple tasks, like getting statistics about word usage or doing keyword search. But you’d like to know which words are more important to a particular document and across the corpus as a whole. Then you can use that “importance” value to find relevant documents in a corpus based on keyword importance within each document.
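As a rough preview of that idea, here is a minimal sketch in plain Python. The toy corpus, the whitespace tokenizer, and the helper names `tf_idf` and `rank` are all assumptions made up for illustration, and the scoring uses one common TF-IDF variant (relative term frequency times the natural log of inverse document frequency), not necessarily the exact formula developed later in this chapter.

```python
import math
from collections import Counter

# Hypothetical toy corpus, invented for illustration only.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "the bird sang",
]
# Naive whitespace tokenization (an assumption; real tokenizers do more).
tokenized = [doc.split() for doc in docs]


def tf_idf(term, tokens, corpus):
    """Score one term in one document: term frequency times log IDF."""
    # Term frequency: share of the document's tokens that are `term`.
    tf = Counter(tokens)[term] / len(tokens)
    # Document frequency: number of documents that contain `term`.
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        return 0.0  # a term absent from the corpus carries no score
    idf = math.log(len(corpus) / n_containing)
    return tf * idf


def rank(term, corpus):
    """Return document indices ordered from most to least relevant."""
    scores = [tf_idf(term, tokens, corpus) for tokens in corpus]
    return sorted(range(len(corpus)), key=lambda i: scores[i], reverse=True)
```

In this sketch a word that appears in every document, such as "the", scores zero everywhere, while a rarer word such as "cat" ranks the documents that use it most heavily first.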

3.1 Bag of words

3.2 Vectorizing

3.2.1 Vector spaces

3.3 Zipf’s Law

3.4 Topic modeling

3.4.1 Return of Zipf

3.4.2 Relevance ranking

3.4.3 Tools

3.4.4 Alternatives

3.4.5 Okapi BM25

3.4.6 What’s next