3 Math with words: Term frequency–inverse document frequency vectors

This chapter covers

  • Counting words, n-grams, and term frequencies to analyze meaning
  • Predicting word occurrence probabilities with Zipf’s law
  • Representing natural language texts as vectors
  • Finding relevant documents in a collection of text using document frequencies
  • Estimating the similarity of pairs of documents with cosine similarity

3.1 Bag-of-words vectors

3.2 Vectorizing text with a DataFrame constructor

3.2.1 Faster, better, easier token counting

3.2.2 Vectorizing your code

3.2.3 Vector space TF–IDF (term frequency–inverse document frequency)

3.3 Vector distance and similarity

3.3.1 Dot product

3.4 Counting TF–IDF frequencies

3.4.1 Analyzing “this”

3.5 Zipf’s law

3.6 Inverse document frequency

3.6.1 Return of Zipf

3.6.2 Relevance ranking

3.6.3 Smoothing out the math

3.7 Using TF–IDF for your bot

3.8 What’s next

3.9 Test yourself

Summary