3 Math with words: Term frequency–inverse document frequency vectors

This chapter covers

  • Counting words, n-grams, and term frequencies to analyze meaning
  • Predicting word occurrence probabilities with Zipf’s law
  • Representing natural language texts as vectors
  • Finding relevant documents in a collection of text using document frequencies
  • Estimating the similarity of pairs of documents with cosine similarity

3.1 Bag-of-words vectors

3.2 Vectorizing text with a DataFrame constructor

3.2.1 Faster, better, easier token counting

3.2.2 Vectorizing your code

3.2.3 Vector space TF–IDF (term frequency–inverse document frequency)

3.3 Vector distance and similarity

3.3.1 Dot product

3.4 Counting TF–IDF frequencies

3.4.1 Analyzing “this”

3.5 Zipf’s law

3.6 Inverse document frequency

3.6.1 Return of Zipf

3.6.2 Relevance ranking

3.6.3 Smoothing out the math

3.7 Using TF–IDF for your bot

3.8 What’s next

3.9 Test yourself

Summary