concept Jaccard similarity in category data

This is an excerpt from Manning's book Data Science Bookcamp: Five Python Projects MEAP V04 livebook.
This similarity metric is referred to as the Jaccard similarity, or the Jaccard index.
Figure 13.2. A visualized representation of the Jaccard similarity between two texts.
The Jaccard similarity between texts 1 and 2 is illustrated in Figure 13.2. In the figure, the words of the two texts are represented as two circles. The circle on the left holds all words within text 1. The circle on the right holds all words within text 2. The two circles intersect, and the intersection holds all words that are shared between texts 1 and 2. The Jaccard similarity equals the fraction of all distinct words that fall within the intersection. Four of the nine words in the diagram appear in the intersection. Therefore, the Jaccard similarity is equal to 4 / 9.
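The 4 / 9 count can be checked by hand with Python sets. The snippet is a minimal sketch: the two word sets are an assumption, written out manually by lowercasing the two seashell sentences used in this section and dropping their punctuation, and the variable names are illustrative only.

```python
from fractions import Fraction

# Assumed word sets for texts 1 and 2 (lowercased, punctuation removed by hand).
words_1 = {"she", "sells", "seashells", "by", "the", "seashore"}
words_2 = {"seashells", "the", "are", "on", "sale", "by", "seashore"}

shared = words_1 & words_2  # words in the overlap of the two circles
total = words_1 | words_2   # every distinct word in either circle

# Jaccard similarity: shared words divided by all distinct words.
jaccard = Fraction(len(shared), len(total))
print(sorted(shared))
print(f"{len(shared)} shared / {len(total)} total = {float(jaccard):.4f}")
```

Keeping the result as a `Fraction` makes the 4 / 9 ratio visible before it is rounded to a decimal.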
Let's define a function to compute the Jaccard similarity.
Listing 13.9. Computing the Jaccard similarity

    def jaccard_similarity(text_a, text_b):
        word_set_a, word_set_b = [set(simplify_text(text).split())
                                  for text in [text_a, text_b]]
        num_shared = len(word_set_a & word_set_b)
        num_total = len(word_set_a | word_set_b)
        return num_shared / num_total

    for text in [text2, text3]:
        similarity = jaccard_similarity(text1, text)
        print(f"The Jaccard similarity between '{text1}' and '{text}' "
              f"equals {similarity:.4f}." "\n")

    The Jaccard similarity between 'She sells seashells by the seashore.' and '"Seashells! The seashells are on sale! By the seashore."' equals 0.4444.

    The Jaccard similarity between 'She sells seashells by the seashore.' and 'She sells 3 seashells to John, who lives by the lake.' equals 0.4167.

Our implementation of the Jaccard similarity is functional, but not very efficient. The function executes two set-comparison operations, `word_set_a & word_set_b` and `word_set_a | word_set_b`. These operations compare and contrast all words between two sets. In Python, such comparisons are computationally costlier than streamlined numerical analysis.
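One possible way to drop the second set operation is to use the identity |A ∪ B| = |A| + |B| - |A ∩ B|, so that only the intersection is computed as a set. The sketch below is not necessarily the approach the book takes next. The function name `jaccard_with_one_set_op` is hypothetical, and `simplify_text` is an assumed stand-in (lowercase, strip ASCII punctuation) for the helper the listing relies on but that is not shown in this excerpt.

```python
import string

def simplify_text(text):
    # Assumed stand-in for the book's helper: lowercase and drop ASCII punctuation.
    return text.lower().translate(str.maketrans("", "", string.punctuation))

def jaccard_with_one_set_op(text_a, text_b):
    # Hypothetical variant of jaccard_similarity that skips the union operation.
    word_set_a, word_set_b = [set(simplify_text(text).split())
                              for text in (text_a, text_b)]
    num_shared = len(word_set_a & word_set_b)
    # Inclusion-exclusion: |A ∪ B| = |A| + |B| - |A ∩ B|
    num_total = len(word_set_a) + len(word_set_b) - num_shared
    return num_shared / num_total

text1 = 'She sells seashells by the seashore.'
text2 = '"Seashells! The seashells are on sale! By the seashore."'
text3 = 'She sells 3 seashells to John, who lives by the lake.'

for text in [text2, text3]:
    print(f"{jaccard_with_one_set_op(text1, text):.4f}")
```

The union size now comes from simple integer arithmetic on the set lengths, so the per-word comparison work happens only once, in the intersection.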