concept `Jaccard similarity` in category `data`

appears as: Jaccard similarity, The Jaccard similarity, Jaccard Similarity, The Jaccard similarity

Data Science Bookcamp: Five Python Projects MEAP V04 livebook

This is an excerpt from Manning's book Data Science Bookcamp: Five Python Projects MEAP V04 livebook. Login to get full access to this book.

This similarity metric is referred to as the Jaccard similarity, or the Jaccard index.

Figure 13.2. A visualized representation of the Jaccard similarity between two texts.

The Jaccard similarity between texts 1 and 2 is visually illustrated in Figure 13.2. With the figure, the words the two texts are are represented as two circles. The circle on the left holds all words within text 1. The circle on the right holds all words within text 2. The two circles intersect. The intersection holds all words that are shared between text 1 and 2. The Jaccard similarity equals the fraction of total words that are present in the intersection. Four of the nine words in the diagram appear in the intersection. Therefore, the Jaccard similarity is equal to 4 / 9.

to see more go to 13 Measuring Text Similarities

Lets define a function to compute the Jaccard similarity.
Listing 13.9. Computing the Jaccard similarity
def jaccard_similarity(text_a, text_b):
    word_set_a, word_set_b = [set(simplify_text(text).split())
                              for text in [text_a, text_b]]
    num_shared = len(word_set_a & word_set_b)
    num_total = len(word_set_a | word_set_b)
    return num_shared / num_total

for text in [text2, text3]:
    similarity = jaccard_similarity(text1, text)
    print(f"The Jaccard similarity between '{text1}' and '{text}' "
          f"equals {similarity:.4f}." "\n")
The Jaccard similarity between 'She sells seashells by the seashore.' and '"Seashells! The seashells are on sale! By the seashore."' equals 0.4444.

The Jaccard similarity between 'She sells seashells by the seashore.' and 'She sells 3 seashells to John, who lives by the lake.' equals 0.4167.
Our implementation of the Jaccard similarity is functional, but not very efficient. The function executes two set-comparison operations, word_set_a & word_set_b and word_set_a | word_set_b. These operations compare and contrast all words between two sets. In Python, such comparisons are computationally costlier than streamlined numerical analysis.

to see more go to 13 Measuring Text Similarities

concept Jaccard similarity in category data

Data Science Bookcamp: Five Python Projects MEAP V04 livebook

Figure 13.2. A visualized representation of the Jaccard similarity between two texts.

Listing 13.9. Computing the Jaccard similarity

Unable to load book!

concept `Jaccard similarity` in category `data`