concept Tanimoto similarity in category data

appears as: Tanimoto similarity, The Tanimoto similarity, Tanimoto similarities

This is an excerpt from Manning's book Data Science Bookcamp: Five Python Projects MEAP V04 livebook.

We are comparing Normalized Query Vector and Normalized Title A vector
The Tanimoto similarity between vectors is 1.0000
The cosine similarity between vectors is 1.0000
The Euclidean distance between vectors is 0.0000
The angle between vectors is 0.0000 degrees

We are comparing Normalized Query Vector and Title B Vector
The Tanimoto similarity between vectors is 0.5469
The cosine similarity between vectors is 0.7071
The Euclidean distance between vectors is 0.7654
The angle between vectors is 45.0000 degrees
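The four measurements printed above can be reproduced with a small helper function. The sketch below is an illustration, not the book's code; the function name compare and its parameters are hypothetical, and the general Tanimoto formula a·b / (a·a + b·b − a·b) is used so the function also works for non-normalized vectors.

import numpy as np
from numpy.linalg import norm

def compare(name_a, vector_a, name_b, vector_b):
    print(f"We are comparing {name_a} and {name_b}")
    dot_product = vector_a @ vector_b
    # Tanimoto similarity: dot product divided by the "union" of the two vectors
    tanimoto = dot_product / (vector_a @ vector_a + vector_b @ vector_b - dot_product)
    # Cosine similarity: dot product of the two unit-length vectors
    cosine = dot_product / (norm(vector_a) * norm(vector_b))
    euclidean = norm(vector_a - vector_b)
    # Angle in degrees; clip guards against floating-point drift outside [-1, 1]
    angle = np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))
    print(f"The Tanimoto similarity between vectors is {tanimoto:.4f}")
    print(f"The cosine similarity between vectors is {cosine:.4f}")
    print(f"The Euclidean distance between vectors is {euclidean:.4f}")
    print(f"The angle between vectors is {angle:.4f} degrees")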
Figure 13.12. Transforming our three texts into a normalized matrix. The initial texts appear in the upper-left corner of the figure. These texts share a vocabulary of 15 unique words. We leverage the vocabulary to transform the texts into a matrix of word counts. This count matrix appears in the upper-right corner of the figure. Its three rows correspond to the three texts. Its 15 columns track the word-occurrence count of every word within each text. We'll now normalize these counts by dividing each row by its magnitude. The normalization will produce the matrix in the lower-right corner of the figure. The dot product between any two rows in the normalized matrix will equal the cosine similarity between the corresponding texts. Subsequently, running cos / (2 - cos) will transform the cosine similarity into the Tanimoto similarity.
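The figure's pipeline can be sketched end to end as follows. This is a hedged illustration, not the book's exact code: the three texts are hypothetical stand-ins, and scikit-learn's CountVectorizer is used here to build the word-count matrix, which may differ from how the book constructs it.

import numpy as np
from numpy.linalg import norm
from sklearn.feature_extraction.text import CountVectorizer

texts = ['the first example text',          # hypothetical stand-ins for the
         'the second example text',         # three texts shown in the figure
         'a third rather different text']
count_matrix = CountVectorizer().fit_transform(texts).toarray()  # matrix of word counts
row_magnitudes = norm(count_matrix, axis=1, keepdims=True)
normalized_matrix = count_matrix / row_magnitudes                # divide each row by its magnitude
cosine_matrix = normalized_matrix @ normalized_matrix.T          # row dot products equal cosine similarities
tanimoto_matrix = cosine_matrix / (2 - cosine_matrix)            # cos / (2 - cos) yields Tanimoto similarities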
Listing 13.32. Computing a table of normalized Tanimoto similarities
# tf_vectors and normalized_tanimoto are defined in earlier listings
num_texts = len(tf_vectors)
similarities = np.zeros((num_texts, num_texts))  # all-zeros num_texts x num_texts similarity matrix
unit_vectors = np.array([vector / norm(vector) for vector in tf_vectors])
for i, vector_a in enumerate(unit_vectors):
    for j, vector_b in enumerate(unit_vectors):
        similarities[i][j] = normalized_tanimoto(vector_a, vector_b)

labels = ['Text 1', 'Text 2', 'Text 3']
sns.heatmap(similarities, cmap='YlGnBu', annot=True,
            xticklabels=labels, yticklabels=labels)
plt.yticks(rotation=0)
plt.show()
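Listing 13.32 relies on normalized_tanimoto, which is defined earlier in the book. One possible definition, consistent with the cos / (2 - cos) relationship noted in Figure 13.12, is sketched below; treat it as an assumption rather than the book's exact implementation.

def normalized_tanimoto(vector_a, vector_b):
    # For unit vectors, a.a == b.b == 1, so the Tanimoto denominator
    # (a.a + b.b - a.b) simplifies to 2 - a.b, giving cos / (2 - cos)
    dot_product = vector_a @ vector_b
    return dot_product / (2 - dot_product)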
Figure 13.13. A table of normalized Tanimoto similarities across text pairs. The table's diagonal represents the similarity between each text and itself. Not surprisingly, that similarity is 1. Ignoring the diagonal, we see that texts 1 and 2 share the highest similarity. Meanwhile, texts 2 and 3 share the lowest similarity.