concept Tanimoto similarity in category data

appears as: Tanimoto similarity, The Tanimoto similarity, Tanimoto similarities

This is an excerpt from Manning's book Data Science Bookcamp: Five Python Projects MEAP V04 livebook.

We are comparing Normalized Query Vector and Normalized Title A vector
The Tanimoto similarity between vectors is 1.0000
The cosine similarity between vectors is 1.0000
The Euclidean distance between vectors is 0.0000
The angle between vectors is 0.0000 degrees

We are comparing Normalized Query Vector and Title B Vector
The Tanimoto similarity between vectors is 0.5469
The cosine similarity between vectors is 0.7071
The Euclidean distance between vectors is 0.7654
The angle between vectors is 45.0000 degrees
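The four measurements printed above can be reproduced with a small helper function. The sketch below is an illustration, not the book's code; the function name compare and its parameters are hypothetical, and the general Tanimoto formula a·b / (a·a + b·b − a·b) is used so the function also works for non-normalized vectors.

import numpy as np
from numpy.linalg import norm

def compare(name_a, vector_a, name_b, vector_b):
    print(f"We are comparing {name_a} and {name_b}")
    dot_product = vector_a @ vector_b
    # Tanimoto similarity: dot product divided by the "union" of the two vectors
    tanimoto = dot_product / (vector_a @ vector_a + vector_b @ vector_b - dot_product)
    # Cosine similarity: dot product of the two unit-length vectors
    cosine = dot_product / (norm(vector_a) * norm(vector_b))
    euclidean = norm(vector_a - vector_b)
    # Angle in degrees; clip guards against floating-point drift outside [-1, 1]
    angle = np.degrees(np.arccos(np.clip(cosine, -1.0, 1.0)))
    print(f"The Tanimoto similarity between vectors is {tanimoto:.4f}")
    print(f"The cosine similarity between vectors is {cosine:.4f}")
    print(f"The Euclidean distance between vectors is {euclidean:.4f}")
    print(f"The angle between vectors is {angle:.4f} degrees")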
Figure 13.12. Transforming our three texts into a normalized matrix. The initial texts appear in the upper-left corner of the figure. These texts share a vocabulary of 15 unique words. We leverage the vocabulary to transform the texts into a matrix of word counts. This count matrix appears in the upper-right corner of the figure. Its three rows correspond to the three texts. Its 15 columns track the word-occurrence count of every word within each text. We'll now normalize these counts by dividing each row by its magnitude. The normalization will produce the matrix in the lower-right corner of the figure. The dot product between any two rows in the normalized matrix will equal the cosine similarity between the corresponding texts. Subsequently, running cos / (2 - cos) will transform the cosine similarity into the Tanimoto similarity.
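The figure's pipeline can be sketched end to end as follows. This is a hedged illustration, not the book's exact code: the three texts are hypothetical stand-ins, and scikit-learn's CountVectorizer is used here to build the word-count matrix, which may differ from how the book constructs it.

import numpy as np
from numpy.linalg import norm
from sklearn.feature_extraction.text import CountVectorizer

texts = ['the first example text',          # hypothetical stand-ins for the
         'the second example text',         # three texts shown in the figure
         'a third rather different text']
count_matrix = CountVectorizer().fit_transform(texts).toarray()  # matrix of word counts
row_magnitudes = norm(count_matrix, axis=1, keepdims=True)
normalized_matrix = count_matrix / row_magnitudes                # divide each row by its magnitude
cosine_matrix = normalized_matrix @ normalized_matrix.T          # row dot products equal cosine similarities
tanimoto_matrix = cosine_matrix / (2 - cosine_matrix)            # cos / (2 - cos) yields Tanimoto similarities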
Listing 13.32. Computing a table of normalized Tanimoto similarities
# tf_vectors and normalized_tanimoto are defined in earlier listings
num_texts = len(tf_vectors)
similarities = np.zeros((num_texts, num_texts))  # all-zeros num_texts x num_texts similarity matrix
unit_vectors = np.array([vector / norm(vector) for vector in tf_vectors])
for i, vector_a in enumerate(unit_vectors):
    for j, vector_b in enumerate(unit_vectors):
        similarities[i][j] = normalized_tanimoto(vector_a, vector_b)

labels = ['Text 1', 'Text 2', 'Text 3']
sns.heatmap(similarities, cmap='YlGnBu', annot=True,
            xticklabels=labels, yticklabels=labels)
plt.yticks(rotation=0)
plt.show()
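Listing 13.32 relies on normalized_tanimoto, which is defined earlier in the book. One possible definition, consistent with the cos / (2 - cos) relationship noted in Figure 13.12, is sketched below; treat it as an assumption rather than the book's exact implementation.

def normalized_tanimoto(vector_a, vector_b):
    # For unit vectors, a.a == b.b == 1, so the Tanimoto denominator
    # (a.a + b.b - a.b) simplifies to 2 - a.b, giving cos / (2 - cos)
    dot_product = vector_a @ vector_b
    return dot_product / (2 - dot_product)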
Figure 13.13. A table of normalized Tanimoto similarities across text pairs. The table's diagonal represents the similarity between each text and itself. Not surprisingly, that similarity is 1. Ignoring the diagonal, we see that texts 1 and 2 share the highest similarity. Meanwhile, texts 2 and 3 share the lowest similarity.