concept cosine similarity in category machine learning
appears as: cosine similarity, cosine similarity

This is an excerpt from Manning's book Human-in-the-Loop Machine Learning MEAP V09.
Figure 4.5: An example of a clustering algorithm using cosine similarity. For each cluster, the center is defined as a vector from the origin, and membership in that cluster is determined by the angle between the vector representing the cluster center and the vector representing the item.
You can think of cosine similarity in terms of looking at stars in the night sky. If you drew a straight line from yourself toward each of two stars and measured the angle between those lines, that angle would give you the cosine similarity. In the night-sky example there are only three physical dimensions, but in your data there is one dimension for each feature. Cosine similarity is not immune to the problems of high dimensionality, but it tends to perform better than Euclidean distance, especially for sparse data like our text encodings.
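To make the angle intuition concrete, here is a minimal sketch of cosine similarity between two feature vectors, using only the standard library (the function name `cosine_similarity` is mine, not from the book's codebase):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same direction score 1.0, regardless of length;
# orthogonal vectors score 0.0.
print(cosine_similarity([1, 0, 1], [2, 0, 2]))  # 1.0
print(cosine_similarity([1, 0], [0, 1]))        # 0.0
```

Note that length drops out entirely: only direction matters, which is why cosine similarity behaves well on sparse text encodings where document length varies widely.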
```python
def get_cluster_samples(self, data, num_clusters=5, max_epochs=5,
                        limit=5000, verbose=False):
    """Create clusters using cosine similarity

    Keyword arguments:
        data -- data to be clustered
        num_clusters -- the number of clusters to create
        max_epochs -- maximum number of epochs to create clusters
        limit -- sample only this many items for faster clustering (-1 = no limit)

    Creates clusters by the K-Means clustering algorithm,
    using cosine similarity instead of the more common Euclidean distance.

    Creates clusters until converged or max_epochs passes over the data.
    """
    if limit > 0:
        shuffle(data)
        data = data[:limit]

    cosine_clusters = CosineClusters(num_clusters)

    cosine_clusters.add_random_training_items(data)  #A

    for i in range(0, max_epochs):
        print("Epoch " + str(i))
        added = cosine_clusters.add_items_to_best_cluster(data)  #B
        if added == 0:
            break

    centroids = cosine_clusters.get_centroids()  #C
    outliers = cosine_clusters.get_outliers()  #D
    randoms = cosine_clusters.get_randoms(3, verbose)  #E

    return centroids + outliers + randoms
```

- A: Initialize clusters with random assignments
- B: Move each item to the cluster that it is the best fit for, and repeat
- C: Sample the best-fit item (centroid) from each cluster
- D: Sample the biggest outlier in each cluster
- E: Sample three random items from each cluster, and pass the `verbose` parameter to get an intuition for what is in each cluster
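The heart of step B is an assignment pass: each item moves to whichever cluster center it has the highest cosine similarity with. The book's `CosineClusters` class encapsulates this; the following is a simplified stand-in (the function names `cosine_similarity` and `assign_to_best_cluster` are mine) showing one such pass over fixed centers:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def assign_to_best_cluster(items, centers):
    """One assignment pass: each item joins the center it is most similar to."""
    clusters = [[] for _ in centers]
    for item in items:
        best = max(range(len(centers)),
                   key=lambda c: cosine_similarity(item, centers[c]))
        clusters[best].append(item)
    return clusters

items = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
clusters = assign_to_best_cluster(items, centers=[[1, 0], [0, 1]])
# Items near [1, 0] land in the first cluster, items near [0, 1] in the second.
```

In full K-Means the centers would then be recomputed from the new memberships and the pass repeated until no item moves or `max_epochs` is reached, which is exactly the loop in `get_cluster_samples` above.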