Chapter 7. Finding similarities among users and among content

Similarity can be calculated in many ways, and we'll look at most of them. In this chapter:

  • You’ll gain an understanding of similarity and its close cousin, distance.
  • You’ll look at how to calculate similarity between sets of items.
  • With similarity functions, you’ll measure how alike two users are, using the ratings they’ve given to content.
  • It sometimes helps to group users, so you’ll do that using the k-means clustering algorithm.

Chapter 6 described non-personalized recommendations and association rules. Association rules connect content based on what's consumed together, without looking at what the items are or at the tastes of the individual users who consumed them. Personalized recommendations, in contrast, almost always involve calculating similarity. An example of such a recommendation is Netflix's More Like This row, shown in figure 7.1, which uses an algorithm to find content similar to the title you're viewing.

Figure 7.1. More Like This personalized recommendations on Netflix based on the TV series The Flash
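To make "calculating similarity" concrete before we start, here is a minimal sketch of one similarity function this chapter covers (cosine similarity, section 7.2.3) applied to two users' rating vectors. The function and the ratings are illustrative assumptions, not code from the MovieGEEKs site.

import math

def cosine_similarity(ratings_a, ratings_b):
    # Cosine of the angle between two rating vectors:
    # the dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(ratings_a, ratings_b))
    norm_a = math.sqrt(sum(a * a for a in ratings_a))
    norm_b = math.sqrt(sum(b * b for b in ratings_b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a user with no ratings isn't similar to anyone
    return dot / (norm_a * norm_b)

# Hypothetical ratings of the same five movies by two users (0 = not rated).
user1 = [4, 5, 0, 3, 1]
user2 = [5, 4, 1, 3, 0]
print(cosine_similarity(user1, user2))  # about 0.96: the two users' tastes look alike

The output close to 1.0 says the two users rate movies in a similar pattern; a value near 0 would say their ratings have little in common. The rest of the chapter looks at this and several other ways of putting a number on "how alike".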

7.1. Why similarity?

7.1.1. What’s a similarity function?

7.2. Essential similarity functions

7.2.1. Jaccard distance

7.2.2. Measuring distance with Lp-norms

7.2.3. Cosine similarity

7.2.4. Finding similarity with Pearson’s correlation coefficient

7.2.5. Test running a Pearson similarity

7.2.6. Pearson correlation is similar to cosine

7.3. k-means clustering

7.3.1. The k-means clustering algorithm

7.3.2. Translating k-means clustering into Python

7.4. Implementing similarities

7.4.1. Implementing the similarity in the MovieGEEKs site

Summary