Chapter 10. Content-based filtering

 

This chapter is all about content and users’ tastes:

  • You’ll be introduced to content-based filtering.
  • You’ll learn how to construct user and content profiles.
  • You’ll learn to extract information from descriptions using term frequency-inverse document frequency (TF-IDF) and latent Dirichlet allocation (LDA) to create content profiles.
  • You’ll implement content-based filtering using descriptions of films in the MovieGEEKs site.

In previous chapters, you saw that it’s possible to create recommendations by focusing only on the interactions between users and content (for example, shopping basket analysis or collaborative filtering). Although those approaches work nicely, what about the things that you know about the content itself? For a movie, that can include metadata such as genres, actors, and directors. On other sites, it can be things such as clothing sizes and colors, or engine sizes for cars. Can you call a recommender system good if it doesn’t take those things into account?

The answer is “YES!”, as you’ve seen in the previous chapters, but it still seems as if you’re losing out on useful information. I’ll try to make up for that in this chapter, which covers what you know about content and users’ tastes.
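
To make the idea concrete before going further, here is a minimal sketch of how content metadata alone can rank similar items. Everything in it is an assumption for illustration: the film titles and genre tags are made up, the function names are hypothetical, and Jaccard similarity is only one simple way to compare sets of features, not necessarily the measure used later in the chapter.

```python
# Hypothetical catalog: the titles and genre tags are made up for illustration.
catalog = {
    "Space Raiders": {"sci-fi", "action", "adventure"},
    "Star Outlaws": {"sci-fi", "action"},
    "Quiet Hearts": {"romance", "drama"},
}


def jaccard(a, b):
    """Share of features two items have in common (0.0 to 1.0)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(title, items):
    """Rank every other item by feature overlap with `title`."""
    target = items[title]
    scores = [
        (other, jaccard(target, features))
        for other, features in items.items()
        if other != title
    ]
    # Highest overlap first.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


ranking = most_similar("Space Raiders", catalog)
for other, score in ranking:
    print(f"{other}: {score:.2f}")
```

Notice that no user interactions are involved: the ranking comes purely from what you know about the items, which is the kind of information the earlier chapters left unused.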

10.1. Descriptive example

10.2. Content-based filtering

10.3. Content analyzer

10.3.1. Feature extraction for the item profile

10.3.2. Categorical data with small numbers

10.3.3. Converting the year to a comparable feature

10.4. Extracting metadata from descriptions

10.4.1. Preparing descriptions

10.5. Finding important words with TF-IDF

10.6. Topic modeling using the LDA

Generative model example

Generating the topics

Gibbs sampling

LDA model

The corpus Wikipedia

Adding features and tags to documents