Chapter 3. Representing recommender data
This chapter covers
- How Mahout represents recommender data
- DataModel implementations and usage
- Handling data without preference values
The quality of recommendations is largely determined by the quantity and quality of data. “Garbage in, garbage out” has never been more true than here. High-quality data is essential, and, in general, having more of it also helps.
Recommender algorithms are data-intensive by nature; their computations access a great deal of information. Runtime performance is therefore greatly affected by the quantity of data and its representation. A well-chosen data structure can change performance by orders of magnitude, and that difference matters most at scale.
This chapter explores Mahout’s key classes for representing and accessing recommender-related data. You’ll get a better sense of why Mahout represents users, items, and their associated preferences the way it does: for efficiency and scalability. This chapter also looks in detail at the key abstraction in Mahout that provides access to this data: a DataModel.
Finally, we look at problems and opportunities that arise when user and item data has no concept of ratings or preference values. These so-called Boolean preferences require special handling.
The first section introduces the basic unit of recommender data, a user-item preference.