Chapter 6. Distributing recommendation computations
This chapter covers
- Analyzing a massive data set from Wikipedia
- Producing recommendations with Hadoop and distributed algorithms
- Pseudo-distributing existing nondistributed recommenders
This book has looked at increasingly large data sets: from tens of preferences, to 100,000, to 10 million, and then 17 million. But even that is only medium-sized in the world of recommenders. This chapter ups the ante again by tackling a larger data set of 130 million preferences, in the form of article-to-article links from Wikipedia’s massive corpus.[1] In this data set, the articles act as both the users and the items, which also demonstrates how recommenders, and Mahout, can be usefully applied in less conventional contexts.
1 Readers of earlier drafts will recall that the subject of this chapter was the Netflix Prize data set. That data set is no longer officially distributed, for legal reasons, and so is no longer a suitable example data set.
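To make this unconventional user-item mapping concrete, here is a minimal sketch of how such link data could be fed to the nondistributed recommender components seen in earlier chapters, assuming the links have already been extracted to a comma-separated file of sourceArticleID,targetArticleID pairs; the file name links.csv and the article ID 12345 are hypothetical. Because a link either exists or doesn’t, the data is naturally Boolean, so a log-likelihood similarity and a Boolean-preference recommender are a reasonable pairing. As the rest of this chapter explains, this single-machine approach won’t stretch to 130 million preferences.

import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericBooleanPrefUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

class WikipediaLinkRecommenderSketch {
  public static void main(String[] args) throws Exception {
    // Each line is "sourceArticleID,targetArticleID"; with no preference value,
    // FileDataModel treats the data as Boolean (associated / not associated).
    DataModel model = new FileDataModel(new File("links.csv")); // hypothetical file

    // Log-likelihood similarity works well with Boolean preference data.
    UserSimilarity similarity = new LogLikelihoodSimilarity(model);
    UserNeighborhood neighborhood =
        new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericBooleanPrefUserBasedRecommender(model, neighborhood, similarity);

    // Recommend 5 articles that article 12345 (hypothetical ID) might link to.
    List<RecommendedItem> recommendations = recommender.recommend(12345L, 5);
    for (RecommendedItem item : recommendations) {
      System.out.println(item);
    }
  }
}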
Although 130 million preferences is still a manageable size for demonstration purposes, it’s of such a scale that a single machine would have trouble computing recommendations from it in the way we’ve presented so far. It calls for a new species of recommender algorithm, using a distributed computing approach from Mahout based on the MapReduce paradigm and Apache Hadoop.
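As a taste of what such a distributed approach looks like, the following is a minimal sketch of a first MapReduce step: a Hadoop mapper that turns one line of link data into (user, item) pairs, where the "user" is the linking article and each "item" is an article it links to. This is an illustrative sketch, not the exact classes Mahout ships; it assumes input text lines of the hypothetical form "sourceArticleID: targetArticleID targetArticleID ...".

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Illustrative mapper: converts one line of Wikipedia link data into
 * (sourceArticleID, targetArticleID) pairs, the raw material from which
 * later MapReduce stages can build recommendations.
 */
class WikipediaLinksToPrefsMapper
    extends Mapper<LongWritable, Text, LongWritable, LongWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    int colon = line.indexOf(':');
    if (colon < 0) {
      return; // skip malformed lines
    }
    long sourceArticleID = Long.parseLong(line.substring(0, colon).trim());
    for (String target : line.substring(colon + 1).trim().split("\\s+")) {
      if (!target.isEmpty()) {
        // One Boolean "preference": the source article links to the target article.
        context.write(new LongWritable(sourceArticleID),
                      new LongWritable(Long.parseLong(target)));
      }
    }
  }
}

A mapper like this runs in parallel across many splits of the input file on many machines, which is exactly what makes a data set of this size tractable; later sections develop the full pipeline of mappers and reducers that turns these pairs into recommendations.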