Chapter 6. Distributing recommendation computations


This chapter covers

  • Analyzing a massive data set from Wikipedia
  • Producing recommendations with Hadoop and distributed algorithms
  • Pseudo-distributing existing nondistributed recommenders

This book has looked at increasingly large data sets: from tens of preferences, to 100,000, to 10 million, and then 17 million. Even that is only medium-sized in the world of recommenders. This chapter ups the ante again by tackling a larger data set of 130 million preferences, in the form of article-to-article links from Wikipedia’s massive corpus.[1] In this data set, the articles act as both the users and the items: an article that links to another expresses, in effect, a preference for it. This also demonstrates how recommenders can be usefully applied, with Mahout, to less conventional contexts.

1 Readers of earlier drafts will recall that the subject of this chapter was the Netflix Prize data set. That data set is no longer officially distributed, for legal reasons, and so is no longer a suitable example data set.
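To make the “articles as users and items” idea concrete, a few links could be represented as Boolean preferences using Mahout’s in-memory GenericBooleanPrefDataModel. This is only a sketch with made-up article IDs; at Wikipedia scale an in-memory model like this is exactly what won’t work, which is what motivates the rest of this chapter.

import org.apache.mahout.cf.taste.impl.common.FastByIDMap;
import org.apache.mahout.cf.taste.impl.common.FastIDSet;
import org.apache.mahout.cf.taste.impl.model.GenericBooleanPrefDataModel;
import org.apache.mahout.cf.taste.model.DataModel;

public class ArticleLinksAsPreferences {
  public static void main(String[] args) throws Exception {
    // A link from article A to article B is treated as a Boolean
    // preference: article A (the "user") prefers article B (the "item").
    FastByIDMap<FastIDSet> links = new FastByIDMap<FastIDSet>();

    FastIDSet linkedFrom1 = new FastIDSet();
    linkedFrom1.add(2L);   // article 1 links to article 2
    linkedFrom1.add(3L);   // article 1 links to article 3
    links.put(1L, linkedFrom1);

    FastIDSet linkedFrom2 = new FastIDSet();
    linkedFrom2.add(3L);   // article 2 links to article 3
    links.put(2L, linkedFrom2);

    DataModel model = new GenericBooleanPrefDataModel(links);
    System.out.println("Articles acting as users: " + model.getNumUsers());
  }
}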

Although 130 million preferences is still a manageable size for demonstration purposes, it’s large enough that a single machine would have trouble producing recommendations from it in the way presented so far. It calls for a new kind of recommender algorithm, built on a distributed computing approach from Mahout that uses the MapReduce paradigm and Apache Hadoop.
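As a preview of the distributed approach developed in the sections that follow, the first step is typically a mapper that translates raw link data into user–item pairs. The sketch below assumes a hypothetical input line format like “3: 5 7 9” (a source article ID, a colon, then the IDs of the articles it links to); the class name and format are illustrative, not the implementation used later in the chapter.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.VLongWritable;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: turns one line of a "fromID: toID toID ..." links
// dump into (fromID, toID) pairs -- the distributed analogue of reading
// preferences into an in-memory DataModel.
public class WikipediaLinkMapper
    extends Mapper<LongWritable, Text, VLongWritable, VLongWritable> {

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    int colon = line.indexOf(':');
    if (colon < 0) {
      return;                       // skip malformed lines
    }
    long fromID = Long.parseLong(line.substring(0, colon).trim());
    for (String to : line.substring(colon + 1).trim().split("\\s+")) {
      if (!to.isEmpty()) {
        // Emit one (source article, linked article) pair per outgoing link
        context.write(new VLongWritable(fromID),
                      new VLongWritable(Long.parseLong(to)));
      }
    }
  }
}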

6.1. Analyzing the Wikipedia data set

6.2. Designing a distributed item-based algorithm

6.3. Implementing a distributed algorithm with MapReduce

6.4. Running MapReduces with Hadoop

6.5. Pseudo-distributing a recommender

6.6. Looking beyond first steps with recommendations

6.7. Summary
