Chapter 12. Real-world applications of clustering

 

This chapter covers

  • Clustering like-minded people on Twitter
  • Suggesting tags for an artist on Last.fm using clustering
  • Creating a related-posts feature for a website

You probably picked up this book to learn how clustering can be applied to real-world problems. So far we’ve mostly focused on clustering the Reuters news data set, which has around 20,000 documents, each about 1,000 to 2,000 words long. A data set of that size isn’t challenging enough for Mahout to show its ability to scale. In this chapter, we use clustering to solve three interesting problems on much larger data sets.

First, we use clustering on public tweets from Twitter (http://twitter.com) to find people who tweet alike. Second, we examine a data set from Last.fm (http://last.fm), a popular Internet radio website, and try to generate related tags from the data. Finally, we take the full data dump of a popular technology discussion website, Stack Overflow (http://stackoverflow.com), which has around 500,000 questions and 200,000 users. We use this data set to implement a related-posts feature for the website.

Our first problem is finding similar users by clustering tweets from Twitter.

12.1. Finding similar users on Twitter

12.2. Suggesting tags for artists on Last.fm

12.3. Analyzing the Stack Overflow data set

12.4. Summary