Chapter 7. Classification, categorization, and tagging

In this chapter

Learn the basic concepts behind classification, categorization, and tagging
Discover how categorization is used in text applications
Build, train, and evaluate classifiers using open source tools
Integrate categorization into a search application
Build a tag recommendation engine trained using tagged data

Chances are you’ve encountered keyword tags somewhere among the websites you’ve visited. Photos, videos, music, news or blog articles, and tweets are frequently accompanied by words and phrases that provide a quick description of the content you’re viewing and a link to related items. You’ve probably seen tag clouds: displays of different-sized words that represent someone’s favorite discussion topics, movie genres, or musical styles. Tags are everywhere on the web and are used as navigation devices or to organize everything from news to bookmarks (see figure 7.1).

Figure 7.1. Tags used in a Twitter post. Hashtags, which start with the # character, mark keywords in a tweet, whereas tags that reference other users start with the @ character.

Tags are data about data, otherwise referred to as metadata. They can be applied to any sort of content and come in a variety of forms, from a simple list of relevant keywords or usernames to highly structured properties such as height, weight, and eye color.
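
To make the two ends of that range concrete, here is a minimal Python sketch. It pulls hashtags and user mentions out of a piece of text, then shows a structured set of properties for comparison. The regular expressions are a deliberate simplification rather than Twitter's actual tokenization rules, and the function name, sample tweet, and property values are all hypothetical, invented only for illustration.

```python
import re

# Simplified patterns (an assumption for this sketch): a tag is the marker
# character followed by one or more letters, digits, or underscores.
HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")


def extract_tags(text):
    """Return the hashtags and user mentions found in a piece of text."""
    return {
        "hashtags": HASHTAG.findall(text),
        "mentions": MENTION.findall(text),
    }


# Hypothetical tweet, made up for this example.
tweet = "Reading about #classification and #tagging with @example_user"
tags = extract_tags(tweet)

# Unstructured metadata: plain lists of keywords and usernames.
print(tags["hashtags"])
print(tags["mentions"])

# Structured metadata: named properties with typed values (made-up values).
structured = {"height_cm": 180, "weight_kg": 75, "eye_color": "brown"}
print(structured["eye_color"])
```

The keyword lists carry no schema, so any word can appear in them, whereas each structured property has a fixed name and an expected type of value.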

7.1. Introduction to classification and categorization

7.2. The classification process

7.3. Building document categorizers using Apache Lucene

7.4. Training a naive Bayes classifier using Apache Mahout

7.5. Categorizing documents with OpenNLP

7.6. Building a tag recommender using Apache Solr

7.7. Summary

7.8. References
