9 Topic analysis

 

This chapter covers

  • Implementing a supervised approach to topic classification with scikit-learn
  • Using multiclass classification for NLP tasks
  • Discovering topics in an unsupervised way
  • Implementing an unsupervised approach—clustering with scikit-learn

In this chapter, you will learn how to automatically detect topics in text, either by selecting from a set of known topics or by discovering new, previously unseen ones. This is a challenging and practically useful task that can be approached from different perspectives using a variety of methods. This chapter introduces new techniques, some of which are closely related to ones you have used before. Let’s put this task in a broader context before diving into the implementation details.

Previous chapters presented a number of NLP applications that required you to build a machine-learning model that can classify text. Let’s summarize them here:

  • In chapter 2, you looked into how to build your own spam filter that can classify incoming emails as spam or ham.
  • In chapters 5 and 6, you developed an author-identification tool that can detect whether a text was written by one of a set of known authors (e.g., Jane Austen or William Shakespeare, or one of your contacts, should you wish to apply this tool to your own data).
  • In chapters 7 and 8, you learned how to build a sentiment analyzer that can classify a text (e.g., a review) as expressing a positive or a negative opinion.
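
As a preview of the two routes this chapter takes, selecting from known topics and discovering unseen ones, here is a minimal sketch with scikit-learn. It is an illustration only and not the chapter's own implementation: the four-sentence corpus, the topic names, and the variable names are invented for this example, and the choice of a count-based Naïve Bayes pipeline and of k-means over TF-IDF vectors is an assumption about how such a contrast might be set up.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy corpus invented for this sketch: two sports texts, then two tech texts.
texts = [
    "football team won the football match",
    "the team scored a goal in the match",
    "new laptop runs fast software",
    "the phone software runs on a fast processor",
]
labels = ["sports", "sports", "tech", "tech"]

# Supervised route: learn from labeled texts, then choose among known topics.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(texts, labels)
predicted = classifier.predict([
    "the team scored in the football match",
    "fast laptop with new software",
])

# Unsupervised route: no labels are used; texts are grouped by shared words.
vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

print(predicted)  # one topic name per new text
print(clusters)   # one cluster id per original text
```

Note that the classifier answers with one of the topic names it was trained on, whereas the clustering algorithm returns only arbitrary cluster ids: which id corresponds to which topic, and what the topic should be called, is left for you to interpret.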

9.1 Topic classification as a supervised machine-learning task

9.1.1 Data

9.1.2 Topic classification with Naïve Bayes

9.1.3 Evaluation of the results

9.2 Topic discovery as an unsupervised machine-learning task

9.2.1 Unsupervised ML approaches

9.2.2 Clustering for topic discovery

9.2.3 Evaluation of the topic clustering algorithm

Summary

Solutions to miscellaneous exercises