In this chapter, you will learn how to automatically detect topics in text, either by selecting from a set of known topics or by discovering new, previously unseen ones. This is a challenging and practically useful task that can be approached from different perspectives using a variety of methods. This chapter will introduce new techniques, some of which are closely related to ones you have used before. Let’s put this task in a broader context before diving into the implementation details.
Previous chapters presented a number of NLP applications that required you to build a machine-learning model that can classify text. Let’s summarize them here:
- In chapter 2, you looked into how to build your own spam filter that can classify incoming emails as spam or ham.
- In chapters 5 and 6, you developed an author-identification tool that can detect whether a text is written by one of the known authors (e.g., Jane Austen or William Shakespeare, or one of your contacts should you wish to apply this tool to your own data).
- In chapters 7 and 8, you learned how to build a sentiment analyzer that can classify a text (e.g., a review) as expressing a positive or a negative opinion.