This section covers:
- Vectorizing texts using Scikit-Learn
- Reducing the dimensionality of vectorized text data
- Clustering large text datasets
- Visualizing text clusters
- Concurrently displaying multiple visualizations
Our previous discussions of NLP techniques focused on toy examples and small datasets. In this section, we will finally apply NLP to large collections of real-world texts. This type of analysis is seemingly straightforward, given the techniques presented thus far. For example, suppose we’re doing market research across multiple online discussion forums. Each forum is composed of hundreds of users who discuss a specific topic, such as politics, fashion, technology, or cars. We want to automatically extract all the discussion topics based on the contents of the user conversations. These extracted topics will be used to plan a marketing campaign, which will target users based on their online interests.
How do we cluster user discussions into topics? One approach would be to do the following:
- Convert all discussion texts into a matrix of word counts, using the techniques discussed in Section Thirteen.
- Dimensionally reduce the word-count matrix using SVD. This will allow us to efficiently compute all-pairs text similarities with matrix multiplication.
- Utilize the matrix of text-similarities to cluster the discussions into topics.
- Explore the topic clusters in order to identify useful topics for our marketing campaign.
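The first three steps above can be sketched in a few lines of Scikit-Learn. This is only a minimal illustration under stated assumptions, not the approach developed in the rest of this section: the four texts are hypothetical stand-ins for real forum posts, the choice of two SVD components is arbitrary, and DBSCAN with a precomputed distance matrix is just one convenient way to cluster directly from the similarities.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import normalize

# Hypothetical toy posts standing in for real forum discussions.
texts = [
    "The new car engine has great horsepower",
    "This car engine needs more horsepower",
    "The election vote favored the senate candidate",
    "The senate candidate lost the election vote",
]

# Step 1: convert the texts into a sparse matrix of word counts.
vectorizer = CountVectorizer(stop_words="english")
count_matrix = vectorizer.fit_transform(texts)

# Step 2: reduce the matrix with SVD, then scale each row to unit length.
# The number of components (2) is an assumption chosen for this tiny example.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced = normalize(svd.fit_transform(count_matrix))

# With unit-length rows, one matrix multiplication yields every pairwise
# cosine similarity.
similarities = reduced @ reduced.T

# Step 3: cluster using the similarities. DBSCAN expects distances, so we
# convert each similarity s into 1 - s and clip tiny negative rounding errors.
distances = np.clip(1 - similarities, 0, None)
labels = DBSCAN(eps=0.5, min_samples=2, metric="precomputed").fit_predict(distances)
print(labels)
```

In this toy run, the two car-related posts should share one cluster label and the two politics-related posts another. The `eps` and `min_samples` values are illustrative guesses that happen to suit four short texts; a real dataset would require tuning them.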