
15 NLP Analysis of Large Text Datasets

 

This section covers:

  • Vectorizing texts using Scikit-Learn
  • Dimensionally reducing vectorized text data
  • Clustering large text datasets
  • Visualizing text clusters
  • Concurrently displaying multiple visualizations

Our previous discussions of NLP techniques focused on toy examples and small datasets. In this section, we will finally apply NLP to large collections of real-world texts. Given the techniques presented thus far, this type of analysis seems straightforward. For example, suppose we’re doing market research across multiple online discussion forums. Each forum is composed of hundreds of users who discuss a specific topic, such as politics, fashion, technology, or cars. We want to automatically extract all the discussion topics based on the contents of the user conversations. These extracted topics will be used to plan a marketing campaign that targets users based on their online interests.

How do we cluster user discussions into topics? One approach would be to do the following:

  1. Convert all discussion texts into a matrix of word counts, using the techniques discussed in Section 13.
  2. Dimensionally reduce the word-count matrix using SVD. This will allow us to efficiently compute all pairwise text similarities with matrix multiplication.
  3. Utilize the matrix of text similarities to cluster the discussions into topics.
  4. Explore the topic clusters to identify useful topics for our marketing campaign. (A brief code sketch combining these four steps appears after this list.)
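
Before we tackle a real dataset, here is a minimal end-to-end sketch of these four steps on a handful of placeholder documents. The documents list, the two SVD components, and the two clusters are illustrative assumptions, not the settings we will use on the full forum data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize
from sklearn.cluster import KMeans

# Hypothetical stand-ins for real forum posts.
documents = ["I love my new sports car",
             "The senate passed a new bill",
             "My car needs an oil change",
             "The election results were contested"]

# Step 1: convert the texts into a matrix of word counts.
word_counts = CountVectorizer().fit_transform(documents)

# Step 2: dimensionally reduce the count matrix with SVD.
reduced = TruncatedSVD(n_components=2).fit_transform(word_counts)

# Step 3: after normalization, one matrix product yields every
# pairwise cosine similarity between documents.
normalized = normalize(reduced)
similarities = normalized @ normalized.T

# Step 4: cluster the documents into topics (here with k-means on the
# normalized vectors; the chapter's own clustering choice may differ).
clusters = KMeans(n_clusters=2).fit_predict(normalized)
print(similarities.round(2))
print(clusters)
```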

15.1  Loading Online Forum Discussions Using Scikit-Learn
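
Scikit-learn does not ship raw forum scrapes, but its bundled 20 Newsgroups corpus is a collection of online discussion posts that can stand in for the forum data described above. The snippet below is a minimal loading sketch assuming that corpus is the dataset analyzed in this section.

```python
from sklearn.datasets import fetch_20newsgroups

# Download (or load from cache) the 20 Newsgroups discussion posts,
# stripping headers, footers, and quoted replies so only the post
# text itself remains.
newsgroups = fetch_20newsgroups(remove=('headers', 'footers', 'quotes'))
print(len(newsgroups.data))          # number of posts
print(newsgroups.target_names[:5])   # a few of the forum (newsgroup) names
```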

15.2  Vectorizing Documents Using Scikit-Learn
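
A minimal vectorization sketch using scikit-learn's CountVectorizer, assuming the newsgroups variable holds the posts loaded in section 15.1.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Transform each post into a sparse row of word counts.
# stop_words='english' drops very common words such as "the" and "and".
vectorizer = CountVectorizer(stop_words='english')
word_count_matrix = vectorizer.fit_transform(newsgroups.data)

print(word_count_matrix.shape)   # (number of posts, vocabulary size)
words = vectorizer.get_feature_names_out()
print(words[:10])                # first few vocabulary words
```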

15.3  Ranking Words by Both Post-Frequency and Count

15.3.1  Computing TFIDF Vectors with Scikit-Learn
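
A minimal TFIDF sketch, again assuming the newsgroups posts from section 15.1. TfidfVectorizer combines each post's word counts with inverse document frequency, so words that appear in nearly every post are down-weighted.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# TFIDF weights each word count by how rare the word is across posts.
tfidf_vectorizer = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf_vectorizer.fit_transform(newsgroups.data)

# Each row is an L2-normalized TFIDF vector for one post.
print(tfidf_matrix.shape)
```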

15.4  Computing Similarities Across Large Document Datasets
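
One way to make all-pairs similarity computation tractable on thousands of posts is to shrink the TFIDF matrix with truncated SVD and then take a single matrix product of the normalized rows. A minimal sketch, assuming tfidf_matrix from the previous section; the choice of 100 components is an illustrative assumption.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.preprocessing import normalize

# Shrink each TFIDF vector from thousands of dimensions to 100.
svd = TruncatedSVD(n_components=100)
reduced_matrix = svd.fit_transform(tfidf_matrix)

# After normalization, the dot product of two rows equals their cosine
# similarity, so one matrix product yields every pairwise similarity.
# Note: the resulting matrix has one row and column per post, which can
# be large for big datasets.
normalized_matrix = normalize(reduced_matrix)
similarity_matrix = normalized_matrix @ normalized_matrix.T
print(similarity_matrix.shape)   # (number of posts, number of posts)
```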

15.5  Clustering Texts by Topic
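
A minimal clustering sketch, assuming the normalized_matrix of reduced vectors computed above. MiniBatchKMeans scales well to large document collections; the choice of 20 clusters is an illustrative assumption rather than a tuned value.

```python
from sklearn.cluster import MiniBatchKMeans

# Group the posts into topic clusters using the dimensionally reduced
# vectors. The number of clusters (20) is a placeholder; it should be
# chosen by inspecting the data (for example, with an elbow plot).
cluster_model = MiniBatchKMeans(n_clusters=20)
cluster_labels = cluster_model.fit_predict(normalized_matrix)
print(cluster_labels[:10])   # cluster assignments of the first ten posts
```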

15.5.1  Exploring a Single Text Cluster
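
To inspect one cluster, we can gather the posts assigned to it and rank the words with the largest summed TFIDF weight. A minimal sketch, assuming cluster_labels, tfidf_matrix, and tfidf_vectorizer from the previous sections; cluster 0 is an arbitrary choice.

```python
import numpy as np

# Select the rows of the TFIDF matrix belonging to cluster 0.
cluster_id = 0
in_cluster = np.where(cluster_labels == cluster_id)[0]
cluster_tfidf = tfidf_matrix[in_cluster]

# Sum each word's TFIDF weight across the cluster's posts and print the
# ten highest-scoring words as a rough topic summary.
word_scores = np.asarray(cluster_tfidf.sum(axis=0)).ravel()
words = tfidf_vectorizer.get_feature_names_out()
top_words = [words[i] for i in word_scores.argsort()[::-1][:10]]
print(top_words)
```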

15.6  Visualizing Text Clusters

15.6.1  Using Subplots to Display Multiple Word Clouds
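
The snippet below sketches one way to lay out several word clouds in a single figure, assuming the third-party wordcloud package is installed and that cluster_labels, tfidf_matrix, and tfidf_vectorizer are available from earlier sections; the choice of four clusters per figure is illustrative.

```python
import numpy as np
import matplotlib.pyplot as plt
from wordcloud import WordCloud

words = tfidf_vectorizer.get_feature_names_out()
fig, axes = plt.subplots(2, 2, figsize=(10, 8))

# Draw one word cloud per cluster, sizing words by summed TFIDF weight.
for cluster_id, ax in enumerate(axes.flatten()):
    rows = np.where(cluster_labels == cluster_id)[0]
    scores = np.asarray(tfidf_matrix[rows].sum(axis=0)).ravel()
    frequencies = {words[i]: scores[i]
                   for i in scores.argsort()[::-1][:50]}
    cloud = WordCloud(background_color='white').generate_from_frequencies(frequencies)
    ax.imshow(cloud, interpolation='bilinear')
    ax.set_title(f'Cluster {cluster_id}')
    ax.axis('off')

plt.tight_layout()
plt.show()
```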

15.7  Summary