15 NLP analysis of large text datasets
This section covers

  • Vectorizing texts using scikit-learn
  • Dimensionally reducing vectorized text data
  • Clustering large text datasets
  • Visualizing text clusters
  • Concurrently displaying multiple visualizations

Our previous discussions of natural language processing (NLP) techniques focused on toy examples and small datasets. In this section, we apply NLP to large collections of real-world texts. Given the techniques presented thus far, this type of analysis may seem straightforward. For example, suppose we're doing market research across multiple online discussion forums. Each forum is composed of hundreds of users who discuss a specific topic, such as politics, fashion, technology, or cars. We want to automatically extract all the discussion topics based on the contents of the user conversations. These extracted topics will then be used to plan a marketing campaign that targets users based on their online interests.

How do we cluster user discussions into topics? One approach is to do the following:

  1. Vectorize the texts using scikit-learn.
  2. Reduce the dimensionality of the vectorized text data.
  3. Cluster the reduced vectors by topic.
  4. Visualize the resulting text clusters.

15.1 Loading online forum discussions using scikit-learn

15.2 Vectorizing documents using scikit-learn

15.3 Ranking words by both post frequency and count

15.3.1 Computing TFIDF vectors with scikit-learn

15.4 Computing similarities across large document datasets

15.5 Clustering texts by topic

15.5.1 Exploring a single text cluster

15.6 Visualizing text clusters

15.6.1 Using subplots to display multiple word clouds
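The pipeline outlined above can be sketched end to end with scikit-learn. This is a minimal sketch, not the chapter's implementation: a tiny hardcoded corpus stands in for the real forum dataset, and the parameter values (2 SVD components, 2 clusters) are illustrative choices sized to the toy data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

# Toy stand-in for the forum posts (an assumption; the chapter loads
# a large real-world dataset through scikit-learn instead).
posts = [
    "the election results and the new tax policy",
    "voters, elections, and government tax policy",
    "the car engine needs new brake pads",
    "my car's engine and brakes were repaired",
]

# 1. Vectorize: each post becomes a sparse TFIDF vector.
tfidf = TfidfVectorizer(stop_words='english')
tfidf_matrix = tfidf.fit_transform(posts)

# 2. Dimensionally reduce the vectors. TruncatedSVD accepts sparse
# input directly, so no dense conversion is needed.
svd = TruncatedSVD(n_components=2, random_state=0)
reduced_matrix = svd.fit_transform(tfidf_matrix)

# 3. Cluster the reduced vectors by topic with K-means.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = km.fit_predict(reduced_matrix)
print(labels)
```

Each post receives a cluster label; posts that share vocabulary (here, politics terms versus car terms) tend to land in the same cluster. Step 4, visualizing the clusters as word clouds, is covered in section 15.6.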

Summary