5 Basic text mining using generative AI

This chapter covers

Frequency analysis
Co-occurrence analysis
Keyword search
Dictionary-based methods

In previous chapters, you dealt with numerical data and learned the basic analytical methods for translating a bunch of numbers into sound business advice. This chapter and the next will show you how to deal with something far more sinister than numbers: text. Most of the text you'll encounter won't be clean, literary language that an author has double-checked and specialists have edited. More often, you will deal with hastily prepared notes, offhand reviews, and emails. Such data is riddled with errors, including spelling mistakes, typographical and punctuation errors, and irregular capitalization, all of which can significantly degrade the quality of your analysis and its results. Additionally, texts often contain irrelevant or redundant information such as headers, footers, or metadata, as well as linguistic noise from nonstandard abbreviations, slang, or jargon. Just when you think you are prepared to handle all this by adapting your text-cleaning functions, you may encounter another exception, like a piece of ... ASCII art!
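To make the idea of a text-cleaning function concrete, here is a minimal Python sketch that handles a few of the kinds of noise listed above. Everything in it is an assumption made for illustration: the function name `clean_text`, the email-style header pattern, and the tiny abbreviation table are hypothetical and are not taken from this chapter's own code. Real data will almost certainly need more rules than these.

```python
import re

# Hypothetical abbreviation table; extend it for your own data.
ABBREVIATIONS = {"thx": "thanks", "pls": "please"}

# Assumed pattern for email-style metadata lines such as "Subject: ...".
HEADER_PATTERN = re.compile(r"^(from|to|subject|date):", flags=re.IGNORECASE)


def clean_text(raw: str) -> str:
    """Apply a few simple normalization steps to one piece of free-form text."""
    # Drop lines that look like metadata headers.
    lines = [
        line for line in raw.splitlines()
        if not HEADER_PATTERN.match(line.strip())
    ]
    text = " ".join(lines)

    # Lowercase to neutralize irregular capitalization.
    text = text.lower()

    # Collapse runs of the same punctuation mark ("!!!" becomes "!").
    text = re.sub(r"([!?.,])\1+", r"\1", text)

    # Expand known abbreviations, matching whole words only.
    text = re.sub(
        r"\b\w+\b",
        lambda m: ABBREVIATIONS.get(m.group(0), m.group(0)),
        text,
    )

    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "Subject: Order 123\nGREAT   product!!!  Would buy again"
    print(clean_text(sample))
```

Each rule here is deliberately narrow, which is also why such functions tend to need constant adaptation: input that no rule anticipates, such as ASCII art, passes through largely untouched.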

5.1 Text mining in the era of generative AI

5.1.1 Generative AI is a game changer

5.1.2 Beware of AI intimidation

5.1.3 Unpacking the constraints

5.2 Preparing for analysis

5.2.1 Data quality

5.2.2 Customer feedback preparation example

5.3 Frequency analysis

5.3.1 What can we learn from frequency analysis of customer reviews?

5.3.2 Direct frequency analysis with generative AI

5.3.3 Uploading a data file to ChatGPT for frequency analysis

5.3.4 Extracting the most common words

5.3.5 Extracting the most common phrases

5.3.6 Understanding the output

5.4 Co-occurrence analysis

5.4.1 What can we learn from co-occurrence analysis?

5.4.2 Co-occurrence analysis in practice

5.4.3 Understanding the output

5.5 Keyword search

5.5.1 What can we learn from keyword search?

5.5.3 Generating keywords in practice

Summary