5 Basic text mining using generative AI
This chapter covers
- Frequency analysis
- Co-occurrence analysis
- Keyword search
- Dictionary-based methods
In previous chapters, you dealt with numerical data and learned the basic analytical methods for translating a bunch of numbers into sound business advice. This chapter and the next will show you how to deal with something far more sinister than numbers—text. Most of the text you’ll encounter won’t be clean, literary language that an author has double-checked and had edited by specialists. More often, you will deal with hastily prepared notes, offhand reviews, and emails. Such data is riddled with errors that can significantly impact the quality of analysis and results. These include spelling mistakes, typographical and punctuation errors, and irregular use of capitalization. Additionally, texts often contain irrelevant or redundant information such as headers, footers, or metadata, as well as linguistic noise from nonstandard abbreviations, slang, or jargon. Just when you think you are prepared to handle all this by adapting your text-cleaning functions, you may encounter another exception, like a piece of . . . ASCII art!
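To make the kinds of noise listed above concrete, here is a minimal sketch of a text-cleaning function in Python. It is an illustration rather than the method used later in this chapter: the name `clean_text` and the specific rules (dropping lines with no letters or digits, lowercasing, collapsing repeated punctuation and whitespace) are assumptions chosen for the example.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Apply a few simple normalization rules to a noisy text snippet.

    Hypothetical helper for illustration; real data usually needs more rules.
    """
    # Normalize Unicode compatibility forms (e.g., full-width characters).
    text = unicodedata.normalize("NFKC", raw)

    # Drop lines with no letters or digits, such as separator rules
    # or simple symbol-only ASCII art.
    lines = [ln for ln in text.splitlines() if any(ch.isalnum() for ch in ln)]
    text = " ".join(lines)

    # Remove irregular capitalization by lowercasing everything.
    text = text.lower()

    # Collapse runs of the same punctuation mark ("!!!" becomes "!").
    text = re.sub(r"([!?.,])\1+", r"\1", text)

    # Collapse any run of whitespace into a single space.
    return re.sub(r"\s+", " ", text).strip()


if __name__ == "__main__":
    sample = "GREAT   product!!!\n=====\n  would buy AGAIN "
    print(clean_text(sample))
```

Even this small function shows why cleaning rules keep growing: ASCII art that happens to contain letters would pass straight through the line filter, and spelling mistakes or slang are not handled at all.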