chapter four

4 Analyzing text data

This chapter covers

Classifying text
Extracting information
Clustering documents

Text data is ubiquitous and contains valuable information. Think of newspaper articles, emails, reviews, or perhaps this book that you are currently reading! However, analyzing text by computational means was difficult until only a few years ago. After all, unlike formal languages such as Python, natural language was not designed to be easy for computers to parse. The latest generation of language models now enables text analysis at near-human levels for many popular tasks. In some cases, the performance of language models for text analysis and generation has even been shown to surpass, on average, the capabilities of humans [1].

In this chapter, we will see how to use large language models to analyze text. In many ways, analyzing text is a very “natural” application of language models: they have been trained on large amounts of text and can be applied directly to text analysis (i.e., without relying on external tools for the actual data analysis). This chapter covers several popular flavors of text analysis: classifying text documents, extracting tabular data from text, and clustering text documents into groups of semantically similar documents. For each of these use cases, we will see example code and discuss variants and extensions.
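To give a first impression of what "applying a language model directly" can look like, here is a minimal sketch of the classification use case. It only builds the prompt and interprets the answer; the call to the model itself is left out. The function names build_classification_prompt and parse_label, the prompt wording, and the example labels are assumptions made for illustration and are not the code developed later in this chapter.

```python
from typing import List, Optional


def build_classification_prompt(text: str, labels: List[str]) -> str:
    """Assemble a prompt asking a language model to pick one label for the text.

    The prompt wording is an illustrative assumption, not a fixed recipe.
    """
    options = ", ".join(labels)
    return (
        f"Text: {text}\n"
        f"Classify the text above using exactly one of these labels: {options}.\n"
        "Answer:"
    )


def parse_label(answer: str, labels: List[str]) -> Optional[str]:
    """Map the model's raw answer back to a known label, or None if it matches none."""
    # Normalize whitespace, case, and a trailing period before comparing.
    cleaned = answer.strip().lower().rstrip(".")
    for label in labels:
        if cleaned == label.lower():
            return label
    return None


labels = ["positive", "negative"]
prompt = build_classification_prompt("The movie was a delight.", labels)
print(prompt)
```

The prompt string would be sent to a language model of your choice, and the text it returns would be passed through parse_label. Returning None for unexpected answers makes it easy to detect cases where the model did not stick to the requested labels.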

4.1 Preliminaries

4.2 Classification

4.2.1 Overview

4.2.2 Creating prompts

4.2.3 Calling the model

4.2.4 End-to-end classification code

4.2.5 Classifying documents

4.2.6 Running the code

4.2.7 Trying out variants

4.3 Text extraction

4.3.1 Overview

4.3.2 Generating prompts

4.3.3 Post-processing

4.3.4 End-to-end extraction code

4.3.5 Trying it out

4.4 Clustering

4.4.1 Overview

4.4.2 Calculating embeddings

4.4.3 Clustering vectors

4.4.4 End-to-end code for text clustering

4.4.5 Trying it out

4.4.6 Other use cases for embedding vectors

4.5 Summary