chapter four

4 Analyzing text data

This chapter covers

Classifying text
Extracting information
Clustering documents

Text data is ubiquitous and contains valuable information. For instance, think of newspaper articles, emails, reviews, or perhaps this book you are reading! However, until only a few years ago, analyzing text by computational means was difficult. After all, unlike formal languages such as Python, natural language was not designed to be easy for computers to parse. The latest generation of language models enables text analysis at almost human levels for many popular tasks. In some cases, language models have even been shown to outperform humans, on average, at text analysis and generation [1].

In this chapter, we will see how to use large language models to analyze text. In certain ways, analyzing text data is a very “natural” application of language models. They have been trained on large amounts of text and can be applied directly to text analysis (i.e., without relying on external tools for the actual data analysis). This chapter covers several popular flavors of text analysis: classifying text documents, extracting tabular data from text, and clustering text documents into groups that are semantically similar. For each of these use cases, we will see example code and discuss variants and extensions.

4.1 Preliminaries

4.2 Classification

4.2.1 Overview

4.2.2 Creating prompts

4.2.3 Calling the model

4.2.4 End-to-end classification code

4.2.5 Classifying documents

4.2.6 Running the code

4.2.7 Trying out variants

4.3 Text extraction

4.3.1 Overview

4.3.2 Generating prompts

4.3.3 Postprocessing

4.3.4 End-to-end extraction code

4.3.5 Trying it out

4.4 Clustering

4.4.1 Overview