chapter four

4 Evaluating DSPy programs

This chapter covers

An overview of DSPy’s evaluation tools
Defining a custom evaluation metric
Running manual evaluations
Using the DSPy Evaluate class to accelerate evaluation
Testing models for accuracy and consistency

As with any machine learning project, it’s important to evaluate an LM-based application before putting it into production, for two main reasons. The first relates to tuning the model, which DSPy calls optimization: selecting the LM and the prompt that give the best possible results. “Best” may be defined in terms of accuracy, consistency, cost, execution time, or some combination of these and other concerns. We’ll cover optimization in chapter 5, so we won’t look at it here, but note that evaluation is a key part of optimizing an LM query. We may test many combinations of LMs and prompts, and it can be tricky to determine which truly works best. For example, the airline intents classifier we’re currently looking at must classify many utterances, and some prompts may work better than others on particular utterances. Determining which prompt works best overall is challenging, so it’s necessary to evaluate each prompt carefully: set up an automatic testing framework, test each prompt with a reasonably large set of user messages, and automatically determine the strongest prompt.
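To make the idea of an automatic testing framework concrete, here is a minimal sketch in plain Python. It is not DSPy's API. The utterances, intent labels, and the two candidate functions are hypothetical stand-ins invented for illustration; in a real setup each candidate would wrap a call to an LM with a particular prompt.

```python
from typing import Callable

# Hypothetical labeled utterances, invented for illustration only.
TEST_SET = [
    ("show me flights from boston to denver", "flight"),
    ("how much is a ticket to dallas", "airfare"),
    ("what airlines fly into atlanta", "airline"),
    ("list flights leaving chicago tomorrow", "flight"),
]


def exact_match(expected: str, predicted: str) -> bool:
    """Metric: True when the predicted intent equals the expected one."""
    return expected.strip().lower() == predicted.strip().lower()


def accuracy(classify: Callable[[str], str], test_set) -> float:
    """Fraction of test utterances that the classifier labels correctly."""
    correct = sum(exact_match(label, classify(text)) for text, label in test_set)
    return correct / len(test_set)


# Stand-ins for two "LM + prompt" combinations (assumptions, not real LM calls).
def candidate_a(utterance: str) -> str:
    # A weak candidate that always answers with the same intent.
    return "flight"


def candidate_b(utterance: str) -> str:
    # A stronger candidate that reacts to a few keywords.
    if "ticket" in utterance or "much" in utterance:
        return "airfare"
    if "airline" in utterance:
        return "airline"
    return "flight"


CANDIDATES = {"prompt_a": candidate_a, "prompt_b": candidate_b}

# Score every candidate on the same test set, then pick the strongest.
scores = {name: accuracy(fn, TEST_SET) for name, fn in CANDIDATES.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
print("strongest candidate:", best)
```

Every candidate is scored on the same test set with the same metric, so the comparison reflects overall performance rather than a few hand-picked utterances. A real test set would need to be far larger than four messages.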

4.1 Creating a dataset for evaluation

4.1.1 Using the Example class

4.1.2 Dividing the data into train and test sets

4.1.3 The sizes of the sets

4.1.4 Splitting the ATIS data

4.1.5 Using DSPy to generate examples

4.2 Evaluating a module with a test set

4.2.1 Defining a metric for evaluation

4.2.2 Defining a metric for the baseline model

4.2.3 Calculating a final evaluation for the module

4.2.4 Evaluation for tuning versus for a final evaluation

4.2.5 Testing the metric function

4.3 Evaluating the DSPy baseline model

4.3.1 Using custom Python code for evaluation

4.3.2 Using DSPy Evaluate

4.3.3 Rate limits

4.4 Evaluating a manually created prompt using the OpenAI API

4.5 Evaluating the per-class performance

4.6 Evaluating the consistency of responses

4.7 Evaluating other LMs and modules

4.8 Summary