4 Evaluating DSPy programs
This chapter covers
- An overview of DSPy’s evaluation tools
- Defining a custom evaluation metric
- Running manual evaluations
- Using the DSPy Evaluate class to accelerate evaluation
- Testing models for accuracy and consistency
As with any machine learning project, it’s important to evaluate an LM-based application before putting it into production, for two main reasons. The first relates to tuning the model, which DSPy calls optimization: selecting the LM and the prompt that give the best possible results. “Best” may be defined in terms of accuracy, consistency, cost, execution time, or some combination of these and other concerns. We’ll cover optimization in chapter 5, so we won’t look at it here, but note that evaluation is a key part of optimizing an LM query. We may test many combinations of LMs and prompts, and it can be tricky to determine which truly works best. For example, the airline intents classifier we’re currently looking at must classify many utterances, and a given prompt may work better for some utterances than for others. Determining which prompt works best overall is therefore challenging: we need to evaluate each prompt carefully by setting up an automated testing framework, running each prompt against a reasonably large set of user messages, and automatically determining the strongest prompt.
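The testing loop described above can be sketched in plain Python before any DSPy tooling is involved. This is only an illustrative sketch: the example messages, the intent labels, and the two functions `variant_a` and `variant_b` are hypothetical stand-ins for an LM queried with two different prompts, and exact-match accuracy is assumed as the metric.

```python
from typing import Callable

# Hypothetical labeled examples: (user message, expected intent).
EXAMPLES = [
    ("I want to book a flight to Denver", "book_flight"),
    ("Please cancel my reservation", "cancel_flight"),
    ("Where is my suitcase?", "baggage"),
    ("Can I get a ticket to Boston?", "book_flight"),
]


def variant_a(message: str) -> str:
    """Stand-in for an LM called with one prompt (keyword rules, not a real model)."""
    text = message.lower()
    if "cancel" in text:
        return "cancel_flight"
    if "book" in text:
        return "book_flight"
    return "baggage"


def variant_b(message: str) -> str:
    """Stand-in for the same LM called with a second prompt."""
    text = message.lower()
    if "cancel" in text:
        return "cancel_flight"
    if "suitcase" in text or "bag" in text:
        return "baggage"
    return "book_flight"


def accuracy(classify: Callable[[str], str], examples: list[tuple[str, str]]) -> float:
    """Fraction of examples whose predicted intent exactly matches the expected one."""
    correct = sum(classify(message) == expected for message, expected in examples)
    return correct / len(examples)


# Score every variant on the same examples, then pick the highest scorer.
variants = {"variant_a": variant_a, "variant_b": variant_b}
scores = {name: accuracy(fn, EXAMPLES) for name, fn in variants.items()}
best = max(scores, key=scores.get)

print(scores)
print("Best variant:", best)
```

Each variant handles some messages well and others poorly, so only the aggregate score over the whole example set shows which one is stronger overall. In practice the example set would be far larger than four messages.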