
Conversation

@TomeHirata (Collaborator) commented on Mar 24, 2025

In this PR, we change the output interface of Evaluate.__call__.
Instead of returning score, (score, outputs), or (score, scores, outputs) depending on the arguments, it now always returns an EvaluationResult containing the following fields:

  • score: a float percentage score (e.g., 67.30) representing overall performance
  • results: a list of (example, prediction, score) tuples, one per example in the devset

Since this is a breaking change, it should go out in the next minor release rather than a patch release.

Resolve mlflow/mlflow#15476
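
For reference, a minimal sketch of how the new return value is consumed; the program, devset, metric, and LM below are illustrative placeholders rather than part of this PR:

```python
import dspy

# Placeholder LM and program; any configured LM / DSPy module works the same way.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
program = dspy.Predict("question -> answer")

devset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
]

evaluate = dspy.Evaluate(devset=devset, metric=dspy.evaluate.answer_exact_match)

# After this change, __call__ always returns an EvaluationResult.
result = evaluate(program)
print(result.score)                        # overall float percentage, e.g. 67.30
for example, prediction, score in result.results:
    print(score)                           # per-example score
```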

@chenmoneygithub (Collaborator) left a comment


Solid work! LGTM with one minor comment.

Let's talk offline about the potential breakage and align on the release schedule.

@TomeHirata force-pushed the feat/evaluate-response branch from 7aeb618 to ed8fd13 on March 27, 2025 at 00:35
@Nasreddine

Please merge this PR so that we can get the individual example-level evaluation scores. This will be useful for MLflow tracing.

@Nasreddine

Hi @okhat, could you review this PR when you have a moment? Thanks!

@TomeHirata force-pushed the feat/evaluate-response branch from cb0c5bd to a718520 on June 17, 2025 at 01:19
def __eq__(self, other):
    if isinstance(other, (float, int)):
        return self.__float__() == other
    elif isinstance(other, Prediction):
@okhat (Collaborator) commented on Jun 19, 2025


Hmm this is really dangerous! It's a really bad idea. Why did we add this?

@TomeHirata (Collaborator, Author) commented on Jun 19, 2025


This was added mainly for two reasons (sketched below):

  • To support existing logic that compares evaluation outputs for equality. Users may also write eval(program_1) == eval(program_2) in their code.
  • Completeness of the comparison operators on dspy.Prediction: it would be inconsistent for >= to work on dspy.Prediction while == does not.
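
To illustrate the pattern, here is a self-contained sketch (not the actual dspy.Prediction implementation, whose details may differ): a result object that defines __float__ can delegate == and >= to its numeric score, so comparisons between two evaluation results and against plain numbers both work.

```python
# Self-contained sketch of the comparison pattern described above; the real
# dspy.Prediction / EvaluationResult implementation may differ in details.
class ScoredResult:
    def __init__(self, score, results):
        self.score = score        # overall percentage score
        self.results = results    # per-example (example, prediction, score) tuples

    def __float__(self):
        return float(self.score)

    def __eq__(self, other):
        if isinstance(other, (float, int)):
            return float(self) == other
        elif isinstance(other, ScoredResult):
            return float(self) == float(other)
        return NotImplemented

    def __ge__(self, other):
        if isinstance(other, (float, int, ScoredResult)):
            return float(self) >= float(other)
        return NotImplemented


a, b = ScoredResult(67.30, []), ScoredResult(67.30, [])
assert a == b        # eval(program_1) == eval(program_2) style comparison
assert a == 67.30    # comparing against a plain number, via __float__
assert a >= 50       # >= and == stay consistent with each other
```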

@TomeHirata merged commit d52f7a5 into stanfordnlp:main on Jun 19, 2025
11 checks passed
Successfully merging this pull request may close: [FR] DSPy : Log examples evaluation scores
