
Conversation

@TomeHirata (Collaborator) commented on Mar 24, 2025

In this PR, we change the output interface of Evaluate.__call__.
Instead of returning score, (score, outputs), or (score, scores, outputs) depending on the arguments, it now always returns an EvaluationResult containing the following fields:

  • score: a float percentage score (e.g., 67.30) representing overall performance
  • results: a list of (example, prediction, score) tuples, one per example in the devset

Since this is a breaking change, it should go out in the next minor release rather than a patch release.

Resolve mlflow/mlflow#15476
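
For reference, a minimal sketch of how the new return value is consumed; the program, devset, metric, and LM below are illustrative placeholders rather than part of this PR:

```python
import dspy

# Placeholder LM and program; any configured LM / DSPy module works the same way.
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))
program = dspy.Predict("question -> answer")

devset = [
    dspy.Example(question="What is 2 + 2?", answer="4").with_inputs("question"),
]

evaluate = dspy.Evaluate(devset=devset, metric=dspy.evaluate.answer_exact_match)

# After this change, __call__ always returns an EvaluationResult.
result = evaluate(program)
print(result.score)                        # overall float percentage, e.g. 67.30
for example, prediction, score in result.results:
    print(score)                           # per-example score
```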

@chenmoneygithub (Collaborator) left a comment


Solid work! LGTM with one minor comment.

Let's talk offline about the potential breakage and align on the release schedule.

@TomeHirata force-pushed the feat/evaluate-response branch from 7aeb618 to ed8fd13 on March 27, 2025 at 00:35
@Nasreddine

Please merge this PR so that we can get the individual example-level evaluation scores. This will be useful for MLflow tracing.

@Nasreddine

Hi @okhat, could you review this PR when you have a moment? Thanks!

@TomeHirata force-pushed the feat/evaluate-response branch from cb0c5bd to a718520 on June 17, 2025 at 01:19
def __eq__(self, other):
    if isinstance(other, (float, int)):
        return self.__float__() == other
    elif isinstance(other, Prediction):
@okhat (Collaborator) commented on Jun 19, 2025


Hmm this is really dangerous! It's a really bad idea. Why did we add this?

@TomeHirata (Collaborator, Author) commented on Jun 19, 2025


This was added mainly for two reasons (sketched below):

  • To support existing logic that compares evaluation outputs for equality. Users may also write eval(program_1) == eval(program_2) in their code.
  • Completeness of the comparison operators on dspy.Prediction: it would be inconsistent for >= to work on dspy.Prediction while == does not.
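
To illustrate the pattern, here is a self-contained sketch (not the actual dspy.Prediction implementation, whose details may differ): a result object that defines __float__ can delegate == and >= to its numeric score, so comparisons between two evaluation results and against plain numbers both work.

```python
# Self-contained sketch of the comparison pattern described above; the real
# dspy.Prediction / EvaluationResult implementation may differ in details.
class ScoredResult:
    def __init__(self, score, results):
        self.score = score        # overall percentage score
        self.results = results    # per-example (example, prediction, score) tuples

    def __float__(self):
        return float(self.score)

    def __eq__(self, other):
        if isinstance(other, (float, int)):
            return float(self) == other
        elif isinstance(other, ScoredResult):
            return float(self) == float(other)
        return NotImplemented

    def __ge__(self, other):
        if isinstance(other, (float, int, ScoredResult)):
            return float(self) >= float(other)
        return NotImplemented


a, b = ScoredResult(67.30, []), ScoredResult(67.30, [])
assert a == b        # eval(program_1) == eval(program_2) style comparison
assert a == 67.30    # comparing against a plain number, via __float__
assert a >= 50       # >= and == stay consistent with each other
```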

@TomeHirata merged commit d52f7a5 into stanfordnlp:main on Jun 19, 2025
11 checks passed
Successfully merging this pull request may close: [FR] DSPy : Log examples evaluation scores
