
Enhance dspy.Refine with support for Hard/Soft Constraint Handling #8031


Open
wants to merge 11 commits into base: main

Conversation

gilad12-coder
Contributor

Description:

This PR introduces a significantly enhanced dspy.Refine module, replacing the previous implementation with a more robust, flexible, and controllable system for iteratively improving module predictions.

Motivation:

The previous Refine module primarily focused on retrying a module with varying temperatures and generating LLM-based feedback when a reward_fn score fell below a threshold. While useful, it lacked:

  1. Support for Hard Constraints: No built-in mechanism to enforce programmatic validation rules (e.g., output format, specific content requirements) beyond the soft constraints of a reward function.
  2. Granular Feedback: Feedback was solely LLM-generated based on reward scores, lacking the ability to provide direct, programmatic feedback for specific validation failures.
  3. Configurability: Limited options to control behavior beyond N attempts and the reward threshold.

Changes Introduced:

The new Refine module addresses these limitations by introducing several key features (a usage sketch follows this list):

  1. Validators (Hard Constraints):

    • Accepts an optional validators argument: a list of functions (Prediction) -> (bool, str).
    • Each validator checks the prediction. If it returns False, the accompanying string provides specific feedback on the failure.
    • Failed validations trigger retry attempts with the collected feedback messages incorporated into the prompt context (as the previous_attempts input field within a signature).
  2. Distinct Handling of Constraints:

    • Hard Constraints (validators): Must pass for a prediction to be considered valid. Failure triggers retries with specific error messages.
    • Soft Constraints (reward_fn, threshold): Evaluated only if validators pass (or if no validators are provided). Falling below the threshold can trigger retries aimed at improving quality, potentially using LLM-generated feedback (OfferFeedback).
  3. Improved Retry Logic & State Management:

    • Retries up to N times with varying temperatures.
    • Tracks the best_prediction based on a clear hierarchy:
      • Prefers predictions passing all validators.
      • Among those passing validators, prefers higher reward_fn scores (if applicable).
      • Among those failing validators, prefers predictions with fewer validation errors.
    • Uses an internal dspy.Predict or dspy.ChainOfThought (controlled by use_cot) to handle retries when validators are used, incorporating the previous_attempts log directly into the signature.
  4. Enhanced Configuration & Control:

    • verbose: Enables detailed logging of the refinement process (validation checks, reward scores, feedback).
    • fail_on_invalid: If True, raises a ValueError if no prediction meets all constraints (validators and reward threshold) after N attempts. If False (default), returns the best prediction found according to the hierarchy above.
    • use_cot: Allows using Chain-of-Thought for the internal prediction steps during refinement when validator feedback is being processed.
  5. Observability:

    • The returned Prediction object includes a Refine_metadata attribute containing details about the refinement process (iterations, success status, final reward, validation status, attempts log).
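
For illustration, here is a minimal usage sketch based on the parameters described above. Argument names and ordering are taken from this description and may differ slightly from the final implementation; the reward function shape follows the existing Refine API.

```python
import dspy

# Hypothetical example of the enhanced Refine described above.
# Assumes an LM is already configured, e.g. dspy.configure(lm=dspy.LM(...)).

def has_citation(pred) -> tuple[bool, str]:
    """Hard constraint: the answer must include a bracketed citation."""
    ok = "[" in pred.answer and "]" in pred.answer
    return ok, "" if ok else "Answer is missing a bracketed citation such as [1]."

def conciseness_reward(args, pred) -> float:
    """Soft constraint: shorter answers score higher (kept within [0, 1])."""
    return max(0.0, 1.0 - len(pred.answer.split()) / 100)

qa = dspy.ChainOfThought("question -> answer")

refined = dspy.Refine(
    module=qa,
    N=3,                        # retry up to 3 times with varying temperatures
    reward_fn=conciseness_reward,
    threshold=0.5,
    validators=[has_citation],  # hard constraints, checked before the reward
    use_cot=True,               # Chain-of-Thought for the internal retry steps
    verbose=True,               # log validation checks, rewards, and feedback
    fail_on_invalid=False,      # return the best attempt instead of raising
)

result = refined(question="Who introduced the transformer architecture?")
print(result.answer)
print(result.Refine_metadata)   # iterations, success status, attempts log, ...
```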

Breaking Changes:

  • The constructor signature has changed significantly. Users migrating will need to update their initialization calls (see the sketch below).
  • The core behavior is different due to the introduction of validators and the prioritized execution flow (validators first, then reward).
  • The fail_count parameter is removed, replaced by fail_on_invalid.
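
A rough before/after of the constructor call, to give migrating users a sense of the change. The "before" keywords follow the current release; the "after" keywords follow the description above and should be treated as tentative. `qa`, `reward`, and `my_validator` are placeholders.

```python
# Before: the current dspy.Refine, driven purely by reward_fn/threshold.
refine = dspy.Refine(module=qa, N=3, reward_fn=reward, threshold=0.7, fail_count=1)

# After (this PR, tentative): fail_count is removed; validators and
# fail_on_invalid cover hard constraints and failure behavior instead.
refine = dspy.Refine(
    module=qa,
    N=3,
    reward_fn=reward,
    threshold=0.7,
    validators=[my_validator],   # my_validator: (Prediction) -> (bool, str)
    fail_on_invalid=True,        # raise ValueError if nothing passes after N attempts
)
```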

@gilad12-coder gilad12-coder changed the title Enhanced dspy.Refine with Validators, Granular Control, and Improved Feedback Enhance dspy.Refine with Validators for Flexible Hard/Soft Constraint Handling Mar 30, 2025
@gilad12-coder gilad12-coder changed the title Enhance dspy.Refine with Validators for Flexible Hard/Soft Constraint Handling Enhance dspy.Refine with support for Hard/Soft Constraint Handling Mar 30, 2025
@gilad12-coder
Contributor Author

Heads up: the LLM sometimes struggles to reliably format the complex advice dictionary requested by the OfferFeedback signature (used in Refine). There's a try-except fallback to "N/A", but it means the generated feedback might not always be effective. See _get_feedback in Refine.
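
The fallback is roughly of this shape (a paraphrased sketch, not the PR's exact code; attribute and field names here are illustrative):

```python
def _get_feedback(self, **feedback_inputs):
    # Paraphrased sketch only: the real _get_feedback in this PR may differ.
    try:
        # Ask the OfferFeedback-backed module for structured advice.
        advice = self.feedback_module(**feedback_inputs).advice
        return str(advice)
    except Exception:
        # The LM sometimes fails to produce the structured advice reliably,
        # so degrade to "N/A" rather than aborting the refinement loop.
        return "N/A"
```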

Contributor

@zbambergerNLP left a comment


Looks excellent! Though it's worth retaining the per-submodule approach of OfferFeedback rather than globalizing its scope to the entire program. Perhaps add a TODO to support a global-level feedback signature and let users select which they'd prefer. Also worth supporting Boolean thresholds/rewards, and clearly specifying that rewards/verifier scores are bounded in the range [0, 1].

Comment on lines 38 to 44
feedback: str = OutputField(
    desc=(
        "Provide concrete and actionable feedback for the module to improve"
        " its output on similar inputs in the future. Focus on how to"
        " achieve a score >= the threshold. If the module met the"
        " threshold, write N/A."
    )
Contributor


We may want this to remain a dict[str, str] and maintain the functionality of advice -- notably the breakdown of feedback by specific sub-module within the program. The existing description seems good for eliciting feedback from the LLM, but we should ask for this type of feedback for every sub-module, not the program as a whole.

Contributor


When calling a module that wraps the OfferFeedback signature, you'll likely need to format the dictionary into a single string, which is then in turn formatted into the previous_attempts field in Refine.
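
Something along these lines should work (illustrative helper, not code from this PR):

```python
def format_advice(advice: dict[str, str]) -> str:
    """Flatten per-module advice into one string for the `previous_attempts` field."""
    return "\n".join(f"[{module}] {feedback}" for module, feedback in advice.items())

# format_advice({"retrieve": "Cite the passage id.", "answer": "Be more concise."})
# -> "[retrieve] Cite the passage id.\n[answer] Be more concise."
```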

Contributor Author


Done

Comment on lines 17 to 22
In the discussion, analyze why the module failed to meet the desired score threshold based on its inputs and outputs.
Consider the trajectory, reward function, and target score. Assign blame if applicable.
Then, prescribe concrete, actionable advice for how the module should modify its behavior on its future input
when we retry the process, especially if it receives the same or similar inputs.
The module will not see its own history directly, only the feedback you provide.
Focus on specific examples from the trajectory and clear instructions for improvement.
Contributor


We should still aim to offer feedback at the per-submodule level of the program. I would retain the existing functionality of OfferFeedback wherever possible, perhaps slightly adjusting the wording of the descriptions for how the LM should provide feedback.

Contributor Author


Done

Comment on lines 204 to 206
assert isinstance(
    threshold, (float, int)
), "`threshold` must be a numerical value."
Contributor


Nit: can likely fit on one line. Worth supporting booleans as well.

Contributor Author


Done

Contributor Author


Done

history of previous outputs, scores, and feedback via the `previous_attempts` input field.

Example:
>>> import dspy
Collaborator


This style breaks on mkdocs (our doc site); let's use the same style as:

Contributor Author


Done

result = best_of_3(question="What is the capital of Belgium?").answer
# Returns: Brussels
```
signature (Signature): The DSPy signature defining inputs/outputs for the module being refined.
Collaborator


nit: just a 4-space indent after the line break instead of vertical alignment.

Contributor Author


Done

@AriMKatz

AriMKatz commented May 11, 2025

@gilad12-coder is this gonna go into 3.0? Anything blocking it? Thanks for your work!
