Evaluation Thoughts #112

@bpblanken

The higher-level questions are:

  • Are traditional classification ML metrics (counting true positives, false positives, true negatives, etc., and combining them as precision/recall/F1) sufficient for measuring LLM accuracy on structured extraction?

    • Both as a one-time computation and as an ongoing evaluation.
  • We also extract "relationships" between entities, where the entities are effectively graph nodes. Is there a different metric for these?

  • What other metrics are useful for LLM monitoring (tool call accuracy, reasoning robustness, faithfulness)?

    • Is every claim correctly grounded in the text?
  • How do we monitor for prompt regressions/improvements?

  • Is there tooling for optimizing a prompt for a specific result?

  • How do we scalably backtest different models?

  • How do frameworks like deepeval (https://deepeval.com/) fit in?

  • How do we monitor for tool call failures due to bad inputs or network downtime?

  • How do we back-propagate feedback from the user into the improvement process?
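
To make the first two questions concrete, here is a minimal sketch of exact-match precision/recall/F1 applied to both extracted entities and extracted relationships. It assumes entities can be represented as (type, value) tuples and relationships as (head, relation, tail) triples; the `Score` class, the `score_sets` helper, and the sample data are all hypothetical and not taken from our pipeline. One thing it illustrates: true negatives are not well defined for open-ended extraction (there is no finite set of "things correctly not extracted"), so only TP/FP/FN are counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    """Counts for one comparison of predicted vs. gold items."""
    tp: int  # predicted and present in gold
    fp: int  # predicted but absent from gold
    fn: int  # in gold but not predicted

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def score_sets(predicted: set, gold: set) -> Score:
    """Exact-match scoring: an item is correct only if it appears in both sets."""
    return Score(
        tp=len(predicted & gold),
        fp=len(predicted - gold),
        fn=len(gold - predicted),
    )


# Hypothetical sample data: entities as (type, value) tuples.
gold_entities = {("PERSON", "Ada Lovelace"), ("ORG", "Analytical Society"), ("PLACE", "London")}
pred_entities = {("PERSON", "Ada Lovelace"), ("ORG", "Analytical Society"), ("PLACE", "Paris")}

# Hypothetical sample data: relationships as (head, relation, tail) triples,
# i.e. labeled edges between entity nodes.
gold_relations = {
    ("Ada Lovelace", "member_of", "Analytical Society"),
    ("Ada Lovelace", "born_in", "London"),
}
pred_relations = {("Ada Lovelace", "member_of", "Analytical Society")}

entity_score = score_sets(pred_entities, gold_entities)
relation_score = score_sets(pred_relations, gold_relations)

print(f"entities:  P={entity_score.precision:.2f} R={entity_score.recall:.2f} F1={entity_score.f1:.2f}")
print(f"relations: P={relation_score.precision:.2f} R={relation_score.recall:.2f} F1={relation_score.f1:.2f}")
```

Treating a relationship as a whole triple means an edge only counts if both endpoints and the label match, so entity errors propagate into the relation score. Whether that strictness is desirable, or whether partial credit or fuzzy matching is needed, is part of the open question above.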
