Evaluation Thoughts

The higher-level questions are:

- Are traditional classification ML metrics (counting true positives, false positives, true negatives, and false negatives, then combining them into precision, recall, and F1) sufficient for measuring LLM accuracy on structured extraction?
  - Both as a one-time computation and as an ongoing evaluation?
- We also extract "relationships" between entities, which are effectively graph nodes. Is there a different metric for these?
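
To make the first two questions concrete, here is a minimal sketch of set-based precision/recall/F1 applied to both entities and relationships. It assumes (this is not settled anywhere in this issue) that entities can be compared as exact-match `(type, value)` tuples and relationships as `(subject, predicate, object)` triples. The `set_prf` helper and all example data are hypothetical. Note that true negatives are not well defined for open-ended extraction, so only TP/FP/FN are counted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PRF:
    precision: float
    recall: float
    f1: float


def set_prf(predicted: set, gold: set) -> PRF:
    """Exact-match precision/recall/F1 between two sets of hashable items."""
    tp = len(predicted & gold)   # extracted and correct
    fp = len(predicted - gold)   # extracted but not in the gold set
    fn = len(gold - predicted)   # in the gold set but missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return PRF(precision, recall, f1)


# Hypothetical gold labels and model output for one document.
gold_entities = {("PERSON", "Ada Lovelace"), ("ORG", "Analytical Engines Ltd")}
pred_entities = {("PERSON", "Ada Lovelace"), ("ORG", "Acme")}

# Relationships treated as edges: (subject, predicate, object).
gold_edges = {("Ada Lovelace", "WORKS_AT", "Analytical Engines Ltd")}
pred_edges = {("Ada Lovelace", "WORKS_AT", "Acme")}

entity_scores = set_prf(pred_entities, gold_entities)
edge_scores = set_prf(pred_edges, gold_edges)
print(entity_scores)
print(edge_scores)
```

In this example one wrong entity halves the entity scores but zeroes the edge scores, because an edge only matches when both endpoints and the predicate are right. That coupling is one reason relationships may deserve their own metric, or a relaxed matching rule.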

- What other metrics are useful for LLM monitoring (tool-call accuracy, reasoning robustness, faithfulness)?
  - Is every claim correctly grounded in the text?
- How do we monitor for prompt regressions and improvements?
- Is there tooling for optimizing a prompt for a specific result?
- How do we backtest different models at scale?
- How do frameworks like deepeval (https://deepeval.com/) fit in?
- How do we monitor for tool-call failures due to bad inputs or network downtime?
- How do we propagate user feedback back into the improvement process?
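
One possible starting point for the prompt-regression and backtesting questions is to score a fixed evaluation set under a baseline and a candidate (a new prompt or a different model), then diff the per-example scores. The sketch below assumes each run has already been reduced to a `{example_id: score}` mapping. The `compare_runs` function, the `tolerance` parameter, and the example IDs are all hypothetical and not tied to any framework.

```python
from statistics import mean


def compare_runs(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.0,
) -> dict:
    """Diff per-example scores for examples present in both runs."""
    shared = sorted(baseline.keys() & candidate.keys())
    # An example regresses if the candidate scores lower by more than tolerance.
    regressed = [k for k in shared if candidate[k] < baseline[k] - tolerance]
    improved = [k for k in shared if candidate[k] > baseline[k] + tolerance]
    return {
        "n": len(shared),
        "baseline_mean": mean(baseline[k] for k in shared) if shared else 0.0,
        "candidate_mean": mean(candidate[k] for k in shared) if shared else 0.0,
        "regressed": regressed,
        "improved": improved,
    }


# Hypothetical per-document F1 scores under two prompt versions.
baseline_scores = {"doc-1": 1.0, "doc-2": 0.5, "doc-3": 0.8}
candidate_scores = {"doc-1": 1.0, "doc-2": 0.75, "doc-3": 0.5}

report = compare_runs(baseline_scores, candidate_scores)
print(report["regressed"], report["improved"])
```

Looking at per-example changes rather than only the mean matters here: the two averages are close, yet one document got noticeably worse, which an aggregate-only dashboard would hide.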

Evaluation Thoughts #112
