The higher-level questions are:
- Are traditional classification ML metrics (counting true positives, false positives, true negatives, etc., and combining them into precision, recall, and F1) sufficient for measuring LLM accuracy on structured extraction?
  - Both as a one-time computation and as an ongoing evaluation.
- We also extract "relationships" between entities, which effectively makes the entities graph nodes and the relationships edges. Is there a different metric for these?
- What other metrics are useful for LLM monitoring (tool call accuracy, reasoning robustness, faithfulness)?
  - Is every claim correctly grounded in the text?
- How do we monitor for prompt regressions and improvements?
- Is there tooling for optimizing a prompt for a specific result?
- How do we backtest different models at scale?
- How do frameworks like deepeval (https://deepeval.com/) fit in?
- How do we monitor for tool call failures due to bad inputs or network downtime?
- How do we feed user feedback back into the improvement process?
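
To make the first two questions concrete, here is a minimal sketch of how the traditional counts could be applied to extraction output. It assumes exact-match comparison between predicted and gold items, with entities represented as `(text, type)` tuples and relationships as `(subject, predicate, object)` triples. The names `Scores` and `score_sets` and all example data are hypothetical. True negatives are left out because they are usually not well defined for open-ended extraction, where there is no fixed set of items the model could have declined to extract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scores:
    tp: int  # items present in both prediction and gold
    fp: int  # items predicted but absent from gold
    fn: int  # gold items the prediction missed

    @property
    def precision(self) -> float:
        denom = self.tp + self.fp
        return self.tp / denom if denom else 0.0

    @property
    def recall(self) -> float:
        denom = self.tp + self.fn
        return self.tp / denom if denom else 0.0

    @property
    def f1(self) -> float:
        p, r = self.precision, self.recall
        return 2 * p * r / (p + r) if (p + r) else 0.0


def score_sets(predicted: set, gold: set) -> Scores:
    """Exact-match scoring: an item counts only if it appears in both sets."""
    return Scores(
        tp=len(predicted & gold),
        fp=len(predicted - gold),
        fn=len(gold - predicted),
    )


# Hypothetical example data: entities as (text, type) tuples.
gold_entities = {("Acme Corp", "ORG"), ("Jane Doe", "PERSON"), ("Berlin", "LOC")}
pred_entities = {("Acme Corp", "ORG"), ("Jane Doe", "PERSON"), ("Paris", "LOC")}

# Hypothetical example data: relationships as (subject, predicate, object) triples.
gold_relations = {
    ("Jane Doe", "works_for", "Acme Corp"),
    ("Acme Corp", "located_in", "Berlin"),
}
pred_relations = {("Jane Doe", "works_for", "Acme Corp")}

entity_scores = score_sets(pred_entities, gold_entities)
relation_scores = score_sets(pred_relations, gold_relations)

print(f"entities:  P={entity_scores.precision:.2f} R={entity_scores.recall:.2f} F1={entity_scores.f1:.2f}")
print(f"relations: P={relation_scores.precision:.2f} R={relation_scores.recall:.2f} F1={relation_scores.f1:.2f}")
```

Treating a relationship as a triple lets the same set-based counting apply to graph edges, but exact match is strict: a triple with the right endpoints and a slightly different predicate label scores as one false positive plus one false negative. Whether that strictness is acceptable, or whether partial-match or graph-level metrics are needed, is part of the open question above.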