Because
Latest metrics in the 20230824 evaluation are pretty low. This could be due to several reasons:
Comparison of Span is stricter, since it is based on character offsets rather than the strings themselves. A system annotation must have exactly the same character offsets as a gold annotation to count as a match; substrings no longer count.
Annotations are additionally compared on their type property (i.e. the category of the system entity must match that of the gold annotation). Even if a system annotation has the correct Span and KBID, it is still a miss if its type does not match the gold.
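To make the two stricter criteria concrete, here is a minimal Python sketch. The `Annotation` class, its field names (`start`, `end`, `kbid`, `type`), and the example values are hypothetical stand-ins, not the evaluation code's actual data model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """Hypothetical annotation record; field names are assumptions."""
    start: int  # character offset where the span begins
    end: int    # character offset where the span ends
    kbid: str   # knowledge-base identifier
    type: str   # entity category


def strict_match(system: Annotation, gold: Annotation) -> bool:
    """True only if offsets, KBID, and type are all identical."""
    return system == gold


# Made-up example values for illustration only.
gold = Annotation(10, 24, "KB-1", "person")
substring = Annotation(10, 17, "KB-1", "person")         # covers only part of the gold span
wrong_type = Annotation(10, 24, "KB-1", "organization")  # right Span and KBID, wrong type

print(strict_match(gold, gold))
print(strict_match(substring, gold))
print(strict_match(wrong_type, gold))
```

Under these assumptions, only the first comparison counts as a match: the substring fails on offsets and the third annotation fails on type alone.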
It could be insightful to add a more fine-grained evaluation for each annotation property, specifically by computing precision, recall, and F1 for (some options):
Span alone?
Span + KBID?
Span + type?
If metrics are particularly low for one of these compared to others, it might show where the app could be improved.
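One possible shape for such a breakdown is sketched below in Python. It is only an illustration under stated assumptions: the `Ann` tuple, the `KEYS` table, the `prf` helper, and the sample annotations are all hypothetical and not taken from the existing evaluation code. Each option projects an annotation onto a subset of its properties, then scores the system set against the gold set by exact match on that projection.

```python
from typing import Callable, Iterable, NamedTuple, Tuple


class Ann(NamedTuple):
    """Hypothetical annotation; field names are assumptions."""
    start: int
    end: int
    kbid: str
    type: str


# Each evaluation option maps an annotation to the properties it is compared on.
KEYS = {
    "span": lambda a: (a.start, a.end),
    "span+kbid": lambda a: (a.start, a.end, a.kbid),
    "span+type": lambda a: (a.start, a.end, a.type),
}


def prf(system: Iterable[Ann], gold: Iterable[Ann],
        key: Callable[[Ann], tuple]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 by exact match on the projected properties.

    Annotations are compared as sets, so duplicates under a projection
    collapse; this assumes all annotations come from a single document.
    """
    sys_keys = {key(a) for a in system}
    gold_keys = {key(a) for a in gold}
    tp = len(sys_keys & gold_keys)
    precision = tp / len(sys_keys) if sys_keys else 0.0
    recall = tp / len(gold_keys) if gold_keys else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Made-up data: one type error, one KBID error, one spurious annotation.
gold = [Ann(0, 5, "K1", "person"), Ann(10, 15, "K2", "location")]
system = [
    Ann(0, 5, "K1", "organization"),
    Ann(10, 15, "K9", "location"),
    Ann(20, 25, "K3", "person"),
]

scores = {name: prf(system, gold, key) for name, key in KEYS.items()}
for name, (p, r, f) in scores.items():
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f:.2f}")
```

In this toy data, Span alone scores higher than either combination, which would suggest that span detection is fine and that linking and typing are where the misses come from.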
Done when
More fine-grained evaluation is implemented (or we decide it's not necessary).
Additional context
No response