Smatch is non-deterministic and does not yield score=1 for the same input/output graph #43
Comments
Hi, I implemented SMATCH++, which also includes an ILP solver. It should be very simple to run. Please do:
Then simply run:
The output is F1: 100.0 Precision: 100.0 Recall: 100.0
@BramVanroy the reason your example does not work well is that it contains many similar components. The hill-climbing search implemented in the current smatch package can therefore settle on different node matchings from run to run (for example, matching the first "team" in the first AMR to the second or third "team" in the second AMR), which yields different scores. This follows from the NP-completeness of the semantic graph matching problem: computing the exact solution is hard and time-consuming, so we used hill climbing to speed up the search, at the cost of depending on the initialization. Unfortunately, I no longer have the time to actively work on and improve this. Please feel free to try the smatch++ mentioned above to see whether its speed and accuracy work for your case.
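To make the dependence on initialization concrete, here is a minimal toy sketch. It is not the smatch package's actual code: the graph, the swap-based `hill_climb`, and the brute-force `exact` search are all hypothetical, written only for illustration. The graph has three structurally similar "team" nodes, and it is scored against itself.

```python
import itertools
import random

# Toy graph with three similar "team" components. Triples are
# (relation, source_node, target); the target is a node id (int)
# or a concept label (str) for "instance" triples.
TRIPLES = [
    ("instance", 0, "and"),
    ("instance", 1, "team"), ("instance", 2, "team"), ("instance", 3, "team"),
    ("instance", 4, "red"), ("instance", 5, "blue"), ("instance", 6, "green"),
    ("op1", 0, 1), ("op2", 0, 2), ("op3", 0, 3),
    ("mod", 1, 4), ("mod", 2, 5), ("mod", 3, 6),
]
N_NODES = 7


def count_matches(triples_a, triples_b, mapping):
    """Count triples of A that occur in B after renaming A's nodes via mapping."""
    renamed = set()
    for rel, src, tgt in triples_a:
        new_tgt = mapping[tgt] if isinstance(tgt, int) else tgt
        renamed.add((rel, mapping[src], new_tgt))
    return len(renamed & set(triples_b))


def hill_climb(triples_a, triples_b, n, rng):
    """Start from a random node mapping and accept swaps that improve the score."""
    mapping = list(range(n))
    rng.shuffle(mapping)
    best = count_matches(triples_a, triples_b, mapping)
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(n), 2):
            mapping[i], mapping[j] = mapping[j], mapping[i]
            score = count_matches(triples_a, triples_b, mapping)
            if score > best:
                best, improved = score, True
            else:
                # Undo the swap when it does not help.
                mapping[i], mapping[j] = mapping[j], mapping[i]
    return best


def exact(triples_a, triples_b, n):
    """Brute-force optimum over all node mappings (only feasible for tiny graphs)."""
    return max(
        count_matches(triples_a, triples_b, list(perm))
        for perm in itertools.permutations(range(n))
    )


optimum = exact(TRIPLES, TRIPLES, N_NODES)
scores = [
    hill_climb(TRIPLES, TRIPLES, N_NODES, random.Random(seed)) for seed in range(20)
]
print("optimum:", optimum)
print("hill-climbing scores per seed:", scores)
```

The exhaustive search always recovers every triple when a graph is compared with itself. A hill-climbing run can never exceed that optimum, and depending on its starting mapping it may stop at a local maximum below it.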
Thanks for the reply @snowblink14. I'm mostly worried about the random differences. Wouldn't it make sense to fix the randomness so that everyone using it experiences the same behavior? As it stands, results are quite hard to reproduce (the example I gave is taken from the frequently used AMR 3.0 corpus).
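As a rough illustration of what "fixing the randomness" could look like, the sketch below seeds Python's global `random` module before a randomized step. It assumes the scorer draws its random initializations from that module; `toy_restart_orders` is a hypothetical stand-in, not a function from the smatch package.

```python
import random


def toy_restart_orders(n_nodes, n_restarts):
    """Hypothetical stand-in for a scorer's random initialization step."""
    orders = []
    for _ in range(n_restarts):
        order = list(range(n_nodes))
        random.shuffle(order)  # uses the global random state
        orders.append(order)
    return orders


# Re-seeding before each call makes the "random" starting points repeat.
random.seed(1234)
first = toy_restart_orders(5, 4)
random.seed(1234)
second = toy_restart_orders(5, 4)

print(first == second)  # True: same seed, same sequence of shuffles
```

Seeding only makes the starting points repeatable. It does not make the resulting alignment any closer to the optimum.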
Fixing the random seed is a hack that doesn't really help with anything. The main problem with the hill-climber is that it has no useful upper bound: you never know how far you are from the best solution, and therefore you cannot know whether you have found it. ILP gives the optimal alignment. Yes, the problem is NP-complete, but ILP works for the standard evaluation setup. NP-completeness also doesn't mean that you are left without a useful score in the hypothetical case where the solver doesn't find the optimal solution. Even intermediate solutions can be better than hill-climbing (the hill-climber deteriorates even more for large graphs), and with ILP you always have a meaningful upper bound that tells you the quality of the current solution. You can read more in my paper
I found two quirks in the following example.
Perhaps something is wrong with my calculate_smatch function, but I do not think so. (It is modified from score_amr_pairs.)
Output
The non-determinism is very worrying to me. If an evaluation metric is not deterministic, how then can we compare systems to each other in a fair way? A difference of 0.92 vs 0.87 is massive for the same input/output.