Smatch is non-deterministic and does not yield score=1 for the same input/output graph #43

BramVanroy · 2023-08-08T08:42:54Z

I found two quirks in the following example.

even though the smatch score is calculated between the same graph, the f1 score is not 1 (far from it)
the scores are non-deterministic. Sometimes the score is 0.9, then 0.92, then 0.87, etc.

Perhaps something is wrong with my calculate_smatch function, but I do not think so. (It is modified from score_amr_pairs.)

from typing import List

import smatch


def calculate_smatch(refs_penman: List[str], preds_penman: List[str]):
    total_match_num = total_test_num = total_gold_num = 0
    n_invalid = 0

    for sentid, (ref_penman, pred_penman) in enumerate(zip(refs_penman, preds_penman), 1):
        best_match_num, test_triple_num, gold_triple_num = smatch.get_amr_match(
            ref_penman, pred_penman, sent_num=sentid
        )

        total_match_num += best_match_num
        total_test_num += test_triple_num
        total_gold_num += gold_triple_num
        # clear the matching triple dictionary for the next AMR pair
        smatch.match_triple_dict.clear()

    score = smatch.compute_f(total_match_num, total_test_num, total_gold_num)

    return {
        "smatch_precision": score[0],
        "smatch_recall": score[1],
        "smatch_fscore": score[2],
        "ratio_invalid_amrs": n_invalid / len(preds_penman) * 100,
    }


s = """(r / result-01
   :ARG1 (c / compete-01
            :ARG0 (w / woman)
            :mod (p / preliminary)
            :time (t / today)
            :mod (p2 / polo
                     :mod (w2 / water)))
   :ARG2 (a / and
            :op1 (d / defeat-01
                    :ARG0 (t2 / team
                              :mod (c2 / country
                                       :wiki +
                                       :name (n / name
                                                :op1 "Hungary")))
                    :ARG1 (t3 / team
                              :mod (c3 / country
                                       :wiki +
                                       :name (n2 / name
                                                 :op1 "Canada")))
                    :quant (s / score-entity
                              :op1 13
                              :op2 7))
            :op2 (d2 / defeat-01
                     :ARG0 (t4 / team
                               :mod (c4 / country
                                        :wiki +
                                        :name (n3 / name
                                                  :op1 "France")))
                     :ARG1 (t5 / team
                               :mod (c5 / country
                                        :wiki +
                                        :name (n4 / name
                                                  :op1 "Brazil")))
                     :quant (s2 / score-entity
                                :op1 10
                                :op2 9))
            :op3 (d3 / defeat-01
                     :ARG0 (t6 / team
                               :mod (c6 / country
                                        :wiki +
                                        :name (n5 / name
                                                  :op1 "Australia")))
                     :ARG1 (t7 / team
                               :mod (c7 / country
                                        :wiki +
                                        :name (n6 / name
                                                  :op1 "Germany")))
                     :quant (s3 / score-entity
                                :op1 10
                                :op2 8))
            :op4 (d4 / defeat-01
                     :ARG0 (t8 / team
                               :mod (c8 / country
                                        :wiki +
                                        :name (n7 / name
                                                  :op1 "Russia")))
                     :ARG1 (t9 / team
                               :mod (c9 / country
                                        :wiki +
                                        :name (n8 / name
                                                  :op1 "Netherlands")))
                     :quant (s4 / score-entity
                                :op1 7
                                :op2 6))
            :op5 (d5 / defeat-01
                     :ARG0 (t10 / team
                                :mod (c10 / country
                                          :wiki +
                                          :name (n9 / name
                                                    :op1 "United"
                                                    :op2 "States")))
                     :ARG1 (t11 / team
                                :mod (c11 / country
                                          :wiki +
                                          :name (n10 / name
                                                     :op1 "Kazakhstan")))
                     :quant (s5 / score-entity
                                :op1 10
                                :op2 5))
            :op6 (d6 / defeat-01
                     :ARG0 (t12 / team
                                :mod (c12 / country
                                          :wiki +
                                          :name (n11 / name
                                                     :op1 "Italy")))
                     :ARG1 (t13 / team
                                :mod (c13 / country
                                          :wiki +
                                          :name (n12 / name
                                                     :op1 "New"
                                                     :op2 "Zealand")))
                     :quant (s6 / score-entity
                                :op1 12
                                :op2 2))))
"""

if __name__ == "__main__":
    for _ in range(5):
        smatch_score = calculate_smatch([s], [s])
        print(smatch_score)

Output

{'smatch_precision': 0.8866666666666667, 'smatch_recall': 0.8866666666666667, 'smatch_fscore': 0.8866666666666667, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.88, 'smatch_recall': 0.88, 'smatch_fscore': 0.88, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.8666666666666667, 'smatch_recall': 0.8666666666666667, 'smatch_fscore': 0.8666666666666667, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.9266666666666666, 'smatch_recall': 0.9266666666666666, 'smatch_fscore': 0.9266666666666666, 'ratio_invalid_amrs': 0.0}
{'smatch_precision': 0.8533333333333334, 'smatch_recall': 0.8533333333333334, 'smatch_fscore': 0.8533333333333335, 'ratio_invalid_amrs': 0.0}

The non-determinism is very worrying to me. If an evaluation metric is not deterministic, how then can we compare systems to each other in a fair way? A difference of 0.92 vs 0.87 is massive for the same input/output.

The text was updated successfully, but these errors were encountered:

flipz357 · 2023-08-17T06:40:00Z

Hi,

I implemented SMATCH++, that also contains an ILP solver. It should be very simple to run.

Please do:

pip install mip==1.13.0
pip install smatchpp

Then simply run:

python -m smatchpp -a largeamr.txt -b largeamr.txt -solver ilp

The output is

F1: 100.0 Precision: 100.0 Recall: 100.0

snowblink14 · 2023-08-19T21:55:55Z

@BramVanroy the reason your example does not work well is that it contains many similar components, which the hill-climbing implemented in the current smatch package is more likely to have different node matching (like matching the first "team" in the first AMR to the 2nd/3rd.. "team" in the second AMR), thus the different scores.

This is due to the NP-completeness of this semantic graph matching problem. Computing the exact solution is hard and time-consuming, so we employed the hill-climbing method to make this solving faster, but it had the weakness of depending on the initialization.

Unfortunately I no longer have the time to actively work and improve this. Please feel free to try smatch++ mentioned above to see if the speed and accuracy works for your case.

BramVanroy · 2023-08-20T08:15:57Z

Thanks for the reply @snowblink14. I'm mostly worried about the random differences. Wouldn't it make sense to fix the randomness so that everyone using it experiences the same behavior? Now it becomes quite hard to reproduce results (the example I gave is taken from the frequently used AMR 3.0 corpus).

flipz357 · 2023-08-20T09:01:45Z

@BramVanroy

Fixing random seed is a hack that doesn't really help with anything. The main problem with the hill-climber is that is has no useful upper-bound (you never know how far off you are from the best solution and thereby not know if you got the best solution).

ILP gives optimal alignment, and yes, it is NP complete, but it works for standard evaluation setup. NP complete also doesn't mean that you will not have a useful score in a hypothetical case where it doesn't find the optimal solution. Even intermediate solutions can be better than hill-climbing (hill-climber deteriorates even more for large graphs), and with ilp you will always have a meaningful upper bound (telling you the quality of the current solution).

You can read more in my paper

BramVanroy mentioned this issue Aug 8, 2023

Different triple order = low smatch score. How to avoid automatic inversion? goodmami/penman#117

Closed

flipz357 mentioned this issue Nov 30, 2023

The Smatch metric of amrlib has problem bjascob/amrlib#61

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Smatch is non-deterministic and does not yield score=1 for the same input/output graph #43

Smatch is non-deterministic and does not yield score=1 for the same input/output graph #43

BramVanroy commented Aug 8, 2023 •

edited

Loading

flipz357 commented Aug 17, 2023

snowblink14 commented Aug 19, 2023

BramVanroy commented Aug 20, 2023

flipz357 commented Aug 20, 2023

Smatch is non-deterministic and does not yield score=1 for the same input/output graph #43

Smatch is non-deterministic and does not yield score=1 for the same input/output graph #43

Comments

BramVanroy commented Aug 8, 2023 • edited Loading

flipz357 commented Aug 17, 2023

snowblink14 commented Aug 19, 2023

BramVanroy commented Aug 20, 2023

flipz357 commented Aug 20, 2023

BramVanroy commented Aug 8, 2023 •

edited

Loading