Suggestions for new evaluation metrics. #551

Open
richard-tang199 opened this issue Jun 4, 2024 · 1 comment
Labels: new feature, question

Comments


richard-tang199 commented Jun 4, 2024

Description

The "AER: Auto-Encoder with Regression for Time Series Anomaly Detection" paper states: "Anomaly scoring is based on overlapping segments: a true positive (TP) if a known anomalous window overlaps any detected windows, a false negative (FN) if a known anomalous window does not overlap any detected windows, and a false positive (FP) if a detected window does not overlap any known anomalous region."
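
For illustration, here is a minimal sketch of that overlap-based counting, assuming ground-truth and detected anomalies are given as (start, end) pairs; the function names are ours and not part of this library's API.

```python
# Sketch of overlap-based scoring: windows are (start, end) pairs.
# Names are illustrative only, not this library's actual API.

def overlaps(a, b):
    """Return True if two (start, end) windows overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def overlap_counts(true_windows, detected_windows):
    # TP: known anomalous windows that overlap at least one detected window
    tp = sum(any(overlaps(t, d) for d in detected_windows) for t in true_windows)
    # FN: known anomalous windows that overlap no detected window
    fn = len(true_windows) - tp
    # FP: detected windows that overlap no known anomalous window
    fp = sum(not any(overlaps(d, t) for t in true_windows) for d in detected_windows)
    return tp, fp, fn

tp, fp, fn = overlap_counts([(10, 20), (50, 60)], [(15, 18), (80, 90)])
print(tp, fp, fn)  # 1 1 1: one hit, one spurious detection, one missed anomaly
```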

We find that this library uses an evaluation method resembling point adjustment (PA), which can overestimate a model's performance. For details, please refer to thuml/Anomaly-Transformer#65 and the paper "Towards a Rigorous Evaluation of Time-Series Anomaly Detection" [1].

Please consider the recently proposed F1(PA%K) [1] and VUS-PR [2] metrics for a more rigorous evaluation.

[1] Siwon Kim, Kukjin Choi, Hyun-Soo Choi, Byunghan Lee, and Sungroh Yoon. 2022. Towards a Rigorous Evaluation of Time-Series Anomaly Detection. Proceedings of the AAAI Conference on Artificial Intelligence 36, 7 (June 2022), 7194–7201. https://doi.org/10.1609/aaai.v36i7.20680
[2] John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S. Tsay, Aaron Elmore, and Michael J. Franklin. 2022. Volume under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection. Proceedings of the VLDB Endowment 15, 11 (July 2022), 2774–2787. https://doi.org/10.14778/3551793.3551830
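
For concreteness, below is a simplified sketch of the PA%K adjustment described in [1] for a single threshold K; it is our own illustration, not the authors' reference implementation. F1(PA%K) is then obtained by computing F1 on the adjusted predictions and averaging over a range of K values.

```python
import numpy as np

def pa_percent_k(y_true, y_pred, k=0.2):
    """Sketch of the PA%K adjustment from [1]: a ground-truth anomaly segment
    is fully credited (all of its points marked as detected) only when at
    least a fraction k of its points were detected in the first place."""
    adjusted = y_pred.copy()
    in_segment = False
    for i, label in enumerate(np.append(y_true, 0)):  # sentinel closes the last segment
        if label and not in_segment:
            start, in_segment = i, True
        elif not label and in_segment:
            if y_pred[start:i].mean() >= k:
                adjusted[start:i] = 1
            in_segment = False
    return adjusted

y_true = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
print(pa_percent_k(y_true, y_pred, k=0.2))  # first segment credited (1/4 >= 0.2), second not
```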

@sarahmish (Collaborator) commented

Hi @richard-tang199 – thank you for raising this issue.

The AER paper adopts the overlapping evaluation method proposed in the NASA paper. This evaluation strategy has been used in many papers, and all the models in the benchmark are scored with the same strategy, so the comparison across models is consistent.

I would like to note that there are differences between the mentioned overlap strategy and point adjustment (PA). In PA, the model is rewarded for predicting even a single point as anomalous if it overlaps with a ground-truth anomaly. The model is therefore incentivized to predict single points as anomalies at various locations, hoping that one of them overlaps with a true anomaly. The false-positive penalty in this situation is not severe because the number of normal instances substantially outweighs the number of anomalies.
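
To make this concrete, here is a small sketch with made-up numbers (not our evaluation code) showing how plain point adjustment can turn a single correctly placed point into near-perfect point-wise F1:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical series: one long anomaly in a mostly normal signal.
y_true = np.zeros(1000, dtype=int)
y_true[400:500] = 1                # a 100-point ground-truth anomaly

# A model that fires on a few scattered single points.
y_pred = np.zeros(1000, dtype=int)
y_pred[[50, 250, 450, 750]] = 1    # only index 450 falls inside the anomaly

# Point adjustment: one hit inside the segment credits the whole segment.
adjusted = y_pred.copy()
if y_pred[400:500].any():
    adjusted[400:500] = 1

print(round(f1_score(y_true, y_pred), 2))    # ≈0.02 without adjustment
print(round(f1_score(y_true, adjusted), 2))  # ≈0.99 after adjustment
```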
In our evaluation method, we do not concern ourselves with the normal sequences; we evaluate the ability to find anomalies without looking at the normal class. Moreover, all anomalies are predicted as intervals with (start, end) timestamps of varying lengths, which reduces the chance of exploiting single-point predictions as mentioned earlier.

Nevertheless, we appreciate the suggestion of including more metrics in our evaluation criteria, as we are always interested in a comprehensive evaluation of all models. We also welcome their implementation in our evaluation sub-package if you would like to take the opportunity and contribute.

Please let me know if you have any further questions.

@sarahmish changed the title from "Concerns about the evaluation metric." to "Suggestions for new evaluation metrics." on Jun 11, 2024
@sarahmish added the "question" and "new feature" labels on Jun 11, 2024