According to "Anomaly scoring is based on overlapping segments: a true positive (TP) if a known anomalous window overlaps any detected windows, a false negative (FN) if a known anomalous window does not overlap any detected windows, and a false positive (FP) if a detected window does not overlap any known anomalous region" in the "AER: Auto-Encoder with Regression for Time Series Anomaly Detection" paper.
We find that this library uses an evaluation method resembling point-adjustment (PA), which can overestimate the model's performance. Details please refer to thuml/Anomaly-Transformer#65 and paper "Towards a Rigorous Evaluation of Time-Series Anomaly Detection".
Please use the recent proposed F1(PA%K) [1] and VUS-PR [2] metric for a rigorous evaluation.
[1] Siwon Kim, Kukjin Choi, Hyun-Soo Choi, Byunghan Lee, and Sungroh Yoon. 2022. Towards a Rigorous Evaluation of Time-Series Anomaly Detection. Proceedings of the AAAI Conference on Artificial Intelligence 36, 7 (June 2022), 7194–7201. https://doi.org/10.1609/aaai.v36i7.20680
[2] John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S. Tsay, Aaron Elmore, and Michael J. Franklin. 2022. Volume under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection. Proceedings of the VLDB Endowment 15, 11 (July 2022), 2774–2787. https://doi.org/10.14778/3551793.3551830
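
To make the request concrete, below is a minimal, illustrative sketch of the PA%K adjustment from [1] applied to binary point labels; the function name `adjust_pa_k`, its arguments, and the array representation are assumptions for illustration, not code from this repository:

```python
import numpy as np

def adjust_pa_k(y_true, y_pred, k=0.2):
    """PA%K sketch: within each ground-truth anomaly segment, credit the whole
    segment as detected only if at least a fraction k of its points were
    predicted anomalous; otherwise leave the predictions unchanged."""
    y_adj = y_pred.copy()
    in_seg = False
    for i, label in enumerate(np.append(y_true, 0)):  # sentinel closes the last segment
        if label and not in_seg:
            in_seg, start = True, i
        elif not label and in_seg:
            in_seg = False
            seg = slice(start, i)
            if y_pred[seg].mean() >= k:
                y_adj[seg] = 1
    return y_adj

# Example: 2 of 5 points of an anomaly are detected (40% >= 20%),
# so the whole segment is credited after adjustment.
y_true = np.array([0, 1, 1, 1, 1, 1, 0, 0])
y_pred = np.array([0, 0, 1, 1, 0, 0, 0, 0])
print(adjust_pa_k(y_true, y_pred, k=0.2))  # [0 1 1 1 1 1 0 0]
```

As I understand [1], the point-wise F1 is then computed on the adjusted predictions, and the area under the F1 curve as K varies from 0 to 100 is reported to avoid picking a single favourable K.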
The AER paper adopts the overlapping evaluation method proposed in the NASA paper. This strategy has been used in many papers, and all the models in the benchmark are scored with the same strategy, so the comparison across models is consistent.
I would like to note that there are differences between the mentioned overlap strategy and point adjustment (PA). In PA, the model is rewarded for predicting even a single point as anomalous if it overlaps with the ground-truth anomaly. The model is therefore incentivized to predict single points at various locations, hoping that one of them overlaps with the true anomaly. The false-positive penalty in this situation is not severe because the number of normal instances substantially outweighs the number of anomalies.
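
As an illustration of that reward, here is a minimal sketch of plain point adjustment over binary labels (the helper name and array representation are assumptions, not code from this repository). Unlike the PA%K sketch above, a single correctly placed point is enough to expand the prediction over the whole ground-truth segment:

```python
import numpy as np

def adjust_pa(y_true, y_pred):
    """Plain point adjustment: if any point of a ground-truth anomaly segment
    is predicted anomalous, the whole segment counts as detected."""
    y_adj = y_pred.copy()
    in_seg = False
    for i, label in enumerate(np.append(y_true, 0)):  # sentinel closes the last segment
        if label and not in_seg:
            in_seg, start = True, i
        elif not label and in_seg:
            in_seg = False
            if y_pred[start:i].any():
                y_adj[start:i] = 1
    return y_adj

# A single detected point inside a 5-point anomaly yields full segment credit.
y_true = np.array([0, 1, 1, 1, 1, 1, 0])
y_pred = np.array([0, 0, 0, 1, 0, 0, 0])
print(adjust_pa(y_true, y_pred))  # [0 1 1 1 1 1 0]
```

Under point-wise scoring of the adjusted labels, this single-point prediction receives perfect recall on the segment, which is exactly the incentive described above.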
In our evaluation method, we do not concern ourselves with the normal sequences; we evaluate the ability to find anomalies without looking at the normal class. Moreover, all anomalies are predicted as intervals with (start, end) timestamps of various lengths, which reduces the chance of exploiting single-point predictions as mentioned earlier.
Nevertheless, we appreciate the suggestion of including more metrics in our evaluation criteria, as we are always interested in having a comprehensive evaluation of all models. We also welcome their implementation in our evaluation sub-package if you would like to take the opportunity and contribute.
Please let me know if you have any further questions.