Suggestions for new evaluation metrics. #551

Open
richard-tang199 opened this issue Jun 4, 2024 · 1 comment
Labels: new feature, question

Comments


richard-tang199 commented Jun 4, 2024

Description

The "AER: Auto-Encoder with Regression for Time Series Anomaly Detection" paper states: "Anomaly scoring is based on overlapping segments: a true positive (TP) if a known anomalous window overlaps any detected windows, a false negative (FN) if a known anomalous window does not overlap any detected windows, and a false positive (FP) if a detected window does not overlap any known anomalous region."
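
For illustration, here is a minimal sketch of that overlap-based counting, assuming ground-truth and detected anomalies are given as (start, end) pairs; the function names are ours and not part of this library's API.

```python
# Sketch of overlap-based scoring: windows are (start, end) pairs.
# Names are illustrative only, not this library's actual API.

def overlaps(a, b):
    """Return True if two (start, end) windows overlap."""
    return a[0] <= b[1] and b[0] <= a[1]

def overlap_counts(true_windows, detected_windows):
    # TP: known anomalous windows that overlap at least one detected window
    tp = sum(any(overlaps(t, d) for d in detected_windows) for t in true_windows)
    # FN: known anomalous windows that overlap no detected window
    fn = len(true_windows) - tp
    # FP: detected windows that overlap no known anomalous window
    fp = sum(not any(overlaps(d, t) for t in true_windows) for d in detected_windows)
    return tp, fp, fn

tp, fp, fn = overlap_counts([(10, 20), (50, 60)], [(15, 18), (80, 90)])
print(tp, fp, fn)  # 1 1 1: one hit, one spurious detection, one missed anomaly
```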

We find that this library uses an evaluation method resembling point adjustment (PA), which can overestimate a model's performance. For details, please refer to thuml/Anomaly-Transformer#65 and the paper "Towards a Rigorous Evaluation of Time-Series Anomaly Detection" [1].

Please consider the recently proposed F1(PA%K) [1] and VUS-PR [2] metrics for a more rigorous evaluation.

[1] Siwon Kim, Kukjin Choi, Hyun-Soo Choi, Byunghan Lee, and Sungroh Yoon. 2022. Towards a Rigorous Evaluation of Time-Series Anomaly Detection. Proceedings of the AAAI Conference on Artificial Intelligence 36, 7 (June 2022), 7194–7201. https://doi.org/10.1609/aaai.v36i7.20680
[2] John Paparrizos, Paul Boniol, Themis Palpanas, Ruey S. Tsay, Aaron Elmore, and Michael J. Franklin. 2022. Volume under the Surface: A New Accuracy Evaluation Measure for Time-Series Anomaly Detection. Proceedings of the VLDB Endowment 15, 11 (July 2022), 2774–2787. https://doi.org/10.14778/3551793.3551830
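
For concreteness, below is a simplified sketch of the PA%K adjustment described in [1] for a single threshold K; it is our own illustration, not the authors' reference implementation. F1(PA%K) is then obtained by computing F1 on the adjusted predictions and averaging over a range of K values.

```python
import numpy as np

def pa_percent_k(y_true, y_pred, k=0.2):
    """Sketch of the PA%K adjustment from [1]: a ground-truth anomaly segment
    is fully credited (all of its points marked as detected) only when at
    least a fraction k of its points were detected in the first place."""
    adjusted = y_pred.copy()
    in_segment = False
    for i, label in enumerate(np.append(y_true, 0)):  # sentinel closes the last segment
        if label and not in_segment:
            start, in_segment = i, True
        elif not label and in_segment:
            if y_pred[start:i].mean() >= k:
                adjusted[start:i] = 1
            in_segment = False
    return adjusted

y_true = np.array([0, 1, 1, 1, 1, 0, 0, 1, 1, 0])
y_pred = np.array([0, 0, 1, 0, 0, 0, 0, 0, 0, 0])
print(pa_percent_k(y_true, y_pred, k=0.2))  # first segment credited (1/4 >= 0.2), second not
```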

@sarahmish (Collaborator) commented

Hi @richard-tang199 – thank you for raising this issue.

The AER paper adopts the overlapping evaluation method proposed in the NASA paper. This evaluation strategy has been used in many papers, and all the models in the benchmark are scored with the same strategy, so the comparison across models is consistent.

I would like to note that there are differences between the mentioned overlap strategy and point adjustment (PA). In PA, the model is rewarded for predicting even a single point as anomalous if it overlaps with a ground-truth anomaly. The model is therefore incentivized to predict single points as anomalies at various locations, hoping that one of them overlaps with a true anomaly. The false-positive penalty in this situation is not severe because the number of normal instances substantially outweighs the number of anomalies.
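
To make this concrete, here is a small sketch with made-up numbers (not our evaluation code) showing how plain point adjustment can turn a single correctly placed point into near-perfect point-wise F1:

```python
import numpy as np
from sklearn.metrics import f1_score

# Hypothetical series: one long anomaly in a mostly normal signal.
y_true = np.zeros(1000, dtype=int)
y_true[400:500] = 1                # a 100-point ground-truth anomaly

# A model that fires on a few scattered single points.
y_pred = np.zeros(1000, dtype=int)
y_pred[[50, 250, 450, 750]] = 1    # only index 450 falls inside the anomaly

# Point adjustment: one hit inside the segment credits the whole segment.
adjusted = y_pred.copy()
if y_pred[400:500].any():
    adjusted[400:500] = 1

print(round(f1_score(y_true, y_pred), 2))    # ≈0.02 without adjustment
print(round(f1_score(y_true, adjusted), 2))  # ≈0.99 after adjustment
```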
In our evaluation method, we do not concern ourselves with the normal sequences; we evaluate the ability to find anomalies without looking at the normal class. Moreover, all anomalies are predicted as intervals with (start, end) timestamps of varying lengths, which reduces the chance of exploiting single-point predictions as mentioned earlier.

Nevertheless, we appreciate the suggestion of including more metrics in our evaluation criteria, as we are always interested in a comprehensive evaluation of all models. We also welcome their implementation in our evaluation sub-package if you would like to take the opportunity and contribute.

Please let me know if you have any further questions.

@sarahmish changed the title from "Concerns about the evaluation metric." to "Suggestions for new evaluation metrics." on Jun 11, 2024
@sarahmish added the "question" and "new feature" labels on Jun 11, 2024