Here, we implemented an image-based interpretability analysis of adversarial examples, which links pixel-level perturbations to the class-determinative image regions localized by class activation mapping (CAM). The adversarial examples are generated with the method proposed in the paper "Adversarial Attack Type I: Cheat Classifiers by Significant Changes".
Type I attack: Generate an adversarial example that is significantly different from the original one in the view of the attacker, while the attacked classifier's prediction stays the same.
Generate the adversarial example x' for x from a supervised variational auto-encoder G:
x' = G(x),  s.t.  f₁(x') = f₁(x),  d(g₂(x), g₂(x')) ≫ ε
Type II attack: Generate false-negative examples, i.e. examples that change the attacked classifier's prediction while remaining close to the original under the reference measure.
Generate the adversarial example x' for x from a supervised variational auto-encoder G:
x' = G(x),  s.t.  f₁(x') ≠ f₁(x),  d(g₂(x), g₂(x')) ≤ ε
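A minimal PyTorch sketch of both objectives above, assuming a pretrained attacked classifier f1 (for f₁), a reference feature extractor g2 (for g₂, e.g. the encoder of the supervised VAE), and an adversarial example x_adv = G(x); the loss names, the margin term standing in for ≫ ε, and the MSE distance used for d(·, ·) are illustrative assumptions rather than the paper's exact formulation.

```python
import torch.nn.functional as F

def type1_loss(x, x_adv, f1, g2, margin=10.0):
    """Type I objective: keep f1's prediction unchanged while pushing
    the reference features g2(x_adv) far away from g2(x)."""
    y = f1(x).argmax(dim=1)                         # original prediction of the attacked classifier
    keep_label = F.cross_entropy(f1(x_adv), y)      # enforce f1(x') = f1(x)
    feat_dist = F.mse_loss(g2(x_adv), g2(x))        # stands in for d(g2(x), g2(x'))
    return keep_label + F.relu(margin - feat_dist)  # hinge: push the feature distance above the margin

def type2_loss(x, x_adv, f1, g2):
    """Type II objective: flip f1's prediction while keeping the
    reference features (the perceived content) nearly unchanged."""
    y = f1(x).argmax(dim=1)
    flip_label = -F.cross_entropy(f1(x_adv), y)     # maximize loss on the old label, so f1(x') != f1(x)
    feat_dist = F.mse_loss(g2(x_adv), g2(x))        # keep d(g2(x), g2(x')) small (<= epsilon)
    return flip_label + feat_dist
```

Since x_adv = G(x) comes from the supervised variational auto-encoder, terms like these would presumably be folded into the generator's objective rather than applied as a direct pixel-space perturbation.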
- Using a global average pooling (GAP) layer at the end of the network, instead of fully-connected layers, preserves the network's ability to localize class-specific image regions: the class activation map is the class-weighted sum of the last convolutional feature maps (see the sketch below).
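A minimal sketch of CAM on such a GAP-based network, using a torchvision ResNet-18 (last convolutional block followed by GAP and a single fully-connected layer) as a stand-in classifier; the perturbation_in_cam helper, which measures how much of the perturbation |x' - x| lands inside the high-activation CAM region, is a hypothetical illustration of the pixel-to-region linking step and not the repository's exact implementation.

```python
import torch
import torch.nn.functional as F
from torchvision import models  # assumes a recent torchvision with the weights API

# GAP-based classifier: ResNet-18 ends with global average pooling + one fc layer,
# which is exactly the structure CAM needs.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

feature_maps = {}
def _hook(module, inputs, output):
    feature_maps["last_conv"] = output            # (N, C, h, w) activations of the last conv block
model.layer4.register_forward_hook(_hook)

@torch.no_grad()
def class_activation_map(x, class_idx=None):
    """Return the CAM heat-map (at input resolution) for `class_idx`
    (the predicted class if None) of a single image x of shape (1, 3, H, W)."""
    logits = model(x)
    if class_idx is None:
        class_idx = logits.argmax(dim=1).item()
    fmap = feature_maps["last_conv"][0]           # (C, h, w)
    weights = model.fc.weight[class_idx]          # (C,) GAP -> fc weights of the chosen class
    cam = F.relu(torch.einsum("c,chw->hw", weights, fmap))
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    cam = F.interpolate(cam[None, None], size=x.shape[-2:],
                        mode="bilinear", align_corners=False)[0, 0]
    return cam, class_idx

@torch.no_grad()
def perturbation_in_cam(x, x_adv, cam, thresh=0.5):
    """Hypothetical linking step: fraction of the perturbation energy |x' - x|
    that falls inside the class-determinative (high-CAM) region."""
    delta = (x_adv - x).abs().sum(dim=1)[0]       # per-pixel perturbation magnitude, (H, W)
    mask = (cam >= thresh).float()                # binarized class-determinative region
    return (delta * mask).sum() / (delta.sum() + 1e-8)
```

Calling class_activation_map(x) and then perturbation_in_cam(x, x_adv, cam) gives one score per image; the CAM heat-map itself can also be overlaid on |x' - x| to inspect where the Type I and Type II perturbations concentrate.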