Interpreting-Adversarial-Examples-Via-CAM

Here we implement an image-based interpretation of adversarial examples that links pixel-level perturbations to the class-determinative image regions localized by class activation mapping (CAM). The adversarial examples are generated with the method proposed in the paper "Adversarial Attack Type I: Cheat Classifiers by Significant Changes".
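For reference, the CAM for a given class is obtained by weighting the last convolutional feature maps with the classifier weights that follow the global average pooling layer. The minimal PyTorch sketch below illustrates this; the tensor names `feats` and `fc_weight` and their shapes are illustrative assumptions, not the repository's actual code.

```python
import torch

# Hedged CAM sketch: weight the last conv feature maps by the linear classifier
# weights that follow global average pooling.
# Assumed shapes: feats (N, C, H, W), fc_weight (num_classes, C).

def class_activation_map(feats, fc_weight, class_idx):
    """Return an (N, H, W) map of class evidence, normalized to [0, 1]."""
    weights = fc_weight[class_idx]                     # (C,) weights for the class
    cam = torch.einsum("nchw,c->nhw", feats, weights)  # channel-weighted sum
    cam = torch.relu(cam)                              # keep positive evidence only
    cam = cam - cam.amin(dim=(1, 2), keepdim=True)     # min-max normalize per image
    cam = cam / cam.amax(dim=(1, 2), keepdim=True).clamp_min(1e-8)
    return cam
```

The resulting map can be upsampled to the input resolution and overlaid on both the clean and the adversarial image to compare which regions drive each prediction.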


Type I attack: Generate an adversarial example that is significantly different from the original in the attacker's view, yet keeps the classifier's original prediction (a false positive).

Generate an adversarial example x′ for x from a supervised variational auto-encoder G:

x' = G(x), 𝑠.𝑑.  𝑓1 (π‘₯β€²)  = 𝑓1 (π‘₯), 𝑑(𝑔2 (π‘₯), 𝑔2 (π‘₯β€²)) ≫ πœ€ 

Type II attack: Generate false-negative examples, i.e., adversarial examples that change the classifier's prediction while staying close to the original.

Generate an adversarial example x′ for x from a supervised variational auto-encoder G:

x' = G(x), 𝑠.𝑑.  𝑓1 (π‘₯β€² ) β‰  𝑓1 (π‘₯), 𝑑(𝑔2 (π‘₯), 𝑔2 (π‘₯β€²)) ≀ πœ€ 


1. Using a global average pooling (GAP) layer at the end of the network instead of fully-connected layers yields class-specific localization.