
Commit 46d2e23: Fixed title and figure size
1 parent ac5efb8

1 file changed: adversary-attacks.md (17 additions, 17 deletions)

## Adversarial Attacks
In class, we have seen examples where image classification models can be fooled by adversarial examples. By adding small and often imperceptible perturbations to images, these adversarial attacks can deceive a deep learning model into labeling the modified image as a completely different class. In this note, we will give an overview of the various types of adversarial attacks on machine learning models, discuss several representative methods for generating adversarial examples, and survey some possible defense methods and mitigation strategies against such attacks.

**Adversary**. In computer security, the term “adversary” refers to people or machines that attempt to penetrate or corrupt a computer network or system. In the context of machine learning and deep learning, adversaries can use a variety of attack methods to disrupt a machine learning model, and cause it to behave erratically (e.g. to misclassify a dog image as a cat image). In general, attacks can happen either during model training (known as a “poisoning” attack) or after the model has finished training (an “evasion” attack).

### Poisoning Attack vs. Evasion Attack
**Poisoning attack**. A poisoning attack involves polluting a machine learning model's training data. Such attacks take place during training, when an adversary presents intentionally mislabeled data to the model, instilling misleading knowledge that will eventually cause the model to make inaccurate predictions at test time. Poisoning attacks require that the adversary have access to the model’s training data and be able to inject misleading data into the training set. To make the attacks less noticeable, the adversary may decide to slowly introduce ill-intentioned samples over an extended period of time.

An example of a poisoning attack took place in 2016, when Microsoft launched a Twitter chatbot, Tay. Tay was designed to mimic the language patterns of an 18- to 24-year-old in the U.S. for entertainment purposes, to engage people through “casual and playful conversation”, and to learn from its conversations with human users on Twitter. However, soon after its launch, a vulnerability in Tay was exploited by some adversaries, who interacted with Tay using profane and offensive language. The attack caused the chatbot to learn and internalize inappropriate language. The more Tay engaged with adversarial users, the more offensive Tay’s tweets became. As a result, Tay was quickly shut down by Microsoft, only 16 hours after its launch.

**Evasion attack**. In an evasion attack, the adversary tries to evade or fool the system by adjusting malicious samples during the testing phase. Compared with poisoning attacks, evasion attacks are more common and easier to conduct. One main reason is that with evasion attacks, adversaries don’t necessarily need access to the training data, nor do they need to inject bad data into the model’s training process.

### White-box vs. Black-box Attacks
The evasion attacks discussed above occur during the testing phase of the model. The effectiveness of such attacks depends on the amount of information available to the adversary about the model. Before we dive into the various methods for generating adversarial examples, let’s first briefly discuss the differences between white-box attacks and black-box attacks, and understand the various adversarial goals.

**White-box attack**. In a white-box attack, the adversary is assumed to have total knowledge about the model, such as the model architecture, the number of layers, the weights of the final trained model, etc. The adversary also has knowledge of the model’s training process, such as the optimization algorithm used (e.g. Adam, RMSProp, etc.), the data that the model is trained on, the distribution of the training data, and the model’s performance on the training data. It can be very dangerous if the adversary is able to identify regions of the feature space where the model has a high error rate, and use that information to construct adversarial examples and exploit the model. The more the adversary knows, the more severe the attack can be, and the more serious its consequences.

**Black-box attack**. By contrast, in a black-box attack, the adversary is assumed to have no knowledge of the model. Instead of constructing adversarial examples based on prior knowledge, the adversary probes the model by providing a series of carefully crafted inputs and observing its outputs. Through trial and error, the attack may eventually succeed in misleading the model into making wrong predictions.

### Adversarial Goals
The goals of adversarial attacks can be broadly categorized as follows:
- **Confidence Reduction**. The adversary aims to reduce the model’s confidence in its predictions, which does not necessarily lead to a wrong class output. For example, due to the adversarial attack, a model which originally classifies an image of a cat with high probability ends up outputting a lower probability for the same image and class pair.
- **Untargeted Misclassification**. The adversary tries to misguide the model into predicting any of the incorrect classes. For example, when presented with an image of a cat, the model outputs any non-cat class (e.g. dog, airplane, computer, etc.).
- **Targeted Misclassification**. The adversary tries to misguide the model into outputting a particular class other than the true class. For example, when presented with an image of a cat, the model is forced to classify it as a dog, where the output class of dog is specified by the adversary.

Generally speaking, targeted misclassification is more difficult to achieve than untargeted misclassification, which is in turn more difficult than confidence reduction.
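
To make the last two goals concrete, they are often written as constraints on a perturbation $$\delta$$ with $$\|\delta\|_\infty \le \epsilon$$ applied to an input $$x$$ with true label $$y$$, for a classifier $$f$$ (this notation is introduced here for illustration and is not used elsewhere in this note):

$$ \text{untargeted: } f(x + \delta) \neq y, \qquad \text{targeted: } f(x + \delta) = y_{\text{target}}, \; y_{\text{target}} \neq y $$
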

## Adversarial Examples
In CS231n, we mainly focus on examples of adversarial images in the context of image classification. In an adversarial attack, the adversary attempts to modify the original input image by adding some carefully crafted perturbations, which can cause the image classification model to yield mispredictions. Oftentimes, the generated perturbations are either too small to be visually identified by human eyes, or small enough that humans consider them to be harmless, random noise. And yet, these perturbations can be “meaningful” and misleading to the image classification model. Below, we discuss two methods to generate adversarial examples.

<p align="center">
<img src="/assets/adversary-1.png" width="450">
<div class="figcaption">An example of an adversarial attack, in which a tiny amount of carefully crafted perturbation leads to misclassification. Here the perturbations are so small that they only become visible to humans after being magnified about 30 times.</div>
</p>


### Fast Gradient Sign Method (FGSM)
The simplest yet highly efficient algorithm for generating adversarial examples is the Fast Gradient Sign Method (FGSM), a single-step attack on images. Proposed by Goodfellow et al. in 2014, FGSM combines a white-box approach with a misclassification goal. A small perturbation is first generated in the direction of the sign of the gradient of the loss with respect to the input image. The generated perturbation is then added to the original image, resulting in an adversarial image. The equation for an untargeted attack using FGSM is given by:

$$ adv\_x = x + \epsilon \cdot \text{sign}(\nabla_x J(\theta, x, y)) $$
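
In practice, FGSM is often applied iteratively with a small step size rather than in a single step. Below is a minimal PyTorch-style sketch of such an iterative, untargeted loop (a single-step attack corresponds to `steps=1` with `alpha=eps`); `model`, `eps`, `alpha`, and `steps` are illustrative placeholders, and the details are a sketch rather than a reference implementation.

```python
import torch
import torch.nn.functional as F

def iterative_fgsm(model, x, y, eps=0.03, alpha=0.005, steps=10):
    """Repeatedly nudge the input in the direction of the sign of the loss
    gradient, stopping early once the model no longer predicts the true label."""
    adv_x = x.clone().detach()
    for _ in range(steps):
        adv_x.requires_grad_(True)
        loss = F.cross_entropy(model(adv_x), y)
        grad, = torch.autograd.grad(loss, adv_x)
        with torch.no_grad():
            adv_x = adv_x + alpha * grad.sign()                    # FGSM step
            adv_x = torch.max(torch.min(adv_x, x + eps), x - eps)  # stay within the eps-ball around x
            adv_x = torch.clamp(adv_x, 0.0, 1.0)                   # keep a valid pixel range
            if model(adv_x).argmax(dim=1).ne(y).all():             # early stopping: batch is fooled
                break
    return adv_x.detach()
```
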
In the sketch above, we also employ early stopping and exit the iterative procedure once the model is fooled. This helps minimize the amount of perturbation added, and may also improve efficiency by reducing the time needed to fool the model. With the iterative approach, we can also obtain additional adversarial examples by running the procedure for more iterations.

<p align="center">
<img src="/assets/adversary-2.png" width="550">
<div class="figcaption">An example of a “successful” adversarial attack in which the image classifier recognized a watermelon as a tomato. In this case, although the goal of misclassification is achieved, the unmagnified perturbations are large enough to be perceived by human eyes.</div>
</p>

In practice, FGSM attacks work particularly well for network architectures that favor linearity, such as logistic regression, maxout networks, LSTMs, and networks that use the ReLU activation function. While ReLU is non-linear, when ϵ is sufficiently small the ReLU activation does not change the sign of the gradient with respect to the original image, and thus will not prevent the pixels of the image from being nudged in the direction that maximizes the loss. The authors of FGSM stated that switching to more nonlinear model families, such as RBF networks, confers a significant reduction in a model’s vulnerability to adversarial examples.

### One-Pixel Attack
In order to fool a machine learning model, the Fast Gradient Sign Method discussed above requires many pixels of the original image to be changed, if only by a little. As shown in the example image above, the modifications can sometimes be excessive (i.e. the number of modified pixels is fairly large), such that they become visually identifiable to human eyes. One may then wonder whether it is possible to modify fewer pixels while still keeping the model fooled. The answer is yes. In 2019, a method for generating one-pixel adversarial perturbations was proposed, in which an adversarial example can be generated by modifying just one pixel.

The one-pixel attack uses differential evolution to find out which pixel is to be changed, and how. Differential evolution (DE) is a type of evolutionary algorithm (EA): a population-based optimization algorithm for solving complex optimization problems. Specifically, during each iteration a set of candidate solutions (children) is generated from the current population (parents). The candidate solutions are then compared with their corresponding parents, surviving if they are better candidates according to some criterion. The process repeats until a stopping criterion is met.
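
For intuition, here is a generic NumPy sketch of the classic DE/rand/1/bin scheme; the hyperparameters and the `fitness` function are illustrative assumptions, not the exact parameterization used in the one-pixel attack paper. Lower fitness is treated as better.

```python
import numpy as np

def differential_evolution(fitness, bounds, pop_size=20, F=0.5, CR=0.9, generations=100):
    """Generic DE/rand/1/bin minimizer: lower fitness scores are better."""
    dim = len(bounds)
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    pop = lo + np.random.rand(pop_size, dim) * (hi - lo)     # random initial parents
    scores = np.array([fitness(p) for p in pop])
    for _ in range(generations):
        for i in range(pop_size):
            idx = np.random.choice([j for j in range(pop_size) if j != i], 3, replace=False)
            a, b, c = pop[idx]
            mutant = np.clip(a + F * (b - c), lo, hi)        # mutation
            cross = np.random.rand(dim) < CR                 # crossover mask
            cross[np.random.randint(dim)] = True             # guarantee at least one mutated gene
            child = np.where(cross, mutant, pop[i])
            child_score = fitness(child)
            if child_score < scores[i]:                      # greedy selection against the parent
                pop[i], scores[i] = child, child_score
    return pop[np.argmin(scores)]                            # best candidate found
```

In the one-pixel setting, each candidate solution encodes a pixel location together with its new color value, and the fitness can be taken as the probability the model assigns to the true class, so that minimizing it pushes the model away from the correct label.
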

By using differential evolution, the one-pixel attack gains several advantages. Since DE doesn’t use gradient information for optimization, the objective function is not required to be differentiable, unlike with classical optimization methods such as gradient descent. Calculating gradients also requires much more information about the model under attack, so not needing gradient information makes the attack more feasible to conduct. Finally, it’s worth noting that the one-pixel attack is a type of black-box attack, which assumes no information about the classification model; it is sufficient to simply observe the model’s output probabilities.

## Defense Strategies against Adversarial Examples
Having discussed some techniques for generating adversarial examples, we now turn our attention to possible defense strategies against such adversarial attacks. While we go through each of the countermeasures, it’s worth keeping in mind that none of them acts as a panacea for all challenges. Moreover, implementing such defense strategies may incur extra performance costs.

### Adversarial Training
One of the most intuitive and effective defenses against adversarial attacks is adversarial training. The idea of adversarial training is to incorporate adversarial samples into the model training stage, and thus increase model robustness. In other words, since we know that the original training process leads to models that are vulnerable to adversarial examples, we simply also train on adversarial examples so that the models acquire some “immunity” to them.

To perform adversarial training, the defender simply generates a large number of adversarial examples and includes them in the training data. At training time, the model is trained to assign the same label to an adversarial example as to the original example. For example, upon seeing an adversarially perturbed training image whose original label is cat, the model should learn that the correct label for the perturbed image is still cat.
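
A minimal sketch of what a single adversarial-training step could look like, assuming a PyTorch classifier `model`, an `optimizer`, and a clean batch `(x, y)`; the single-step FGSM crafting and the clean/adversarial mixing are illustrative choices, not a prescribed recipe.

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, eps=0.03):
    """One training step on a batch augmented with FGSM adversarial examples
    that keep their original (clean) labels."""
    # Craft single-step FGSM examples from the current model.
    x_ = x.clone().detach().requires_grad_(True)
    loss_adv = F.cross_entropy(model(x_), y)
    grad, = torch.autograd.grad(loss_adv, x_)
    adv_x = torch.clamp(x_ + eps * grad.sign(), 0.0, 1.0).detach()

    # Train on clean + adversarial inputs, assigning the adversarial copies the clean labels.
    optimizer.zero_grad()
    loss = F.cross_entropy(model(torch.cat([x, adv_x])), torch.cat([y, y]))
    loss.backward()
    optimizer.step()
    return loss.item()
```
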

The problem with adversarial training is that it is only effective in defending the model against the same attacks used to craft the examples originally included in the training pool. In black-box attacks, adversaries only need to find one crack in a system’s defenses for an attack to go through. It is quite possible that the attack method employed by the adversary was not anticipated by the defender at the time of model training, leaving the adversarially trained model vulnerable to unseen attacks.

### Defensive Distillation
Introduced in 2015 by Papernot et al., defensive distillation uses the idea of distillation and knowledge transfer to reduce the effectiveness of adversarial samples against deep neural networks. The term distillation was originally proposed as a way to transfer knowledge from a large neural network to a smaller one. Doing so can help reduce the computational complexity of deep neural networks and facilitate the deployment of deep learning models on resource-constrained devices. In defensive distillation, instead of transferring knowledge between models of different architectures, the knowledge is extracted from a model to improve its own resilience to adversarial examples.

Let's assume we are training a neural network for image classification tasks, and the network is designed with a softmax layer as the output layer. The key point in distillation is the addition of a **temperature parameter T** to the softmax operation:
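
$$ F_i(X) = \frac{e^{z_i(X)/T}}{\sum_{l=0}^{N-1} e^{z_l(X)/T}}, \qquad i \in \{0, \ldots, N-1\} $$

where $$z_i(X)$$ are the logits produced for input $$X$$, $$N$$ is the number of classes, and $$T$$ is the temperature. A larger $$T$$ produces a softer, more uniform probability vector, while $$T = 1$$ recovers the standard softmax.
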
Defensive distillation is a two-step process. First, we train an initial network $$F$$ on data $$X$$. In this step, instead of letting the network output hard class labels, we take the probability vectors produced by the softmax layer. The benefit of using class probabilities (i.e. soft labels) instead of hard labels is that, in addition to merely providing a sample’s correct class, probabilities also encode the relative differences between classes. Next, we use the probability vectors from the initial network as labels to train another, distilled network $$F’$$ with the same architecture on the same training data $$X$$. During training, it is important to set the temperature parameter $$T$$ of both networks to a value larger than 1. After training is completed, we use the distilled network $$F’$$ with $$T$$ set to 1 to make predictions at test time.
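
A compact sketch of the second step, under the assumption that the initial network (`teacher`, playing the role of $$F$$) has already been trained at temperature `T`, and that `student` is the distilled network $$F’$$ with the same architecture; the temperature value and training details are illustrative.

```python
import torch
import torch.nn.functional as F

def distillation_step(teacher, student, optimizer, x, T=20.0):
    """Fit the student network to the teacher's soft labels, with both
    softmax operations taken at temperature T (step two of the procedure)."""
    with torch.no_grad():
        soft_labels = F.softmax(teacher(x) / T, dim=1)     # soft labels from the initial network
    optimizer.zero_grad()
    log_probs = F.log_softmax(student(x) / T, dim=1)       # distilled network at the same temperature
    loss = -(soft_labels * log_probs).sum(dim=1).mean()    # cross-entropy against soft labels
    loss.backward()
    optimizer.step()
    return loss.item()

# At test time, the distilled network is used with T = 1, i.e. the ordinary softmax.
```
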

<p align="center">
<img src="/assets/defensive-distillation.png" width="600">
<div class="figcaption">Defense mechanism based on a transfer of knowledge contained in probability vectors through distillation.</div>
</p>

In the original paper, the Jacobian of $$F$$ with respect to the input $$X$$ is derived explicitly; its expression involves the normalization term of the temperature softmax, $$ g(X) = \sum_{l=0}^{N-1} e^{z_l(X)/T} $$, and its amplitude can be seen to be inversely proportional to the temperature value. At test time, although $$T$$ is set to a relatively small value of 1, the model’s sensitivity to small variations and perturbations will not be affected, since the weights learned at training time remain unchanged, and decreasing the temperature only makes the class probability vector more discrete without changing the relative ordering of the classes.

### Denoising
Since adversarial examples are images with added perturbations (i.e. noise), one straightforward defense strategy is to have some mechanism for denoising the adversarial samples. There are two broad approaches to denoising: input denoising and feature denoising. Input denoising attempts to partially or fully remove the adversarial perturbations from the input images, whereas feature denoising aims to alleviate the effects of adversarial perturbations on high-level features.

In their study, Chow et al. proposed a method for input denoising with ensembles of denoisers. The intuition behind using ensembles for denoising is that there are various ways for adversaries to generate and add perturbations to images, and no single denoiser is guaranteed to be effective across all corruption methods: a denoiser that excels at removing some types of noise may perform poorly on others. Therefore, it is often helpful to employ an ensemble of diverse denoisers instead of relying on a single one.
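
As a toy illustration of input denoising with an ensemble (not the trained denoisers or the verification mechanism of Chow et al.), one could run the input through several simple filters and aggregate the classifier’s predictions; `classify` is an assumed function returning a class-probability vector for an (H, W, C) image.

```python
import numpy as np
from scipy.ndimage import median_filter, gaussian_filter, uniform_filter

def ensemble_denoise_predict(classify, image):
    """Apply an ensemble of simple denoisers to the image and average the
    class-probability vectors returned by `classify` over the denoised copies."""
    denoisers = [
        lambda img: median_filter(img, size=(3, 3, 1)),
        lambda img: gaussian_filter(img, sigma=(1.0, 1.0, 0.0)),
        lambda img: uniform_filter(img, size=(3, 3, 1)),
    ]
    probs = np.stack([classify(d(image)) for d in denoisers])   # one probability vector per denoiser
    return probs.mean(axis=0).argmax()                          # aggregate and pick the consensus class
```
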
For feature denoising as a defense strategy, one study was conducted by Xie et al., in which the authors incorporated denoising blocks at intermediate layers of a convolutional neural network. The authors argued that the adversarial perturbation of the features gradually increases as an image propagates through the network, eventually causing the model to make wrong predictions. Therefore, it can be helpful to add denoising blocks at intermediate layers of the network to combat feature noise.

<p align="center">
<img src="/assets/denoising-block.png" width="300">
<div class="figcaption">A denoising block added at an intermediate layer of a convolutional neural network to suppress adversarial perturbations in the feature maps.</div>
</p>

The input to a denoising block can be any feature layer in the convolutional neural network. In the study, each denoising block performs one of the following denoising operations: non-local means, bilateral filter, mean filter, or median filter. These are techniques commonly used in image processing and denoising. The denoising blocks are trained jointly with all layers of the network in an end-to-end manner using adversarial training. In the experiments, denoising blocks were added to variants of ResNet models. The results showed that the proposed denoising method achieved 55.7 percent accuracy under white-box attacks on ImageNet, whereas the previous state of the art achieved only 27.9 percent.
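
For concreteness, here is a rough PyTorch sketch of a residual denoising block built around a mean filter; the 1x1 convolution and the skip connection follow the general design described by Xie et al., but the kernel size and exact placement are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MeanFilterDenoisingBlock(nn.Module):
    """Denoise a feature map with a local mean filter, project it with a 1x1
    convolution, and add the result back to the input through a skip connection."""
    def __init__(self, channels, kernel_size=3):
        super().__init__()
        self.kernel_size = kernel_size
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x):
        denoised = F.avg_pool2d(x, self.kernel_size, stride=1,
                                padding=self.kernel_size // 2)   # mean filter over each feature map
        return x + self.proj(denoised)                           # residual connection keeps the original signal
```
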

## References
[Explaining and harnessing adversarial examples](https://arxiv.org/abs/1412.6572)
[One pixel attack for fooling deep neural networks](https://arxiv.org/abs/1710.08864)
[Distillation as a Defense to Adversarial Perturbations against Deep Neural Networks](https://arxiv.org/abs/1511.04508)
[Denoising and Verification Cross-Layer Ensemble Against Black-box Adversarial Attacks](https://arxiv.org/abs/1908.07667)
[Feature Denoising for Improving Adversarial Robustness](https://arxiv.org/abs/1812.03411)
[Adversarial Attacks and Defences: A Survey](https://arxiv.org/abs/1810.00069)

## Additional Resources
[Adversarial Robustness - Theory and Practice (NeurIPS 2018 tutorial)](https://adversarial-ml-tutorial.org/)
[CleverHans - A Python Library on Adversarial Examples](https://github.com/cleverhans-lab/cleverhans)
