Learn2Perturb: an End-to-end Feature Perturbation Learning to Improve Adversarial Robustness

2020·Arxiv

Abstract

While deep neural networks have been achieving state-of-the-art performance across a wide variety of applications, their vulnerability to adversarial attacks limits their widespread deployment for safety-critical applications. Alongside other adversarial defense approaches being investigated, there has been a very recent interest in improving adversarial robustness in deep neural networks through the introduction of perturbations during the training process. However, such methods leverage fixed, pre-defined perturbations and require significant hyper-parameter tuning that makes them very difficult to leverage in a general fashion. In this study, we introduce Learn2Perturb, an end-to-end feature perturbation learning approach for improving the adversarial robustness of deep neural networks. More specifically, we introduce novel perturbation-injection modules that are incorporated at each layer to perturb the feature space and increase uncertainty in the network. This feature perturbation is performed at both the training and the inference stages. Furthermore, inspired by Expectation-Maximization, an alternating back-propagation training algorithm is introduced to train the network and noise parameters consecutively. Experimental results on CIFAR-10 and CIFAR-100 datasets show that the proposed Learn2Perturb method can result in deep neural networks which are 4-7% more robust on FGSM and PGD adversarial attacks and significantly outperform the state-of-the-art against the C&W attack and a wide range of well-known black-box attacks.

1. Introduction

The vulnerability of DNN models to adversarial examples has raised major concerns [1, 7, 9, 25, 31] about their large-scale adoption in a wide variety of applications.

Adversarial attacks can be divided into two categories, black-box and white-box attacks, based on the level of information available to the attacker. Black-box attacks usually perform queries on the model, and they have partial information regarding the data and the structure of the targeted model [16, 28]. On the other hand, white-box attacks have a better understanding of the model they attack; therefore, they are more powerful than black-box attacks [14, 34]. This understanding might vary between different white-box attack algorithms; nonetheless, the gradients of the model's loss function with respect to the input data are the most common information utilized to modify input samples and generate adversarial examples. First-order white-box adversaries are the most common attacking algorithms; they use only first-order gradients [6, 24, 25, 34] to craft the adversarial perturbation.

In the realm of defense mechanisms, approaches like distillation [26, 29], feature denoising [35], and adversarial training [11, 24] have been proposed to resolve the vulnerability of DNNs to adversarial attacks. Adversarial training is considered a very intuitive yet very promising solution to improve the robustness of DNN models against adversarial attacks.

Madry et al. [24] illustrated that adversarial learning using Projected Gradient Descent (PGD) to generate on-the-fly adversarial samples during training can lead to trained models which provide robustness guarantees against all first-order adversaries. They experimentally showed that adversarial examples generated with PGD from many random starts within a ball around the original sample all have approximately the same loss value when fed to the network as input. Based on this observation, they provide the guarantee that as long as the attack algorithm is a first-order adversary, the local maxima of the loss value would not be significantly better than those found by PGD.

Applying regularization techniques is another approach to train more robust network models [5]. To do so, either new loss functions were proposed with added or embedded regularization terms (i.e., adversarial generalization) [10, 15, 33] or the network was augmented with new modules [14, 20, 21, 22] for regularization purposes, making the network more robust at the end.

Randomization approaches, and specifically random noise injection [14, 21, 22], have recently been proposed as network augmentation methods to address adversarial robustness in deep neural networks. A random noise generator is embedded in the network architecture as an extra module that adds random noise to the input or the output of layers. Although the noise usually follows a Gaussian distribution for its simplicity, it is possible to use different noise distributions. This noise augmentation technique adds more uncertainty to the network and makes the adversarial attack optimization harder, which improves the robustness of the model.

While the noise injection technique has shown promising results, determining the parameters of the distribution and how to add the noise values to the network is still challenging. The majority of the methods proposed in the literature [20, 21, 22, 37] manually select the parameters of the distribution. However, He et al. [14] recently proposed a new algorithm in which the noise distributions are learned in the training step. Their proposed Parametric Noise Injection (PNI) technique injects trainable noise into the activations or the weights of the CNN model. The problem with the PNI technique is that the noise parameters tend to converge to zero as the training progresses, making the noise injection progressively less effective over time. This problem is partially compensated through the utilization of PGD adversarial training as suggested by Madry et al. [24], but the decreasing trend of noise parameter magnitudes still remains and thus limits the overall effect of PNI.

In this paper, the Learn2Perturb framework, an end-to-end feature perturbation learning approach, is proposed to improve the robustness of DNN models. An alternating back-propagation strategy is introduced where the following two steps are performed in an alternating manner: i) the network parameters are updated in the presence of feature perturbation injection to improve adversarial robustness, and ii) the parameters of the perturbation injection modules are updated to strengthen perturbation capabilities against the improved network. Decoupling these two steps helps both sets of parameters (i.e., network parameters and perturbation injection modules) to be trained to their full functionalities and produces a more robust network. Our contributions can be summarized as follows:

• A highly efficient and stable end-to-end learning mechanism is introduced to learn the perturbation-injection modules to improve the model robustness against adversarial attacks. The proposed alternating back-propagation method, inspired by the Expectation-Maximization (EM) concept, trains the network and noise parameters gradually and consecutively, without any significant parameter-tuning effort.

• A new effective regularizer is introduced to help the network learning process by smoothly improving the noise distributions. Combining this regularizer with PGD adversarial training helps the proposed Learn2Perturb algorithm achieve state-of-the-art performance.

• Exhaustive experiments are conducted for various white-box and black-box adversarial attacks on CIFAR-10 and CIFAR-100 datasets, and new state-of-the-art performances are reported for these algorithms.

The paper is organized as follows: Section 2 provides a discussion of related work in terms of different adversarial attacks; the proposed Learn2Perturb approach is presented in Section 3; and experimental results and discussion are presented in Section 4, followed by a conclusion.

2. Related Work

The gradients of the loss function with respect to the input data are very common information used by adversarial attack algorithms. In this type of approach, the algorithms try to maximize the loss value of the network by crafting minimal perturbations of the input data.

Fast Gradient Sign Method (FGSM) [34] is the simplest yet a very efficient white-box attack. For a DNN parametrized with $W$ (i.e., where the network is encoded as $f(\cdot; W)$) and loss function $L$, for any input $x$ with ground-truth label $t$, the FGSM attack computes the adversarial example as:

$$x' = x + \epsilon \cdot \mathrm{sign}\big(\nabla_x L(f(x; W), t)\big) \tag{1}$$

where $\epsilon$ determines the attack strength and $\mathrm{sign}(\cdot)$ returns the sign tensor for a given tensor. Using this gradient ascent step, FGSM tries to locally maximize the loss function $L$.
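As an illustration of the single gradient-ascent step used by FGSM, the following minimal sketch applies it to a toy logistic-regression classifier whose input gradient is available in closed form. The model, its weights, the input, and the helper names (`loss_and_input_grad`, `fgsm`) are assumptions made for this example; they are not the networks or code used in the paper.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_input_grad(x, w, b, t):
    """Binary cross-entropy of a toy logistic model and its gradient w.r.t. x."""
    p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    loss = -(t * math.log(p) + (1 - t) * math.log(1 - p))
    grad_x = [(p - t) * wi for wi in w]  # dL/dx for logistic regression
    return loss, grad_x

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(x, w, b, t, eps):
    """One step of size eps along the sign of the input gradient."""
    _, grad_x = loss_and_input_grad(x, w, b, t)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad_x)]

# Hypothetical toy model and input.
w, b = [2.0, -1.0], 0.0
x, t = [0.5, 0.2], 1
x_adv = fgsm(x, w, b, t, eps=0.1)

clean_loss, _ = loss_and_input_grad(x, w, b, t)
adv_loss, _ = loss_and_input_grad(x_adv, w, b, t)
```

In this toy setting the perturbed input stays within 0.1 of the original in every coordinate, and its loss is higher than the clean loss, which is the local effect the attack aims for.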

The FGSM approach is extended by projected gradient descent (PGD) [19, 24] where, for a number of $k$ iterations, PGD produces $x^{t+1} = \mathrm{bound}_{l_p}(\mathrm{FGSM}(x^{t}), x^{0})$, in which $x^{0}$ is the original input and $0 \le t < k$. Using projection, $\mathrm{bound}_{l_p}(x', x^{0})$ simply ensures that $x'$ is within a specified $l_p$ range of the original input $x^{0}$.

Madry et al. [24] illustrated that different PGD attack restarts, each with a random initialization of the input within the $\epsilon$-ball around x, find different local maxima with very similar loss values. Based on this finding, they claimed that PGD is a universal first-order adversary.
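The iterate-then-project structure of PGD with a random start can be sketched as follows. This is a toy illustration only: it assumes an $l_\infty$ ball, reuses a hypothetical two-feature logistic model with a closed-form input gradient, and the names `bound_linf` and `pgd` are introduced here for the example. The values 8/255, 0.01, and 7 mirror the attack settings reported later in the experimental setup.

```python
import math
import random

def grad_x(x, w, b, t):
    """Input gradient of the cross-entropy loss for a toy logistic model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-z))
    return [(p - t) * wi for wi in w]

def bound_linf(x_new, x0, eps):
    """Project x_new back into the l_inf ball of radius eps around x0."""
    return [min(max(xn, xo - eps), xo + eps) for xn, xo in zip(x_new, x0)]

def pgd(x0, w, b, t, eps, step, k, rng):
    # Random initialization inside the ball, as in the restarts described above.
    x = [xo + rng.uniform(-eps, eps) for xo in x0]
    for _ in range(k):
        g = grad_x(x, w, b, t)
        # FGSM-style signed step followed by projection.
        x = [xi + step * ((gi > 0) - (gi < 0)) for xi, gi in zip(x, g)]
        x = bound_linf(x, x0, eps)
    return x

rng = random.Random(0)
x0 = [0.5, 0.2]
x_adv = pgd(x0, w=[2.0, -1.0], b=0.0, t=1, eps=8 / 255, step=0.01, k=7, rng=rng)
```

Because the toy model is linear in its input, the gradient sign never changes and every restart ends at the same corner of the ball; for a deep network, different restarts generally reach different local maxima.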

The C&W attack [6] is another strong first-order attack algorithm, which finds the perturbation $\delta$ added to input $x$ by solving the optimization problem formulated as:

$$\min_{\delta} \big[\, \|\delta\|_p + c \cdot h(x + \delta) \,\big] \quad \text{s.t.} \quad x + \delta \in [0, 1]^n \tag{2}$$

where $p$ denotes the norm distance. While $p$ can be any arbitrary number, C&W is most effective when $p = 2$; as such, here, we only consider the $l_2$-norm for C&W evaluations. Moreover, $h(\cdot)$ encodes the objective function driving the perturbed sample to be misclassified (ideally $h(\cdot) \le 0$), and $c$ is a constant balancing the two terms involved in (2). It is worth noting that all the white-box attacks explained here (i.e., FGSM, PGD, and C&W) are first-order adversaries.
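To make the two terms of the C&W objective concrete, the sketch below evaluates it for a given candidate perturbation. The text does not spell out the misclassification term, so the margin used here (true-class logit minus the best other logit, floored at minus the confidence) is an assumption based on the commonly used untargeted C&W formulation; the function names and the toy two-class `logits_fn` are likewise hypothetical.

```python
import math

def cw_margin(logits, true_label, kappa=0.0):
    """Assumed untargeted margin: non-positive once the sample is misclassified,
    and floored at -kappa so larger kappa demands a more confident mistake."""
    best_other = max(z for i, z in enumerate(logits) if i != true_label)
    return max(logits[true_label] - best_other, -kappa)

def cw_objective(delta, logits_fn, x, true_label, c, kappa=0.0):
    """l2 norm of the perturbation plus c times the misclassification margin."""
    x_adv = [xi + di for xi, di in zip(x, delta)]
    l2 = math.sqrt(sum(d * d for d in delta))
    return l2 + c * cw_margin(logits_fn(x_adv), true_label, kappa)

# Hypothetical two-class model: logits are +/- (x[0] - x[1]).
def logits_fn(x):
    return [x[0] - x[1], x[1] - x[0]]

x, label = [1.0, 0.0], 0
no_attack = cw_objective([0.0, 0.0], logits_fn, x, label, c=1.0)
attack = cw_objective([-0.6, 0.6], logits_fn, x, label, c=1.0)
attack_conf = cw_objective([-0.6, 0.6], logits_fn, x, label, c=1.0, kappa=0.3)
```

An actual attack would minimize this quantity over the perturbation with an optimizer and search over c; the sketch only evaluates the objective.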

Black-box attacks can only access a model via queries, sending inputs and receiving the corresponding outputs to estimate the inner workings of the network. To fool a network, the well-known black-box attacks either use surrogate networks [16, 28] or estimate the gradients [8, 32] via multiple queries to the targeted network.

In the surrogate network approach, a new network mimicking the behavior of the target model [28] is trained. Attackers perform queries on the target model and generate a synthetic dataset from the query inputs and associated outputs. A surrogate network is then trained on this dataset. Recent works [23, 28] showed that adversarial examples fooling the surrogate model can also fool the target model with a high success rate. A simple variant of the surrogate model attack, the transferability attack [27], arises when the surrogate model has access to the same training data as the target network. Adversarial examples fooling the substitute network usually transfer to (fool) the target model as well. Since substitute networks may not always be successful [8, 16], black-box gradient estimation attacks deal only with the target model itself. Zeroth order optimization (ZOO) [8] and attacks altering only a few pixels [32] are examples of this kind of black-box attack, to name a few.
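The transfer setting can be illustrated with two toy models: an adversarial example is crafted with white-box access to a source (surrogate) model only and is then evaluated on a separate target model. Both logistic models, their weights, and the helper names below are hypothetical values chosen for this sketch.

```python
def logit(x, w, b):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def predict(x, w, b):
    return 1 if logit(x, w, b) > 0 else 0

def fgsm_on_source(x, t, w_src, eps):
    """FGSM step using only the source model.

    For a logistic model dL/dx = (p - t) * w, and (p - t) is negative for t = 1
    and positive for t = 0, so the gradient sign depends only on t and sign(w).
    """
    direction = -1 if t == 1 else 1
    return [xi + eps * direction * ((wi > 0) - (wi < 0)) for xi, wi in zip(x, w_src)]

# Hypothetical source (surrogate) and target models with similar decision rules.
w_src, b_src = [2.0, -1.0], 0.0
w_tgt, b_tgt = [1.5, -1.2], 0.0

x, t = [0.3, 0.1], 1
x_adv = fgsm_on_source(x, t, w_src, eps=0.2)

clean_pred = predict(x, w_tgt, b_tgt)      # target prediction on the clean input
adv_pred = predict(x_adv, w_tgt, b_tgt)    # target prediction on the transferred example
```

Here the example crafted on the source model also flips the target model's prediction, because the two toy models have similar decision boundaries; the transfer success rate on real networks depends on how closely the surrogate matches the target.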

3. Methodology

In this work, we propose a new framework called Learn2Perturb for improving the adversarial robustness of a deep neural network through end-to-end feature perturbation learning. Although it has been illustrated both theoretically and practically [2, 30] that randomization techniques can improve the robustness of deep neural networks, there is still no effective way to select the distribution of the noise in the neural networks. In Learn2Perturb, trainable perturbation-injection modules are integrated into a deep neural network with the goal of injecting customized perturbations into the feature space at different parts of the network, increasing the uncertainty of its inner workings in an optimal manner. We formulate the joint problem of learning the model parameters and the perturbation distributions of the perturbation-injection modules in an end-to-end learning framework via an alternating back-propagation approach [12]. As shown in Figure 1, the proposed alternating back-propagation strategy for the joint learning of the network parameters and the perturbation-injection modules is inspired by the EM technique, and it comprises two key alternating steps: i) Perturbation-injected network training: the network parameters are trained by gradient descent while the proposed perturbation-injection modules add layer-wise noise to the feature maps (at different locations in the network). Noise injection parameters are fixed during this step. ii) Perturbation-injection module training: the parameters of the perturbation-injection modules are updated via gradient descent, based on the regularization term added to the network loss function, while the network parameters are fixed.

The effect of using such a training strategy is that in step (i), the model minimizes the loss function of the classification problem while noise is being injected into multiple layers, and the model learns how to classify despite the injected perturbations. In step (ii), the noise parameters are updated with a combination of network gradients and the regularization term applied to these parameters. The goal of this step is to let the network react to the noise injections via gradient descent and to pose a bigger challenge to the network via a smooth increase of noise driven by the regularizer. The trained perturbation-injection modules perturb the feature layers of the model in the inference phase as well.

3.1. Perturbation-Injection Distribution

Given the observable variables X and W as the input and the set of weights in the neural network, respectively, the goal is to model the neural network as a probabilistic model such that the output of the model, Y, is drawn from a distribution rather than produced by a deterministic function. A probabilistic output is more robust against adversarial perturbation. As such, Y can be formulated as:

$$Y \sim P(X; W, \theta) \tag{3}$$

where $W$ and $\theta$ denote the sets of network and noise parameters, respectively, and X is the input fed into the network. The output Y is drawn from a distribution determined by $W$ and the set of independent parameters, $\theta$.

For a given layer $l$ of the neural network, the perturbation-injection modules can be used to achieve the following probability model for the layer's final activations:

$$P_l(X_l; W_l, \theta_l) = f_l(X_l, W_l) + Q(\theta_l) \tag{4}$$

where $f_l(X_l, W_l)$ represents the activation of layer $l$ with weights $W_l$ and $X_l$ as its input, and $Q(\theta_l)$ is a noise distribution with parameters $\theta_l$ that is added to the layer activations. Here, $Q$ is a scaled standard Gaussian:

$$Q(\theta_l) = \theta_l \cdot \mathcal{N}(0, 1) \tag{5}$$

The parameter $\theta_l$ scales the magnitude of the output from the normal distribution $\mathcal{N}(0, 1)$, encoding the standard deviation of the distribution $Q(\cdot)$. Substituting the right-hand side of $Q(\theta_l)$ defined in (5) into (4) enforces $P_l$ to follow a Gaussian distribution with mean $f_l(X_l, W_l)$ and standard deviation $\theta_l$:

$$P_l(X_l; W_l, \theta_l) \sim \mathcal{N}\big(f_l(X_l, W_l),\, \theta_l^2\big) \tag{6}$$

Figure 1. Overview of Learn2Perturb: During training, an alternating back-propagation strategy is introduced where the following two steps are performed in an alternating manner: i) the network parameters are updated in the presence of feature perturbation injection to improve adversarial robustness, and ii) the parameters of the perturbation injection modules are updated to strengthen perturbation capabilities against the improved network. The learned perturbation injection modules can be added to some or all tensors in the network to inject perturbations in feature space for two-pronged adversarial robustness: i) improve robustness during training when training under perturbation injection, and ii) increase network uncertainty through inference-time perturbation injection to make it difficult to learn an adversarial attack.

This new probabilistic formulation of layer activations can be extended to the whole network, so instead of a deterministic output Y, the network outputs $Y \sim P(X; W, \theta)$, with $W$ and $\theta$ denoting the parameters of all layers.
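A perturbation-injection module of the kind described above can be sketched as a small callable that adds independently drawn, scaled standard-normal noise to each element of a layer's activations. This is a minimal stdlib illustration; the class name `PerturbationInjection`, the flat-list feature representation, and the initial scale of 0.1 are assumptions for the example rather than details taken from the paper.

```python
import random

class PerturbationInjection:
    """Adds theta_j * N(0, 1) noise to the j-th element of a feature vector."""

    def __init__(self, num_features, init_theta=0.1, seed=None):
        # One noise scale (standard deviation) per feature element.
        self.theta = [init_theta] * num_features
        self.rng = random.Random(seed)

    def __call__(self, features):
        # Each element receives an independent draw, used in training and inference.
        return [f + th * self.rng.gauss(0.0, 1.0)
                for f, th in zip(features, self.theta)]

# Usage: perturb the (hypothetical) activations of one layer.
layer_noise = PerturbationInjection(num_features=3, init_theta=0.1, seed=0)
activations = [0.2, -1.0, 0.7]
perturbed = layer_noise(activations)
```

Each output element is then distributed as a Gaussian centred on the clean activation with standard deviation equal to the corresponding scale, so a scale of zero recovers the deterministic layer.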

Having this new formulation for a deep neural network, a proper training process to effectively learn both sets of these parameters is highly desired. To this end, we propose a new training mechanism to learn both network parameters and perturbation-injection modules in an alternating back-propagation approach.

3.2. Alternating Back-Propagation

The proposed neural network structure comprises two sets of parameters, $W$ and $\theta$, which are trained given training samples (X, T) as the input and the ground-truth output of the network. However, these two sets of parameters are in conflict with each other and try to push the learning process in two opposite directions. Given the probabilistic representation $P(X; W, \theta)$, $W$ maps the input X to the output T based on the mean of the distribution $P$, while the set of $\theta$ improves the generalization of the model by including perturbations in the training mechanism.

The proposed alternating back-propagation framework decouples the learning process associated with the network parameters $W$ and the perturbation-injection distributions $Q$ to effectively update both sets of parameters. To this end, the network parameters and perturbation-injection modules are updated in a consecutive manner.

The training process of the proposed Learn2Perturb is done within two main steps:

• Perturbation-injected network training: the parameters of the network, $W$, are updated via gradient descent to decrease the network loss in the presence of perturbations caused by the currently fixed perturbation-injection distribution, $Q$.

• Perturbation-injection distribution training: the parameters of the perturbation-injection distribution, $\theta$, are updated while the set of parameters $W$ is fixed, to improve the generalization of the network and, as a result, improve its robustness against adversarial perturbation.

These two steps are performed consecutively; however, the number of iterations for each step before moving to the next step can be determined based on the application.

Utilizing a generic loss function in the training of the network when the perturbation-injection modules are embedded forces the noise parameters to converge to zero and eventually removes the effect of the perturbation-injection distributions by making them very small. In other words, the neural network with a generic loss tends to learn $P(X; W, \theta)$ as a Dirac distribution where $\theta$ is close to zero; to prevent this problem, a new regularization term is designed and added to the loss function. As such, the new loss function can be formulated as:

$$\arg\min_{W, \theta} \Big[ L\big(P(X; W, \theta), T\big) + \gamma \cdot g(\theta) \Big] \tag{7}$$

where $L(\cdot)$ is the classification loss function (usually cross-entropy) such that the set of parameters $W$ needs to be tuned to generate the associated output of the input X. The function $g(\theta)$ is the regularizer enforcing a smooth increase in the parameters $\theta = \{\theta_{l,j}\}$, where $\theta_{l,j}$ denotes the $j$th noise parameter in the $l$th layer, corresponding to an element of the output feature map. $K$ and $M_l$ represent the number of layers and the number of noise parameters in layer $l$, respectively. $\gamma$ is the hyper-parameter balancing the two terms in the optimization. Independent distributions are learnt for the perturbation-injection modules in each layer. The regularizer function should have an annealing characteristic, so that the perturbation-injection distributions are gradually improved and converge, and the parameters $W$ can thus be trained effectively. As such, the regularization function is formulated as:

$$g(\theta) = -\frac{1}{\tau} \sum_{l=1}^{K} \sum_{j=1}^{M_l} \theta_{l,j}^{1/2} \tag{8}$$

where $\tau$ is the output of a harmonic series given the current epoch value in the training process. Using a harmonic series to determine $\tau$ gradually decreases the effect of the regularizer function in the loss and lets the neural network converge. While the square root of $\theta_{l,j}$ keeps the expression easy to differentiate, it also provides a slower rate of change for larger values of $\theta_{l,j}$, which helps the network converge to a steady state smoothly.
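The annealed square-root regularizer and its gradient can be sketched as follows. Two points are assumptions of this example: the annealing factor is taken to be the partial sum of the harmonic series up to the current 1-based epoch (the text only states that it is the output of a harmonic series), and all noise parameters are taken to be strictly positive. The function names are hypothetical.

```python
import math

def harmonic(epoch):
    """Assumed annealing factor: 1 + 1/2 + ... + 1/epoch for a 1-based epoch."""
    return sum(1.0 / i for i in range(1, epoch + 1))

def regularizer(theta_layers, epoch):
    """Negative sum of square roots of all noise parameters, divided by tau.

    Minimizing this term pushes the noise parameters upward, and the push
    weakens as tau grows over the epochs.
    """
    tau = harmonic(epoch)
    return -sum(math.sqrt(t) for layer in theta_layers for t in layer) / tau

def regularizer_grad(theta_layers, epoch):
    """Gradient w.r.t. each noise parameter: -1 / (2 * tau * sqrt(theta))."""
    tau = harmonic(epoch)
    return [[-1.0 / (2.0 * tau * math.sqrt(t)) for t in layer]
            for layer in theta_layers]

# Two hypothetical layers with one and two noise parameters.
theta_layers = [[4.0], [1.0, 9.0]]
g_epoch1 = regularizer(theta_layers, epoch=1)
g_epoch2 = regularizer(theta_layers, epoch=2)
grads = regularizer_grad(theta_layers, epoch=1)
```

The gradient magnitude shrinks both with larger noise parameters and with later epochs, which matches the smooth, slowing increase described above.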

As seen in Algorithm 1, the perturbation-injection distributions Q and network parameters W are first initialized. Then the model parameters W are updated based on the classification loss $L(\cdot)$, and this loss function is minimized in the presence of the perturbation-injection modules. Next, the perturbation-injection distributions Q are updated by performing the "perturbation-injection module training" step.
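The alternation between the two update steps can be illustrated on a deliberately tiny problem: a one-weight regression model with a single noise scale. This is a sketch under strong assumptions and not the paper's Algorithm 1; the squared-error loss, the learning rates, the regularization weight, the clamp that keeps the noise scale positive, and the use of one update of each kind per epoch are all choices made for the example.

```python
import math
import random

def harmonic(epoch):
    return sum(1.0 / i for i in range(1, epoch + 1))

def train(epochs=100, lr_w=0.1, lr_theta=0.05, gamma=0.1, seed=0):
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(64)]
    ts = [2.0 * x for x in xs]            # toy targets: t = 2x
    w, theta = 0.0, 0.1                   # network weight and noise scale
    for epoch in range(1, epochs + 1):
        # Step (i): update w with theta fixed; noise is injected in the forward pass.
        grad_w = 0.0
        for x, t in zip(xs, ts):
            y = w * x + theta * rng.gauss(0.0, 1.0)
            grad_w += 2.0 * (y - t) * x
        w -= lr_w * grad_w / len(xs)

        # Step (ii): update theta with w fixed, using the loss gradient
        # plus the gradient of the annealed square-root regularizer.
        tau = harmonic(epoch)
        grad_theta = 0.0
        for x, t in zip(xs, ts):
            noise = rng.gauss(0.0, 1.0)
            y = w * x + theta * noise
            grad_theta += 2.0 * (y - t) * noise
        grad_theta = grad_theta / len(xs) - gamma / (2.0 * tau * math.sqrt(theta))
        theta = max(theta - lr_theta * grad_theta, 1e-3)  # keep the scale positive
    return w, theta

w, theta = train()
```

In this toy run the weight settles near the true slope while the noise scale stays strictly positive, since the regularizer counteracts the loss gradient that would otherwise shrink it toward zero.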

One of the main advantages of this approach is that since the learning process of these two sets of parameters is decoupled, the training process can be easily performed without a large amount of manual hyper-parameter tweaking compared to other randomized state-of-the-art approaches. Moreover, the proposed method can help the model to converge faster as the perturbation-injection distributions are continuously improved during the training process.

3.3. Model Setup, Training and Inference

Perturbation-injection distributions are added to the network at different locations, specifically after each convolution operation, to create a new network model based on the Learn2Perturb framework. As shown in Figure 1, these modules generate perturbations with the same size as the feature activation maps of that specific layer. Each perturbation-injection distribution is independent, and therefore the generated perturbation value for each feature is drawn independently.

In the training phase, the model parameters and the perturbation-injection distributions are trained in an iterative and consecutive manner based on the proposed alternating back-propagation approach. It is worth mentioning that the model parameters are trained for 20 epochs before activating the perturbation distributions to help the network parameters converge to a good initial point. After 20 epochs, alternating back-propagation is applied to train both the model parameters and the perturbation-injection distributions. Furthermore, we take advantage of adversarial training, which adds on-the-fly adversarial examples to the training data, to improve the model's robustness against perturbations more effectively. As such, PGD adversarial training is incorporated to provide stronger guarantee bounds against all first-order adversaries optimizing in $l_\infty$ space.

The perturbation-injection distributions are applied in the inference step as well. This introduces a dynamic nature into the inference process and, as a result, makes it harder for adversaries to find optimal adversarial examples to fool the network.

4. Experiments

To illustrate the effectiveness of the proposed Learn2Perturb, we train various models using this framework and evaluate their robustness against different adversarial attack algorithms. Furthermore, the proposed method is compared with different state-of-the-art approaches including PGD adversarial training [24] (also denoted as Vanilla model), Parametric Noise Injection (PNI) [14], Adversarial Bayesian Neural Network (AdvBNN) [22], Random Self-Ensemble (RSE) [21] and PixelDP (DP) [20].

4.1. Dataset & Adversarial Attacks

For evaluation purposes, the CIFAR-10 and CIFAR-100 datasets [18] are utilized for training and evaluating the networks. Both of these datasets contain 50,000 training images and 10,000 test images, all natural color images of $32 \times 32$ pixels; however, CIFAR-10 has 10 classes with 6000 images per class, while CIFAR-100 has 100 classes with 600 images per class.

Different white-box and black-box attacks are utilized to evaluate the proposed Learn2Perturb along with state-of-the-art methods. The competing algorithms are evaluated via white-box attacks including FGSM [34], PGD [19] and C&W attacks [6]. The One-Pixel attack [32] and the Transferability attack [27] are utilized as the black-box attacks to evaluate the competing methods.

4.2. Experimental Setup

We use ResNet-based architectures [13] as the baseline for our experiments; the classical ResNet architecture (i.e., ResNet-V1 and its variations) and the new ResNet architecture (i.e., ResNet-V2) are used for evaluation. The main difference between the two architectures is the number of stages and the number of blocks in each stage. Moreover, average pooling is utilized for down-sampling in the ResNet-V1 architecture, while ResNet-V2 uses CNN layers for this purpose. Following the experimental setup proposed in [14], data normalization is done by adding a non-trainable layer at the beginning of the network, and the adversarial perturbations are added directly to the original input data, before normalization is applied. Both the adversarial training and the robustness testing setups follow the configurations introduced in [24] and [14]. Adversarial training with PGD and robustness testing against PGD are both done in 7 iterations with a maximum $l_\infty$ perturbation of 8/255 (i.e., $\epsilon = 8/255$) and step sizes of 0.01 for each iteration. The FGSM attack also uses the same 8/255 limit for perturbation. For C&W attack, we use ADAM [17] optimizer with learning rate . Maximum number of iterations is 1000, and for the constant c in (2) we choose the range to ; furthermore, to find the value of c, binary search with up to 9 steps is performed. The confidence parameter $\kappa$ of the C&W attack, which turns out to have a big effect when evaluating defense approaches involving randomization, takes values ranging from 0 to 5.

In the case of transferability attacks, a PGD adversarially trained network (i.e., a vanilla model) is used as the source network for generating adversarial examples, and these adversarial samples are then utilized to attack competing models. For one/few-pixel attacks, we consider the case of {1, 2, 3}-pixel attacks in this work.

4.3. Experimental Results

To evaluate the proposed Learn2Perturb framework, the method is compared with a PGD adversarially trained model (also denoted as Vanilla). The proposed module is evaluated on three different ResNet architectures. Table 1 shows the effectiveness of the proposed Learn2Perturb method in improving the robustness of different network architectures. As seen, the proposed perturbation-injection modules provide robust performance on both 'ResNet-V1' (with 20 and 56 layers) and 'ResNet-V2' (18 layers) architectures against both FGSM and PGD attacks, which illustrates the effectiveness of the proposed module in providing more robust network architectures. Furthermore, the evaluation results for the no-defense approach (a network without any improvement) are provided as a reference point.

We also evaluate a variation of the proposed Learn2Perturb framework (i.e., Learn2Perturb-R) in which we analyze a different approach to performing the two steps of "perturbation-injected network training" and "perturbation-injection module training". In this variation, the perturbation-injection modules are updated using only the regularizer function $g(\theta)$, and network gradients are not used to update the noise parameters.

As can be seen in Table 1, taking advantage of both the network gradient and the regularizer performs better than taking only the regularizer into account. One reason for this outcome is that Learn2Perturb allows the gradient of the loss function to update the perturbation-injection modules. This lets the network react to the perturbations when it cannot tolerate the injected noise, and updates the perturbation-injection modules more frequently. Nonetheless, the results in Table 1 show that Learn2Perturb-R still outperforms other proposed methods in adversarial robustness, though it suffers from lower clean data accuracy.

4.4. Robustness Comparison

Table 1. Evaluating the effectiveness of the proposed perturbation-injection modules by comparing against the adversarial training algorithm (Vanilla) within the proposed framework and its variation (Learn2Perturb-R).

In this section, to further illustrate the effectiveness of the proposed Learn2Perturb framework, we compare Learn2Perturb with PNI [14] and Adv-BNN [22], two state-of-the-art randomization approaches for improving the robustness of deep neural networks. Table 2 reports these comparison results for different network architectures varying in network depth and capacity. We examine the effect of different network depths, including ResNet-V1(20), ResNet-V1(32), ResNet-V1(44) and ResNet-V1(56), along with the effect of network width, by increasing the number of filters in ResNet-V1(20), which results in ResNet-V1(20)[1.5], ResNet-V1(20)[2] and ResNet-V1(20)[4]. As seen, while the competing methods do not provide consistent performance when the capacity of the network is increased (increasing depth or width), the proposed framework provides consistent robustness across different network capacities.

The reported results in Table 2 show that while PNI provides a minor boost in network accuracy on clean data, the proposed Learn2Perturb method performs with much higher accuracy when the input data is perturbed with adversarial noise. The main reason for this phenomenon is that PNI reaches a very low level of noise perturbation during training, as the loss function tries to remove the effect of the perturbation by driving the noise parameters to zero. The results demonstrate that the proposed Learn2Perturb algorithm outperforms the PNI method by 4-7% on both FGSM and PGD adversarial attacks. The proposed method is also compared with Adv-BNN [22]. Results show that while Adv-BNN can provide robust networks in some cases compared to PNI, it does not scale when the network width is increased, and the performance of the networks drops drastically. This illustrates one of the drawbacks of Bayesian approaches: they need to be designed carefully for each network architecture separately.

Figure 2. Analyzing the effectiveness of the proposed method compared to state-of-the-art algorithms on different $\epsilon$ values for the FGSM attack.

Figure 3. Evaluating the robustness of the proposed Learn2Perturb compared with other state-of-the-art methods across different $\epsilon$ values for the PGD attack.

It has been shown that there is no guarantee that methods robust against attacks under one norm ($l_2$ or $l_\infty$) would provide the same level of robustness against attacks under the other [2]. Araujo et al. [2] illustrated experimentally that randomization techniques trained with $l_\infty$ attacks can improve the robustness against $l_2$ attacks as well. In this work, we further validate this finding. In order to provide more powerful attacks challenging the effect of randomization, we apply C&W attacks with different confidence values, $\kappa$. The parameter $\kappa$ requires the misclassification objective in (2) to be at most $-\kappa$ rather than simply at most 0. As seen in Table 3, for bigger values of $\kappa$ the success rate of the C&W attack increases; nonetheless, our proposed method outperforms the other competing methods by a big margin for all values of $\kappa$.

Table 4 shows the comparison results for the proposed method and state-of-the-art approaches in providing a robust network model on the CIFAR-10 dataset. The proposed Learn2Perturb method outperforms other state-of-the-art methods and provides a more robust network model with better performance when dealing with the PGD attack.

We also analyze the effectiveness of the proposed method in dealing with different adversarial noise levels. To this end, the ResNet-V2(18) architecture is utilized for all competing methods. The network architectures are designed and trained via four different competing methods, and the trained networks are examined with both FGSM and PGD attacks over a range of $\epsilon$ values.

Table 2. The effect of network capacity on the performance of the proposed method and other state-of-the-art algorithms. The proposed Learn2Perturb is compared with the Parametric Noise Injection (PNI) method [14] and Adv-BNN [22]. Results show the effectiveness of the proposed Learn2Perturb algorithm in training robust neural network models. To have a fair comparison, we evaluated methods on different network sizes and capacities. Results are reported with standard deviation because of the randomness involved in these methods.

Table 3. Comparison results of the proposed Learn2Perturb and competing methods on C&W [6] attack. Confidence

Table 4. Comparison results of the proposed Learn2Perturb and state-of-the-art methods in providing a robust network model. Some of the numbers are extracted from [14]. The reported results are either based on the maximum accuracy achieved in the literature or on our own results if we achieved a higher level of accuracy.

Figure 2 demonstrates the robustness of the four competing methods against the FGSM adversarial attack. As seen, while increasing $\epsilon$ decreases the robustness of all trained networks, the network designed and trained by the proposed Learn2Perturb approach outperforms the other methods across all adversarial noise values ($\epsilon$).
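To make the role of the adversarial noise level concrete, the following minimal sketch applies the standard one-step FGSM update, moving each input value by the noise level in the direction of the sign of the loss gradient. The input values, the gradient, and the three noise levels are hypothetical placeholders, not the settings used in these experiments.

```python
def sign(v):
    """Return -1, 0, or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(x, grad, eps, lo=0.0, hi=1.0):
    """One-step FGSM: shift each value by eps along the gradient sign, then clip to [lo, hi]."""
    return [min(hi, max(lo, xi + eps * sign(gi))) for xi, gi in zip(x, grad)]

x = [0.2, 0.5, 0.9]       # hypothetical pixel values in [0, 1]
grad = [0.3, -1.2, 0.0]   # hypothetical loss gradient with respect to x

# Larger noise levels allow larger per-pixel changes (l_inf distance).
for eps in (2 / 255, 8 / 255, 16 / 255):
    x_adv = fgsm(x, grad, eps)
    print(eps, max(abs(a - b) for a, b in zip(x_adv, x)))
```

Every perturbed value stays within the chosen noise level of the original, which is why sweeping that level gives a robustness curve like the one described above.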

To confirm the results shown in Figure 2, the same experiment is conducted to examine the robustness of the trained networks on PGD attack. While the PGD attack is more powerful in fooling the networks, results show that the network designed and trained by the proposed Learn2Perturb framework still outperforms other state-of-the-art approaches.

4.5. Expectation over Transformation (EOT)

Athalye et al. [3] showed that many of the defense algorithms that inject randomization into the network's interior layers, or apply random transformations to the input before feeding it to the network, achieve robustness through false stochastic gradients. They further stated that these methods obfuscate the gradients that attackers utilize to perform iterative attacking optimizations. As such, they proposed the EOT attack (originally introduced in [4]) to evaluate these types of defense mechanisms. They showed that the false gradients cannot protect the network when the attack uses gradients that are the expectation over a series of transformations.

Since our Learn2Perturb algorithm and the other competing methods involve randomization, the tested algorithms in this study are evaluated via the EOT attack method as well. To do so, following [30], at every iteration of the PGD attack the gradient is obtained as an expectation estimated by a Monte Carlo method with 80 simulations of different transformations. Results show that the network trained via PNI provides 48.65% robustness, compared to Adv-BNN, which provides 51.19% robustness, on the CIFAR-10 dataset against this attack. The experimental results illustrate that the proposed Learn2Perturb approach produces a model that achieves 53.34% robustness and outperforms the other two state-of-the-art algorithms.
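The averaging step described above can be sketched on a toy problem. In this hedged example, the "network" is a scalar loss $(x + z)^2$ with Gaussian noise $z$, standing in for a randomized model; the function names, the toy loss, and the step and budget values are assumptions made for illustration only. The only detail taken from the text is the use of 80 Monte Carlo samples per PGD iteration.

```python
import random

def noisy_grad(x, rng, sigma=1.0):
    """Gradient of the toy loss (x + z)^2 w.r.t. x for one noise draw z ~ N(0, sigma^2)."""
    z = rng.gauss(0.0, sigma)
    return 2.0 * (x + z)

def eot_grad(x, rng, n_samples=80):
    """Monte Carlo estimate of the expected gradient over n_samples noise draws."""
    return sum(noisy_grad(x, rng) for _ in range(n_samples)) / n_samples

def eot_pgd_step(x, x0, rng, eps, step):
    """One PGD ascent step on the averaged gradient, projected onto [x0 - eps, x0 + eps]."""
    g = eot_grad(x, rng)
    x_new = x + step * ((g > 0) - (g < 0))
    return min(x0 + eps, max(x0 - eps, x_new))

rng = random.Random(0)
x0 = 1.0
x = x0
for _ in range(10):
    x = eot_pgd_step(x, x0, rng, eps=0.1, step=0.02)
print(x)
```

Averaging many stochastic gradients reduces the variance that randomized defenses introduce, so the attack follows the expected gradient direction rather than a single noisy draw.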

It is worth mentioning that the experimental results showed that neither the proposed Learn2Perturb method nor the other competing approaches studied in this work suffer from obfuscated gradients. Furthermore, the proposed Learn2Perturb method successfully passes the five metrics introduced in [3], which further illustrates that Learn2Perturb is not subject to obfuscated gradients.

5. Conclusion

In this paper, we proposed Learn2Perturb, an end-to-end feature perturbation learning approach for improving the adversarial robustness of deep neural networks. Learned perturbation-injection modules are introduced to increase uncertainty during both training and inference, making it harder to craft successful adversarial attacks. A novel alternating back-propagation approach is also introduced to learn both network parameters and perturbation-injection module parameters in an alternating fashion. Experimental results on a range of black-box and white-box attacks demonstrated the efficacy of the proposed Learn2Perturb algorithm, which outperformed the state-of-the-art methods in improving robustness against different adversarial attacks. Future work involves extending the proposed modules to inject a greater diversity of perturbation types for better generalization in terms of adversarial robustness.

References

[1] Naveed Akhtar and Ajmal Mian. Threat of adversarial attacks on deep learning in computer vision: A survey. IEEE Access, 6:14410–14430, 2018. 1

[2] Alexandre Araujo, Rafael Pinot, Benjamin Negrevergne, Laurent Meunier, Yann Chevaleyre, Florian Yger, and Jamal Atif. Robust neural networks using randomized adversarial training. arXiv preprint arXiv:1903.10219, 2019. 3, 7

[3] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018. 8

[4] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust adversarial examples. arXiv preprint arXiv:1707.07397, 2017. 8

[5] Alberto Bietti, Grégoire Mialon, Dexiong Chen, and Julien Mairal. A kernel perspective for regularizing deep neural networks. In International Conference on Machine Learning, pages 664–674, 2019. 2

[6] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy (SP), pages 39–57. IEEE, 2017. 1, 2, 6, 8

[7] Nicholas Carlini and David Wagner. Audio adversarial examples: Targeted attacks on speech-to-text. In 2018 IEEE Security and Privacy Workshops (SPW), pages 1–7. IEEE, 2018. 1

[8] Pin-Yu Chen, Huan Zhang, Yash Sharma, Jinfeng Yi, and Cho-Jui Hsieh. Zoo: Zeroth order optimization based black-box attacks to deep neural networks without training substitute models. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 15–26. ACM, 2017. 3

[9] Minhao Cheng, Jinfeng Yi, Huan Zhang, Pin-Yu Chen, and Cho-Jui Hsieh. Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples. arXiv preprint arXiv:1803.01128, 2018. 1

[10] Gamaleldin Elsayed, Dilip Krishnan, Hossein Mobahi, Kevin Regan, and Samy Bengio. Large margin deep networks for classification. In Advances in neural information processing systems, pages 842–852, 2018. 2

[11] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. 1

[12] Tian Han, Yang Lu, Song-Chun Zhu, and Ying Nian Wu. Alternating back-propagation for generator network. In Thirty-First AAAI Conference on Artificial Intelligence, 2017. 3

[13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. 6, 11

[14] Zhezhi He, Adnan Siraj Rakin, and Deliang Fan. Parametric noise injection: Trainable randomness to improve deep neural network robustness against adversarial attack. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 588–597, 2019. 1, 2, 6, 7, 8, 12

[15] Matthias Hein and Maksym Andriushchenko. Formal guarantees on the robustness of a classifier against adversarial manipulation. In Advances in Neural Information Processing Systems, pages 2266–2276, 2017. 2

[16] Andrew Ilyas, Logan Engstrom, Anish Athalye, and Jessy Lin. Black-box adversarial attacks with limited queries and information. arXiv preprint arXiv:1804.08598, 2018. 1, 3

[17] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014. 6

[18] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. 2009. 6, 12

[19] Alexey Kurakin, Ian Goodfellow, and Samy Bengio. Adversarial machine learning at scale. arXiv preprint arXiv:1611.01236, 2016. 2, 6

[20] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. arXiv preprint arXiv:1802.03471, 2018. 2, 6, 8

[21] Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pages 369–385, 2018. 2, 6, 8

[22] Xuanqing Liu, Yao Li, Chongruo Wu, and Cho-Jui Hsieh. Adv-bnn: Improved adversarial defense through robust bayesian neural network. arXiv preprint arXiv:1810.01279, 2018. 2, 6, 7, 8

[23] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial examples and black-box attacks. arXiv preprint arXiv:1611.02770, 2016. 3

[24] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. arXiv preprint arXiv:1706.06083, 2017. 1, 2, 6, 7, 8, 11, 12

[25] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, and Pascal Frossard. Deepfool: a simple and accurate method to fool deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2574–2582, 2016. 1

[26] Nicolas Papernot and Patrick McDaniel. On the effectiveness of defensive distillation. arXiv preprint arXiv:1607.05113, 2016. 1

[27] Nicolas Papernot, Patrick McDaniel, and Ian Goodfellow. Transferability in machine learning: from phenomena to black-box attacks using adversarial samples. arXiv preprint arXiv:1605.07277, 2016. 3, 6, 12

[28] Nicolas Papernot, Patrick McDaniel, Ian Goodfellow, Somesh Jha, Z Berkay Celik, and Ananthram Swami. Practical black-box attacks against machine learning. In Proceedings of the 2017 ACM on Asia conference on computer and communications security, pages 506–519. ACM, 2017. 1, 3

[29] Nicolas Papernot, Patrick McDaniel, Xi Wu, Somesh Jha, and Ananthram Swami. Distillation as a defense to adversarial perturbations against deep neural networks. In 2016 IEEE Symposium on Security and Privacy (SP), pages 582–597. IEEE, 2016. 1

[30] Rafael Pinot, Laurent Meunier, Alexandre Araujo, Hisashi Kashima, Florian Yger, Cédric Gouy-Pailler, and Jamal Atif. Theoretical evidence for adversarial robustness through randomization: the case of the exponential family. arXiv preprint arXiv:1902.01148, 2019. 3, 8, 11

[31] Mahmood Sharif, Sruti Bhagavatula, Lujo Bauer, and Michael K Reiter. Accessorize to a crime: Real and stealthy attacks on state-of-the-art face recognition. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security, pages 1528–1540. ACM, 2016. 1

[32] Jiawei Su, Danilo Vasconcellos Vargas, and Kouichi Sakurai. One pixel attack for fooling deep neural networks. IEEE Transactions on Evolutionary Computation, 2019. 3, 6, 12, 13

[33] Shizhao Sun, Wei Chen, Liwei Wang, Xiaoguang Liu, and Tie-Yan Liu. On the depth of deep neural networks: A theoretical view. In Thirtieth AAAI Conference on Artificial Intelligence, 2016. 2

[34] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013. 1, 2, 6

[35] Cihang Xie, Yuxin Wu, Laurens van der Maaten, Alan L Yuille, and Kaiming He. Feature denoising for improving adversarial robustness. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 501–509, 2019. 1

[36] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016. 11, 12

[37] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P Xing, Laurent El Ghaoui, and Michael I Jordan. Theoretically principled trade-off between robustness and accuracy. arXiv preprint arXiv:1901.08573, 2019. 2, 12

Supplementary Material

A. Detailed Analysis

Here we provide a more detailed analysis of the experiments evaluating the proposed method and other tested algorithms with regards to their theoretical background, training, and evaluation.

A.1. Embedding perturbation-injection modules in a network

Generally, perturbation-injection modules can be embedded after the activation of each layer. However, for the ResNet baselines we choose to add them only to the output of every block, before the ReLU activation. We do this to reduce the number of trainable parameters and to shorten the training and inference times. Nevertheless, any other setup can be used as well.
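The placement described above can be sketched as follows. This is a toy illustration on plain Python lists, not the authors' implementation: it assumes additive zero-mean Gaussian noise with one scale per feature (named `theta` here), and `residual_fn` is a hypothetical stand-in for the convolutional branch of a residual block.

```python
import random

def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def perturbed_block(x, residual_fn, theta, rng):
    """Residual block whose output is perturbed before the final ReLU.

    Assumption: noise is additive, zero-mean Gaussian, scaled per feature by theta.
    """
    out = [a + b for a, b in zip(x, residual_fn(x))]  # skip connection + residual branch
    noisy = [o + t * rng.gauss(0.0, 1.0) for o, t in zip(out, theta)]
    return relu(noisy)

rng = random.Random(0)
branch = lambda v: [0.5 * a for a in v]  # hypothetical residual branch
features = [1.0, -2.0, 0.3]

print(perturbed_block(features, branch, theta=[0.0, 0.0, 0.0], rng=rng))  # no perturbation
print(perturbed_block(features, branch, theta=[0.1, 0.1, 0.1], rng=rng))  # perturbed output
```

Placing one module per block rather than one per layer keeps the number of noise parameters proportional to the number of blocks, which is the saving the text refers to.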

A.2. Behaviour of noise distributions in PNI vs Learn2Perturb

As stated in Section 3.2, the noise parameters trained by the PNI approach fluctuate during training because of the loss function. The min-max optimization applied in that methodology drives the noise parameters toward zero as the number of training epochs increases. As such, it is crucial to select the right number of epochs in the training step.

This issue is addressed in the proposed Learn2Perturb algorithm by introducing a new regularization term in the loss function. As a result, there is a trade-off between training a proper perturbation-injection distribution and modeling accuracy during the training step. This trade-off lets the perturbation modules learn properly and eventually converge to a steady state. To this end, a harmonic series term is introduced in the proposed regularization term, which decreases the effect of regularization as the number of training epochs increases and helps the perturbation-injection modules converge.

Figure 4 shows the behaviour of the noise distributions in both the PNI and Learn2Perturb algorithms during training. As seen, the proposed Learn2Perturb algorithm handles the noise distributions properly; as a result, the noise distribution parameters keep being trained as the number of training epochs increases until they converge to a steady state. However, the noise distributions are forced to zero for the model trained via the PNI algorithm due to the way the loss function is formulated.

A.3. Theoretical Background

It has been illustrated by Pinot et al. [30] that randomizing a deep neural network can improve the robustness of the model against adversarial attacks. A deep neural network M is a probabilistic mapping when it maps each input in X to a probability distribution over Y; to obtain a numerical output of this probabilistic mapping, one needs to sample y according to M(x).

The robustness of the probabilistic mapping M(x) is stated in terms of its PC-Risk, which is defined with respect to a metric/divergence on p(Y). If M(x) follows an Exponential family distribution, it is possible to define an upper bound for the robustness of the model based on the magnitude of the perturbation.

Figure 4. Evolution of the mean of the noise perturbation parameters through training epochs for ResNet V2. As seen, while the noise distributions grow in the Learn2Perturb algorithm, they converge close to zero in the PNI method.

A.4. Detailed Experimental Setup

To encourage reproducibility of the experimental results, in this section we provide a detailed explanation of the experimental setup and environment of the reported experiments. PyTorch version 1.2 was used for developing all experiments, and our code will be open-sourced upon the acceptance of this paper.

Following the observation made by Madry et al. [24], network capacity alone can help increase the robustness of models against adversarial attacks. As such, we compare Learn2Perturb and competing state-of-the-art methods for various networks with different capacities.

The ResNet [13] architecture was selected as the baseline network, following the state-of-the-art methods and because of the fast convergence of this network. The effect of network depth was evaluated by examining the competing methods via ResNet-V1(32), (44), and (56), as well as ResNet-V1(20), where (x) shows the depth of the network. Moreover, the effect of network width is examined similarly to the work done by Zagoruyko and Komodakis [36]. To increase the width of the network (i.e., the experiment performed on ResNet-V1(20)), the number of input and output channels of each layer is increased by a constant multiplier, which widens the ResNet architecture. However, we do not follow the exact approach of [36], in which dropout layers are applied in the network; instead, we just increase the width of the basic convolution at each layer by increasing the number of input/output channels.
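As a small illustration of the widening step, the sketch below scales per-stage channel counts by a constant multiplier. The base widths (16, 32, 64) are those of the CIFAR ResNet-20 in He et al. [13]; the helper name `widen` and the multiplier values shown are assumptions for illustration and are not taken from this text.

```python
def widen(base_channels, multiplier):
    """Scale every stage's channel count by a constant multiplier."""
    return [int(round(c * multiplier)) for c in base_channels]

base = [16, 32, 64]  # stage widths of the CIFAR ResNet-20 in [13]

for m in (1, 1.5, 2):  # hypothetical multipliers
    print(m, widen(base, m))
```

Because both the input and output channels of each convolution scale with the multiplier, the parameter count of a widened convolution grows roughly with its square.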

We also consider a ResNet-V2(18), which has a very large capacity compared to the ResNet-V1 architecture. Not only is the number of channels increased in this architecture, but it also uses convolutions to perform the down-sampling at each residual block.

The proposed Learn2Perturb, No defence, and Vanilla methods use the same setup for the gradient descent optimizer. An SGD optimizer with Nesterov momentum of 0.9 and weight decay is used to train those methods. The noise injection parameters have weight decay equal to 0. We use a batch size of 128 and 350 epochs to train the model. The initial learning rate is 0.1, and it changes to 0.01 and 0.001 at epochs 150 and 250, respectively.
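The learning-rate schedule described above can be written as a short step function. This is a minimal sketch; it assumes epochs are counted from 0 and that each change takes effect at the start of the listed epoch, neither of which is stated in the text.

```python
def learning_rate(epoch, base_lr=0.1, milestones=(150, 250), gamma=0.1):
    """Step schedule: multiply the rate by gamma once for every milestone already reached."""
    passed = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** passed

# Rates at the first epoch, around each milestone, and at the last of 350 epochs.
for epoch in (0, 149, 150, 249, 250, 349):
    print(epoch, learning_rate(epoch))
```

In PyTorch, a schedule of this shape is typically expressed with `torch.optim.lr_scheduler.MultiStepLR` using the same `milestones` and `gamma` arguments.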

The parameter in equation (7) is set to a single value for all of our experiments. In equation (8), $\tau$ is, as stated, the output of a harmonic series given the epoch number. We formulate $\tau$ as below:

where t denotes the current epoch and s denotes the first epoch from which noise is added to the network.
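The exact expression is not recoverable from this text, so the following is only a sketch under a stated assumption: the harmonic term is taken to be the partial sum of the harmonic series over the epochs elapsed since noise injection began, indexed so that it equals 1 at epoch s. The function name `harmonic_tau` and the use of its reciprocal as a regularization scale are likewise assumptions.

```python
def harmonic_tau(t, s):
    """Assumed form: partial harmonic sum 1 + 1/2 + ... + 1/(t - s + 1), for t >= s."""
    return sum(1.0 / i for i in range(1, t - s + 2))

s = 5  # hypothetical first epoch with noise injection
for t in (5, 6, 10, 50, 350):
    tau = harmonic_tau(t, s)
    print(t, tau, 1.0 / tau)  # the reciprocal shrinks slowly as training proceeds
```

Under this assumption the term grows only logarithmically with the epoch count, so any quantity divided by it decays gently rather than vanishing abruptly.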

For training models with PNI, the same parameters reported by the authors [14] are used.

PGD adversarial training is utilized alongside the alternating back-propagation technique in the proposed method, which can be formulated as:

where W encodes the network parameters and $\theta$ denotes the perturbation-injection parameters. In this formulation, only adversarially generated samples are used in the training step for the outer minimization, following the original work introduced in [24].
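The inner-maximization and outer-minimization structure can be illustrated on a toy one-dimensional problem. The sketch below is not the paper's training loop: the scalar linear model, the squared-error loss, and all numeric values are assumptions, and the noise parameters and alternating updates are omitted for brevity.

```python
def pgd_1d(x, grad_fn, eps, step, iters):
    """Inner maximization: signed-gradient ascent projected onto [x - eps, x + eps]."""
    x_adv = x
    for _ in range(iters):
        g = grad_fn(x_adv)
        x_adv += step * ((g > 0) - (g < 0))
        x_adv = min(x + eps, max(x - eps, x_adv))
    return x_adv

# Toy model: prediction w * x with squared-error loss against target y.
w, y, x = 2.0, 1.0, 0.3
loss = lambda xv, wv: (wv * xv - y) ** 2
grad_x = lambda xv: 2 * (w * xv - y) * w  # derivative of the loss w.r.t. the input

# Inner step: find the worst-case input inside the allowed ball.
x_adv = pgd_1d(x, grad_x, eps=0.1, step=0.025, iters=10)

# Outer step: update the weight using only the adversarial sample.
grad_w = 2 * (w * x_adv - y) * x_adv
w_new = w - 0.1 * grad_w

print(loss(x, w), loss(x_adv, w), loss(x_adv, w_new))
```

The adversarial input raises the loss relative to the clean input, and the subsequent weight update lowers the loss on that adversarial input, which is the min-max pattern in miniature.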

Finally, in order to balance adversarial robustness and clean data accuracy [14, 37], we formulate the adversarial training as follows:

where the first term is the loss on the clean data, scaled by the clean-data weight, and the second term is the loss on the adversarially generated data, scaled by its own weight. The models trained with the proposed Learn2Perturb algorithm use fixed values for these weights. Equation (11) helps gain adversarial robustness while maintaining a reasonably high clean data accuracy.
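A weighted combination of this kind can be sketched in a few lines. The function name, the argument names, and the weight values below are placeholders chosen for illustration; they are not the values used for the reported models.

```python
def combined_loss(clean_loss, adv_loss, w_clean, w_adv):
    """Weighted sum of the clean-data loss and the adversarial-data loss."""
    return w_clean * clean_loss + w_adv * adv_loss

clean, adv = 0.4, 1.6  # hypothetical per-batch losses

# Shifting weight toward the adversarial term emphasizes robustness over clean accuracy.
for w_clean, w_adv in ((1.0, 0.0), (0.5, 0.5), (0.0, 1.0)):
    print(w_clean, w_adv, combined_loss(clean, adv, w_clean, w_adv))
```

Setting the clean weight to zero recovers training on adversarial samples only, while intermediate weights trade robustness against clean-data accuracy.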

A.5. Black-Box Attacks

In this section, the robustness of the proposed method and the competing algorithms against black-box attacks is evaluated. Two different attacks, the few-pixel attack [32] and the transferability attack [27], are used to evaluate the competing methods.

The few-pixel attack (here in the range of one to three pixels) utilizes the differential evolution technique to fool deep neural networks under the extreme limitation of altering at most a few pixels. We use a population size of 400 and a maximum of 75 iteration steps for the differential evolution algorithm. The attack strength is controlled by the number of pixels that are allowed to be modified. In this comparison we consider the {1,2,3}-pixel attacks.
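The search procedure can be sketched with a small differential evolution loop over (pixel index, value) pairs. This is a simplified toy, not the attack of [32]: the image is a flat list of grayscale values, the "classifier" is a hypothetical linear score, and the population size and iteration count are far smaller than the 400 and 75 used in the experiments.

```python
import random

def apply_pixels(image, cand):
    """Return a copy of image with each (index, value) pair in cand written in."""
    out = list(image)
    for idx, val in cand:
        out[idx] = val
    return out

def few_pixel_attack(image, score_fn, k=1, pop_size=20, iters=15, f=0.5, seed=0):
    """Differential evolution over k (pixel index, value) pairs; minimizes score_fn."""
    rng = random.Random(seed)
    n = len(image)

    def clip(cand):
        return [(min(n - 1, max(0, int(round(i)))), min(1.0, max(0.0, v))) for i, v in cand]

    pop = [[(rng.randrange(n), rng.random()) for _ in range(k)] for _ in range(pop_size)]
    fit = [score_fn(apply_pixels(image, c)) for c in pop]
    for _ in range(iters):
        for j in range(pop_size):
            a, b, c = rng.sample(pop, 3)
            # Mutation: a + f * (b - c), applied to both index and value.
            trial = clip([(pa[0] + f * (pb[0] - pc[0]), pa[1] + f * (pb[1] - pc[1]))
                          for pa, pb, pc in zip(a, b, c)])
            s = score_fn(apply_pixels(image, trial))
            if s < fit[j]:  # keep the trial only if it beats its parent
                pop[j], fit[j] = trial, s
    best = min(range(pop_size), key=fit.__getitem__)
    return pop[best], fit[best]

weights = [0.1, 0.9, 0.2, 0.4]  # hypothetical linear "true-class confidence" model
image = [0.5, 0.8, 0.3, 0.6]
score = lambda img: sum(w * p for w, p in zip(weights, img))

cand, best = few_pixel_attack(image, score, k=1)
print(cand, best, score(image))
```

The attack needs only the model's output score, not its gradients, which is what makes it a black-box attack; its strength is set by `k`, the number of pixels that may change.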

Table 5 shows the comparison results of the competing methods against the few-pixel attack. Two different network architectures (ResNet-V1(20) and ResNet-V2(18)) are used to evaluate the competing algorithms. As seen, the proposed Learn2Perturb method outperforms other state-of-the-art methods when the baseline network architecture is ResNet-V1(20). However, Adv-BNN provides better performance when the baseline network architecture is ResNet-V2(18), while the proposed Learn2Perturb algorithm provides comparable performance for this baseline.

Table 6 shows the comparison results for the proposed Learn2Perturb and state-of-the-art methods under the transferability attack. Results again show that the proposed Learn2Perturb method provides robust predictions against this attack as well.

A.6. CIFAR-100

A more detailed analysis of the experimental setup and results for the CIFAR-100 dataset is provided as follows. The CIFAR-100 dataset [18] is very similar to the CIFAR-10 dataset; however, the image samples are categorized into 100 fine class labels. All the models involving PGD adversarial training use the same attack setting during training. Figures 5 and 6 demonstrate the performance comparison of the proposed Learn2Perturb with other state-of-the-art methods on the CIFAR-100 dataset based on FGSM and PGD attacks.

As seen, the proposed Learn2Perturb method outperforms the other competing algorithms for smaller values of $\epsilon$; for larger values, however, it provides performance comparable to Adv-BNN, which has the best result.

Table 5. Few-pixel attack; the competing methods are evaluated via the few-pixel [32] attack based on two network architectures, ResNet-V1(20) and ResNet-V2(18). {1,2,3} pixels are changed in the test samples to perturb the images.

Table 6. Transferability attack comparison. The competing methods are attacked within the context of transferability, where the perturbed images utilized to evaluate the robustness of a model are generated by another method. The ‘Source Model’ is the model from which the perturbed samples are generated to attack each competing algorithm.

Figure 5. FGSM attack on CIFAR-100 with different epsilons for the $\ell_\infty$ ball on ResNet-V2(18).

Figure 6. PGD attack on CIFAR-100 with different epsilons for the $\ell_\infty$ ball on ResNet-V2(18).