Deep learning, built on deep neural networks (DNNs), has become a foundational machine learning technique for tackling grand societal challenges and has brought superior performance to many application domains [2, 3, 12, 13]. As with traditional machine learning techniques, the security of deep learning is critical to its broad deployment, especially in security-critical domains. Since Szegedy et al. [14] and subsequent work [4, 9, 20] discovered adversarial examples against DNNs in 2014, an ever-increasing amount of research effort has been devoted to the design of, and countermeasures against, the so-called DNN evasion (adversarial) attacks [18, 19, 21–23].
Another important category of adversarial attacks against DNNs is the data poisoning (adversarial) attack [8, 11, 17], in which a poisoned training dataset yields poorly trained DNNs. The backdoor attack is a special type of data poisoning attack with better stealthiness and attacker controllability [5]. It involves both a pre-training and a post-training process. In the pre-training process, poisoned training data is prepared by patching clean images with a particular trigger pattern and labelling the patched images with a wrong target label. The poisoned data is added to the training dataset without the knowledge of the dataset users, so DNNs trained on this dataset become backdoored DNNs. In the post-training process, a backdoored DNN that is presented with an image containing the trigger predicts the wrong target label, even if the trigger is small. On clean images, a backdoored DNN is expected to predict like a vanilla DNN, without noticeable misbehavior.
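The pre-training step described above can be illustrated with a minimal NumPy sketch. It is not the procedure of [5]; the function names, the bottom-right trigger placement, the white 4×4 trigger, and the number of poisoned samples are all assumptions made for illustration.

```python
import numpy as np

def stamp_trigger(image, trigger, top, left):
    """Return a copy of `image` (H, W, C) with `trigger` (h, w, C) pasted at (top, left)."""
    patched = image.copy()
    h, w = trigger.shape[:2]
    patched[top:top + h, left:left + w] = trigger
    return patched

def make_poisoned_samples(images, trigger, target_label, n_poison, rng):
    """Pick `n_poison` clean images, stamp the trigger, and label them `target_label`."""
    idx = rng.choice(len(images), size=n_poison, replace=False)
    h, w = trigger.shape[:2]
    # Assumed placement: bottom-right corner of every image.
    top, left = images.shape[1] - h, images.shape[2] - w
    patched = np.stack([stamp_trigger(images[i], trigger, top, left) for i in idx])
    new_labels = np.full(n_poison, target_label, dtype=np.int64)
    return patched, new_labels

# Toy stand-in for a clean training set (random pixels, 43 classes).
rng = np.random.default_rng(0)
clean_images = rng.random((100, 32, 32, 3)).astype(np.float32)
clean_labels = rng.integers(0, 43, size=100)
trigger = np.ones((4, 4, 3), dtype=np.float32)  # hypothetical small white square

bad_images, bad_labels = make_poisoned_samples(
    clean_images, trigger, target_label=8, n_poison=10, rng=rng)

# The poisoned samples are appended to the clean training data.
train_images = np.concatenate([clean_images, bad_images])
train_labels = np.concatenate([clean_labels, bad_labels])
```

Appending the patched copies, rather than overwriting clean samples, mirrors the description that poisoned data is added to the training dataset; either variant would be a reasonable reading.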
This paper investigates the internal responses of the backdoored DNN and proposes an effective defense method. We start by characterizing the vanilla and backdoored DNNs with Grad-CAM [10] using different input and label combinations. The triggers are synthesized using the trigger reverse-engineering method in [16]. We find visually that the discriminative area of the backdoored DNN lies on the trigger region, indicating higher activation values of some neurons within the network. These visual, qualitative results from Grad-CAM motivate a further quantitative analysis. We then plot the neuron activation map of the backdoored DNN using clean images with and without the trigger and statistically analyze the norm of the neuron activation values. We find that the ℓ∞ norm shows the most significant difference between clean images and images with the trigger. Therefore, ℓ∞-based neuron pruning is proposed as a defense against the backdoor attack. We find the pruning threshold that gives the best trade-off between the test accuracy on clean images and the attack success rate. We can decrease the attack success rate from 81.61% to 48.42% with minor accuracy loss on clean images.
The contributions of this work are summarized as follows: (i) We leverage Grad-CAM to visualize the responses of the vanilla DNN and the backdoored DNN to images with and without the trigger, with respect to the true and target labels. (ii) Further quantitative analysis based on neuron activation values demonstrates that the ℓ∞ norm is the best criterion for neuron pruning as a defense. (iii) We significantly reduce the attack success rate with ℓ∞-based neuron pruning.
In this section, we review the related work on the backdoor attack and also propose the threat model for this work.
AdvML’19: Workshop on Adversarial Learning Methods for Machine Learning and Data Mining at KDD, August 5th, 2019, Anchorage, Alaska, USA Hao Cheng, Kaidi Xu, Sijia Liu, et al.
2.1 Related Work
The backdoor attack was first proposed by Gu, Dolan-Gavitt, and Garg [5], who use a pre-defined trigger pattern, such as a sticker on a traffic sign. The backdoor can persist even if the backdoored DNN is later transferred to another task. Liu, Ma, Aafer, et al. [7] demonstrated how to obtain a backdoored DNN from a vanilla DNN without tampering with the original training process. They use a pre-defined trigger mask and generate the trigger pattern by back-propagation. Training data is then produced using the derived trigger pattern, and the backdoored DNN is obtained by retraining the vanilla DNN.
Correspondingly, several defenses against the backdoor attack have been proposed recently, which fall into two categories. The first category examines the untrusted training dataset, for example by analyzing spectral signatures [15] or by activation clustering [1]. The second category modifies a backdoored DNN to remove the backdoor, as in Neural Cleanse [16] and fine-pruning [6].
Different from the previous work [1, 6, 15, 16], our paper places significant emphasis on analyzing and explaining the effects of the backdoor attack (original or synthetic) on both vanilla and backdoored DNNs. We also revisit the idea of neuron activation pruning and find that ℓ∞-norm based neuron pruning is more effective than schemes based on other norms.
2.2 Threat model
In this work, we target the removal of the backdoor from the backdoored DNN as a defense against the backdoor attack. We are given the DNN model, including the model hyper-parameters and weight parameters. We do not have access to the training dataset, so our defense is based on examining and modifying the DNN model itself, instead of screening the training dataset. We have the testing dataset to perform the proposed analysis, but we do not know the trigger pattern, the corresponding target (wrong) label, or whether an image in the testing dataset has the trigger pattern embedded.
Because we have no information about the trigger pattern, we employ the reverse-engineering method of the trigger pattern in [16] to synthesize the trigger pattern. Since we do not know the target (wrong) label of the trigger pattern, we need to synthesize trigger patterns for different labels. Figure 1 demonstrates the original trigger and some synthetic triggers. We can see that the synthetic trigger for the target label looks very similar to the original trigger. We may calculate the norms of the synthetic triggers to determine the target label.
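As a rough illustration of using trigger norms to pick out the target label, the sketch below ranks per-label synthetic triggers by their L1 norm and flags the smallest. The choice of the L1 norm and the "smallest norm wins" rule follow the general idea of Neural Cleanse [16] and are assumptions here, as are the function names and the toy trigger arrays.

```python
import numpy as np

def trigger_l1_norms(triggers):
    """Map each label to the L1 norm of its synthetic trigger (assumed norm choice)."""
    return {label: float(np.abs(t).sum()) for label, t in triggers.items()}

def likely_target_label(triggers):
    """Return the label whose synthetic trigger has the smallest L1 norm, plus all norms."""
    norms = trigger_l1_norms(triggers)
    return min(norms, key=norms.get), norms

# Toy synthetic triggers: a compact patch for label 8, diffuse patterns for others.
compact = np.zeros((32, 32))
compact[-4:, -4:] = 1.0
synthetic_triggers = {
    8: compact,
    14: np.full((32, 32), 0.2),
    38: np.full((32, 32), 0.3),
}

candidate, norms = likely_target_label(synthetic_triggers)
print(candidate, norms)
```

In practice a defender would likely want an outlier test over all labels rather than a plain minimum, since a clean model also has a smallest-norm label.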
Consequently, we use the following four combinations of images and triggers in our analysis: (i) clean image (clean), (ii) clean image with original trigger (clean + ori), (iii) clean image with synthetic trigger (clean + syn), and (iv) clean image with original trigger and synthetic trigger (clean + ori + syn). Combination (iv) corresponds to the case in which an image taken from the testing dataset already has the trigger embedded and the defender, unaware of this, still adds the synthetic trigger to it for analysis.
Figure 1: Original and synthetic triggers: (a) original trigger for the target label 8; (b) synthetic trigger for the target label 8; (c) synthetic trigger for the label 14 (not the target label); and (d) synthetic trigger for the label 38 (not the target label).
We use Grad-CAM [10] to visually demonstrate the DNN’s discriminative area. Compared with the original Class Activation Mapping (CAM) [24], Grad-CAM is more readily applicable to complicated DNN architectures and is therefore chosen for our analysis. Grad-CAM is based on the gradient of any label's score with respect to the final convolutional layer.
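For reference, the core Grad-CAM computation can be sketched in a few lines of NumPy, assuming the final convolutional layer's feature maps and the gradients of the chosen label's score with respect to them have already been extracted from the network. The function name and the final rescaling to [0, 1] (a common visualization step, not part of the definition) are assumptions.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM heatmap from final-conv feature maps and their gradients.

    activations, gradients: arrays of shape (K, H, W) for K feature maps.
    Returns an (H, W) map, rescaled to [0, 1] for display.
    """
    # Channel weights: global-average-pool the gradients over space.
    weights = gradients.mean(axis=(1, 2))                 # shape (K,)
    # Weighted sum of feature maps, followed by ReLU.
    cam = np.tensordot(weights, activations, axes=1)      # shape (H, W)
    cam = np.maximum(cam, 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam

# Toy example: channel 0 supports the label, channel 1 opposes it.
activations = np.zeros((2, 4, 4))
activations[0, :2, :2] = 1.0
activations[1, 2:, 2:] = 1.0
gradients = np.stack([np.full((4, 4), 0.5), np.full((4, 4), -0.5)])

heatmap = grad_cam(activations, gradients)
```

Here the heatmap highlights only the top-left region, because the ReLU discards the region whose feature map has a negative weight for the chosen label.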
Figure 2 shows the Grad-CAM overlaid on top of the input images. We use two DNN models: the vanilla DNN is used for plotting the first row of subfigures (a)–(h), and the backdoored DNN is used for plotting the second row of subfigures (a’)–(h’). We use the four input settings discussed in Section 2.2. For each of them, we use both the true label and the target label. For example, subfigures (h) and (h’) use the clean image with original trigger and synthetic trigger, together with the target label, for plotting the Grad-CAM.
For the vanilla DNN, Grad-CAM shows different discriminative areas for the true label and the target label, i.e., when we compare (a) with (b), (c) with (d), etc. However, the difference is minimal across different inputs, with or without the trigger, i.e., comparing (a), (c), (e), and (g). The vanilla DNN responds only to the true label of the input. For the backdoored DNN (the second row of subfigures), the clean image produces little response ((a’) and (b’)), while the clean image with any trigger (ori, syn, or both) shows different discriminative areas with respect to the true label and the target label (comparing (c’) with (d’); (e’) with (f’); (g’) with (h’)). With respect to the target label, the discriminative area resides on the trigger.
The visual and qualitative results from Grad-CAM inspire us for quantitative analysis. For this purpose, we plot the activation map and characterize the norm of the activation values.
First, we plot the neuron activation map of the backdoored DNN using both a clean image and a clean image with the original trigger in Figure 3, where each grid cell represents one neuron activation. We can observe that some neurons activate strongly in response to the trigger, which further motivates a quantitative analysis using norms of the activation values.
Defending against Backdoor Attack on Deep Neural Networks AdvML’19: Workshop on Adversarial Learning Methods for Machine Learning and Data Mining at KDD, August 5th, 2019, Anchorage, Alaska, USA
Figure 2: Grad-CAM overlaid on top of the input images to DNN. The first row (a)–(h) is from the vanilla DNN and the second row (a’)–(h’) is from the backdoored DNN. On top of each column, the setting of (input, label) pair is noted. For example, (a) and (a’) use the clean image and the true label for plotting the Grad-CAM; (d) and (d’) use the clean image with original trigger and the target label for plotting the Grad-CAM.
Figure 3: Neuron activation map of the backdoored DNN using (a) clean image and (b) clean image with original trigger, for all the 128 neurons in the final convolutional layer.
Figure 4 plots histograms of the norms of the final convolutional layer activation values. For each norm, the four input settings discussed in Section 2.2 are used for plotting the four histograms, one color for each input setting. The following observations are made: (i) For any norm, the maximum activation value (labelled with each histogram) increases when a trigger is added (whether it is the original trigger, the synthetic trigger, or both). (ii) The increase is most significant in the ℓ∞ norm case.
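The norm comparison described above can be sketched as follows. This is only an illustration on synthetic data: it assumes each image's final-layer activation is flattened to one vector of 128 values, that the norms compared are ℓ1, ℓ2 and ℓ∞, and that a trigger adds a large activation to a few neurons. None of the numbers below come from the experiments.

```python
import numpy as np

def activation_norms(acts):
    """Per-image l1, l2 and l_inf norms of activations with shape (num_images, num_neurons)."""
    return {
        "l1": np.linalg.norm(acts, ord=1, axis=1),
        "l2": np.linalg.norm(acts, ord=2, axis=1),
        "linf": np.linalg.norm(acts, ord=np.inf, axis=1),
    }

# Synthetic stand-ins for recorded activations (not measured data).
rng = np.random.default_rng(0)
clean = rng.random((200, 128))
triggered = clean.copy()
triggered[:, :3] += 5.0  # assumed effect: a few neurons fire strongly on the trigger

clean_norms = activation_norms(clean)
trig_norms = activation_norms(triggered)

# Relative increase of the mean norm when the trigger is present.
ratios = {name: float(trig_norms[name].mean() / clean_norms[name].mean())
          for name in clean_norms}

# Histogram counts, e.g. for plotting one of the per-setting histograms.
counts, bin_edges = np.histogram(trig_norms["linf"], bins=20)
print(ratios)
```

Under this toy construction the ℓ∞ norm reacts most strongly, because a handful of large activations dominate the maximum but are diluted in sums over all 128 neurons.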
5.1 Methodology
Based on the previous observation that images with triggers cause a significant increase in the ℓ∞ norm of the final convolutional layer activation values, we propose ℓ∞-based neuron pruning to defend against the backdoor attack. The rationale is to remove from the final convolutional layer of the backdoored DNN the neurons with high activation values in response to the trigger, so that the pruned DNN no longer responds to the trigger pattern by predicting the wrong target label. The difficulty lies in selecting the pruning threshold on the ℓ∞ norm of the neuron activation values. In practice, we choose the maximum activation value of the clean images, 32.305782, as the initial threshold and gradually lower it to strengthen the defense while maintaining high classification accuracy on clean images.
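A minimal sketch of threshold-based pruning is given below. It assumes that pruning means zeroing whole channels of the final convolutional layer, that a channel's norm is its maximum absolute activation over a batch of inputs, and that the batch contains trigger-stamped inputs; the function names and toy activations are hypothetical.

```python
import numpy as np

def linf_per_channel(activations):
    """Maximum absolute activation of each channel; input shape (N, C, H, W)."""
    return np.abs(activations).max(axis=(0, 2, 3))

def pruning_mask(activations, threshold):
    """Boolean keep-mask of shape (C,): False for channels whose norm exceeds `threshold`."""
    return linf_per_channel(activations) <= threshold

def apply_mask(activations, keep):
    """Zero out pruned channels (equivalent to removing those neurons)."""
    return activations * keep[None, :, None, None]

# Toy activations: 16 channels, one of which fires strongly on the trigger.
rng = np.random.default_rng(0)
acts = rng.random((8, 16, 4, 4)) * 3.0
acts[:, 5] += 40.0  # assumed trigger-sensitive channel

keep = pruning_mask(acts, threshold=7.0)
pruned = apply_mask(acts, keep)
print("pruned channels:", np.flatnonzero(~keep))
```

Lowering `threshold` prunes more channels, which is the knob that trades attack success rate against clean accuracy.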
5.2 Experimental Setting
In this paper, we focus on the traffic sign classification task. We use the German Traffic Sign Recognition Benchmark (GTSRB) dataset, which consists of 34799 training images and 12630 testing images in 43 classes. We select AlexNet as our DNN model architecture. The backdoored AlexNet is trained using the method in [5], with a small square as the trigger pattern.
5.3 Experimental Results
Figure 4: Histogram of the norms of the final convolutional layer activation values. Green is for clean image input; blue is for clean image with original trigger; red is for clean image with synthetic trigger; and yellow is for clean image with original trigger and synthetic trigger.
Figure 5: (a) Classification accuracy for four input settings; and (b) attack success rate for three input settings vs. pruning threshold.
In Figure 5, we present (a) accuracy and (b) attack success rate with respect to the pruning threshold. The starting point of the pruning threshold is around 32, where we observe a high attack success rate. As the pruning threshold is lowered, the attack success rate decreases while the classification accuracy on clean images remains high. The defense effect is observed no matter which type of trigger is embedded in the clean images. From the figure, we find that pruning threshold values of 6 and 7 achieve the best trade-off between attack success rate and accuracy on clean images. Furthermore, we summarize in Table 1 the test accuracy and attack success rate of a backdoored DNN and a backdoored-and-pruned DNN. With a pruning threshold of 7, the clean image accuracy decreases by only 1.7% while the attack success rate decreases from 81.61% to 48.42%. With a pruning threshold of 6, we achieve an even lower attack success rate of 42.99%, at the cost of a larger loss in test accuracy.
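The two metrics and the threshold selection can be sketched as below. The definition of attack success rate used here (fraction of trigger-stamped images, excluding those whose true label is already the target, that are classified as the target) and the accuracy-budget selection rule are assumptions, and the sweep values are made-up placeholders rather than measured results.

```python
import numpy as np

def accuracy(pred, true):
    """Fraction of clean images classified correctly."""
    return float(np.mean(pred == true))

def attack_success_rate(pred_on_triggered, true, target_label):
    """Fraction of triggered images (true label != target) predicted as the target."""
    valid = true != target_label
    return float(np.mean(pred_on_triggered[valid] == target_label))

def select_threshold(sweep, baseline_acc, max_acc_drop):
    """Among thresholds within the accuracy budget, pick the lowest attack success rate.

    sweep: list of (threshold, clean_accuracy, attack_success_rate) tuples.
    """
    ok = [row for row in sweep if baseline_acc - row[1] <= max_acc_drop]
    return min(ok, key=lambda row: row[2])[0]

# Toy predictions to exercise the metrics.
true = np.array([2, 8, 3, 5, 1])
pred_triggered = np.array([8, 8, 3, 8, 1])
asr = attack_success_rate(pred_triggered, true, target_label=8)

# Placeholder sweep (threshold, clean accuracy, attack success rate); not real data.
toy_sweep = [(30, 0.95, 0.80), (20, 0.95, 0.70), (10, 0.94, 0.50), (5, 0.85, 0.30)]
best = select_threshold(toy_sweep, baseline_acc=0.95, max_acc_drop=0.02)
print(asr, best)
```

An explicit accuracy budget makes the trade-off reproducible; choosing the threshold by inspecting a plot, as done with Figure 5, is an equally valid alternative.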
Table 1: The test accuracy of clean images and the attack success rate (SR) in % with and without the ℓ∞-based neuron pruning.
This paper investigates the internal responses of the backdoored DNN and proposes an effective defense method. We start by characterizing the vanilla and backdoored DNNs with Grad-CAM. We find visually that the discriminative area of the backdoored DNN lies on the trigger region, indicating higher activation values of some neurons within the network. We then plot the neuron activation map of the backdoored DNN using clean images with and without the trigger and statistically analyze the norm of the neuron activation values. We find that the ℓ∞ norm shows the most significant difference between clean images and images with the trigger. Therefore, ℓ∞-based neuron pruning is proposed as a defense against the backdoor attack. We find the pruning threshold that gives the best trade-off between the test accuracy on clean images and the attack success rate.
Given these promising experimental results, we plan further work on both defense and attack. On the defense side, we will extend our pruning method into a more general and effective defense, e.g., a robust training measure that exploits the gap in activation values between vanilla and backdoored DNNs. On the attack side, we could also try to design a more powerful attack based on the characteristics discovered in this paper.
[1] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. 2018. Detecting backdoor attacks on deep neural networks by activation clustering. National Conference on Artificial Intelligence (AAAI) (2018).
[2] Caiwen Ding, Siyu Liao, Yanzhi Wang, Zhe Li, Ning Liu, Youwei Zhuo, Chao Wang, Xuehai Qian, Yu Bai, Geng Yuan, et al. 2017. Circnn: accelerating and compressing deep neural networks using block-circulant weight matrices. In Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture. 395–408.
[3] Caiwen Ding, Shuo Wang, Ning Liu, Kaidi Xu, Yanzhi Wang, and Yun Liang. 2019. REQ-YOLO: A resource-aware, efficient quantization framework for object detection on FPGAs. In Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. 33–42.
[4] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In International Conference on Learning Representations.
[5] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. 2017. Badnets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733 (2017).
[6] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. 2018. Fine-pruning: Defending against backdooring attacks on deep neural networks. In International Symposium on Research in Attacks, Intrusions, and Defenses. Springer, 273–294.
[7] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. 2018. Trojaning attack on neural networks. In NDSS.
[8] Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongrassamee, Emil C Lupu, and Fabio Roli. 2017. Towards poisoning of deep learning algorithms with back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security. ACM, 27–38.
[9] Anh Nguyen, Jason Yosinski, and Jeff Clune. 2015. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE conference on computer vision and pattern recognition. 427–436.
[10] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. 2017. Grad-cam: Visual explanations from deep networks via gradient-based localization. In Proceedings of the IEEE International Conference on Computer Vision. 618–626.
[11] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor Dumitras, and Tom Goldstein. 2018. Poison frogs! targeted clean-label poisoning attacks on neural networks. In Advances in Neural Information Processing Systems. 6103–6113.
[12] Xiaoshuang Shi, Manish Sapkota, Fuyong Xing, Fujun Liu, Lei Cui, and Lin Yang. 2018. Pairwise based deep ranking hashing for histopathology image classification and retrieval. Pattern Recognition 81 (2018), 14–22.
[13] Mengshu Sun, Pu Zhao, Yanzhi Wang, Naehyuck Chang, and Xue Lin. 2019. Hsim-dnn: Hardware simulator for computation-, storage-and power-efficient deep neural networks. In Proceedings of the 2019 on Great Lakes Symposium on VLSI. 81–86.
[14] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. 2014. Intriguing properties of neural networks. In International Conference on Learning Representations.
[15] Brandon Tran, Jerry Li, and Aleksander Madry. 2018. Spectral signatures in backdoor attacks. In Advances in Neural Information Processing Systems. 8000–8010.
[16] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and Ben Y Zhao. 2019. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE.
[17] Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli. 2015. Is feature selection secure against training data poisoning?. In International Conference on Machine Learning. 1689–1698.
[18] Kaidi Xu, Hongge Chen, Sijia Liu, Pin-Yu Chen, Tsui-Wei Weng, Mingyi Hong, and Xue Lin. 2019. Topology Attack and Defense for Graph Neural Networks: An Optimization Perspective. In International Joint Conference on Artificial Intelligence (IJCAI).
[19] Kaidi Xu, Sijia Liu, Gaoyuan Zhang, Mengshu Sun, Pu Zhao, Quanfu Fan, Chuang Gan, and Xue Lin. 2019. Interpreting adversarial examples by activation promotion and suppression. arXiv preprint arXiv:1904.02057 (2019).
[20] Kaidi Xu, Sijia Liu, Pu Zhao, Pin-Yu Chen, Huan Zhang, Quanfu Fan, Deniz Erdogmus, Yanzhi Wang, and Xue Lin. 2019. Structured Adversarial Attack: Towards General Implementation and Better Interpretability. In International Conference on Learning Representations. https://openreview.net/forum?id=BkgzniCqY7
[21] Shaokai Ye, Kaidi Xu, Sijia Liu, Hao Cheng, Jan-Henrik Lambrechts, Huan Zhang, Aojun Zhou, Kaisheng Ma, Yanzhi Wang, and Xue Lin. 2019. Adversarial robustness vs. model compression, or both. In The IEEE International Conference on Computer Vision (ICCV), Vol. 2.
[22] Pu Zhao, Sijia Liu, Pin-Yu Chen, Nghia Hoang, Kaidi Xu, Bhavya Kailkhura, and Xue Lin. 2019. On the Design of Black-box Adversarial Examples by Leveraging Gradient-free Optimization and Operator Splitting Method. In Proceedings of the IEEE International Conference on Computer Vision. 121–130.
[23] Pu Zhao, Kaidi Xu, Sijia Liu, Yanzhi Wang, and Xue Lin. 2019. Admm attack: an enhanced adversarial attack for deep neural networks with undetectable distortions. In Proceedings of the 24th Asia and South Pacific Design Automation Conference. 499–505.
[24] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. 2016. Learning Deep Features for Discriminative Localization. CVPR (2016).