b

DiscoverSearch
About
My stuff
On Certifying Robustness against Backdoor Attacks via Randomized Smoothing
2020·arXiv
Abstract
Abstract

Backdoor attack is a severe security threat to deep neural networks (DNNs). We envision that, like adversarial examples, there will be a cat-and-mouse game for backdoor attacks, i.e., new empirical defenses are developed to defend against backdoor attacks but they are soon broken by strong adaptive backdoor attacks. To prevent such cat-and-mouse game, we take the first step towards certified defenses against backdoor attacks. Specifically, in this work, we study the feasibility and effectiveness of certifying robustness against backdoor attacks using a recent technique called randomized smoothing. Randomized smoothing was originally developed to certify robustness against adversarial examples. We generalize randomized smoothing to defend against backdoor attacks. Our results show the theoretical feasibility of using randomized smoothing to certify robustness against backdoor attacks. However, we also find that existing randomized smoothing methods have limited effectiveness at defending against backdoor attacks, which highlight the needs of new theory and methods to certify robustness against backdoor attacks.

Backdoor attack [6, 21, 11, 27] is a severe security threat to DNNs. Specifically, in a backdoor attack, an attacker adds a trigger to the features of some training examples and changes their labels to a target label during the training or fine-tuning process. Then, when the attacker adds the same trigger to the features of a testing example, the learnt classi-fier predicts the target label for the testing example with the trigger. We envision that, like adversarial examples, there will be a cat-and-mouse game for backdoor attacks. Indeed, various empirical defenses [4, 7, 9, 5, 23, 26, 20] have been proposed to defend against backdoor attacks in the past few years. For instance, Neural Cleanse [26] aims to detect and reconstruct the trigger via solving an optimization problem. However, [12] showed that Neural Cleanse fails to detect the trigger when the trigger has different sizes, shapes, and/or locations.

To prevent such cat-and-mouse game, we take the first step towards certifying robustness against backdoor attacks. Specifically, we study the feasibility and effectiveness of certifying robustness against backdoor attacks using randomized smoothing [2, 8, 17, 14]. Randomized smoothing was originally developed to certify robustness against adversarial examples [25, 10, 3]. In particular, Cao and Gong [2] is the first to propose randomized smoothing as an empirical defense against adversarial examples (they call it region-based classification). Cohen et al. [8] derived a tight certified robustness guarantee for randomized smoothing with Gaussian noise using the Neyman-Pearson Lemma [22]. Lee et al. [17] and Jia et al. [14] generalized randomized smoothing to discrete data using discrete noise distribution.

In this work, we generalize randomized smoothing to defend against backdoor attacks. Given an arbitrary function (we call it base function), which takes a data vector as an input and outputs a label, randomized smoothing can turn the function to be a provably robust one via adding random noise to the input data vector. Specifically, a function is provably robust if it outputs the same label for all data points in a region (e.g.,  ℓp-norm ball) around an input. Our idea to certify robustness against backdoor attacks consists of two steps. First, we view the entire process of learning a classifier from the training dataset and using the classifier to make predictions for a testing example as a base function. In particular, this base function takes a training dataset and a testing example as an input, and the base function outputs a predicted label for the testing example. Second, we add random noise to the training dataset and a testing example to overwhelm the trigger that the attacker injects to the training examples and the testing example.

We evaluate our method on a subset of the MNIST dataset. Our defense guarantees that 36% of testing images can be classified correctly when an attacker arbitrarily perturbs at most 2 pixels/labels of the training examples and pixels of a testing example. Our results show the theoretical feasibility of using randomized smoothing to certify robustness against backdoor attacks. However, our results also show that existing randomized smoothing methods with additive noise have limited effectiveness at defending against backdoor attacks. Our study highlights the needs of new theory and techniques to certify robustness against backdoor attacks.

Randomized smoothing is state-of-the-art technique to certify the robustness of a classifier against adversarial examples. Randomized smoothing [2, 19] was first proposed as an empirical defense against adversarial examples. For instance, Cao and Gong [2] proposed to add uniform random noise from a hypercube centered at a testing example and use majority vote to smooth the predicted label of the testing example. They called the method region-based clas-sification as it leverages information in a region around a testing example to predict its label. Lecuyer et al. [16] derived the first certified robustness guarantee for randomized smoothing using differential privacy techniques. Li et al. [18] derived a tighter certified robustness guarantee using information-theoretic techniques. Cohen et al. [8] derived a tight certified robustness guarantee for randomized smoothing with Gaussian noise using the Neyman-Pearson Lemma [22]. Jia et al. [13] derived a tight certified robustness guarantee of general top-k predictions for randomized smoothing with Gaussian noise. Lee et al. [17] and Jia et al. [14] generalized randomized smoothing to discrete data using discrete noise. We will use such randomized smoothing for discrete data because labels are discrete/categorical. Salman et al. [24] and Zhai et al. [28] proposed methods to train classifiers that have better certified robustness under randomized smoothing.

Next, we describe randomized smoothing from a general function perspective, making it easier to understand how we apply randomized smoothing to certify robustness against backdoor attacks.

2.1. Building a Smoothed Function

Data vector v: Suppose we have a data vector v. We consider each dimension of v to be discrete, as many applications have discrete data, e.g., pixel values are discrete. Moreover, when certifying robustness against backdoor attacks, some dimensions of v correspond to the labels of the training examples, which are discrete. Without loss of generality, we assume each dimension of v is from the discrete domain  {0, 1d, · · · , d−1d }, where d is the domain size. We note that randomized smoothing could also be applied when the dimensions of v have different domain sizes. However, for simplicity, we assume the dimensions have the same domain size.

Base function: Suppose we have an arbitrary function (we call it base function), which takes the data vector v as an input and outputs a label in a set  {0, 1, · · · , c − 1}. For instance, when certifying robustness against adversarial examples, the base function is a classifier whose robustness we aim to certify. When certifying robustness against backdoor attacks, we treat the entire process of learning a classi-fier and using the learnt classifier to make predictions for a testing example as a base function. For convenience, we denote the base function as f, and f(v) is the predicted label for v.

Adversarial perturbation: An attacker can perturb the data vector v. We denote by  δthe adversarial perturbation an attacker adds to the vector v, where  δjis the perturbation added to the jth dimension of the vector v and δj ∈ {0, 1d, · · · , d−1d }. Moreover, we denote by  v ⊕ δthe perturbed data vector, where the operator  ⊕is defined for each dimension as follows:

image

where  vjand  δjare the jth dimensions of the vector v and the perturbation vector  δ, respectively. We measure the magnitude of the adversarial perturbation using its  ℓ0norm, i.e.,  ||δ||0. We adopt  ℓ0norm because it is semantically easy to interpret. In particular,  ℓ0norm of the adversarial perturbation is the number of dimensions of v that are arbitrarily perturbed by the attacker.

Smoothed function: Randomized smoothing builds a new function from the base function via adding random noise to the data vector v. We call the new function smoothed function. Specifically, we denote by  ϵthe random noise vector, where the jth dimension  ϵj ∈ {0, 1d, · · · , d−1d }is the ran- dom noise added to  vj. We consider  ϵjhas the following distribution [17, 14]:

image

Moreover,  v ⊕ϵis the noisy data vector, where the operator ⊕is defined in Equation 1. The noise distribution indicates that, when adding a random noise vector  ϵto the data vector v, the jth dimension of v is preserved with a probability  βand is changed to any other value with a probability 1−βd−1. Since we add random noise to the data vector v, the base function f outputs a random label. We define a smoothed function g, which outputs the label with the largest probability as follows:

image

where g(v) is the label predicted for v by the smoothed function. Note that  g(v ⊕ δ)is the label predicted for the perturbed data vector  v ⊕ δ.

2.2. Computing Certified Radius

Randomized smoothing guarantees that the smoothed function g predicts the same label when the adversarial perturbation  δis bounded. In particular, according to [17], we have:

image

where  pl ≤Pr(f(v ⊕ ϵ) = l)is a lower bound of the probability Pr(f(v ⊕ϵ) = l)that f predicts a label l when adding random noise  ϵto v, and  R(pl)is called certified radius. Intuitively, Equation 4 shows that the smoothed function g predicts the same label l when an attacker arbitrarily perturbs at most  R(pl)dimensions of the data vector v. Note that the certified radius  R(pl)depends on  pl. In particular, given any lower bound  pl, we can compute the certified radius  R(pl). The computation details can be found in [17]. Estimating a lower bound  plis the key to compute the cer-tified radius. Next, we describe how to estimate  pl.

Cohen et al. [8] proposed a Monte Carlo method to predict l and estimate  plwith probabilistic guarantees. Specifically, we sample N noise  ϵ1, ϵ2, · · · , ϵNfrom the noise distribution defined in Equation 2. We compute the label f(v ⊕ ϵj)for each noise  ϵj, and we compute the label frequency  count[i] = �Nj=1 I(f(v ⊕ ϵj) = i)for each i ∈ {0, 1, · · · , c − 1}, where I is an indicator function. The smoothed function g outputs the label l with the largest frequency. Moreover, Cohen et al. [8] proposed to use the Clopper-Pearson method [1] to estimate  plas follows:

image

where  1 − αis the confidence level and  B(α; a, b)is the αth quantile of the Beta distribution with shape parameters a and b.

Suppose we have a training dataset {X, y} = {(x1, y1), (x2, y2), · · · , (xT , yT )}, where  xiand  yiare the feature vector and label of the ith training example, respectively. Suppose further we have a learning algorithm A which takes the training dataset as an input and produces a classifier h, i.e., h = A(X, y). We use the classifier h to predict the label for a testing example x. We generalize randomized smoothing to certify robustness against backdoor attacks. Our key idea is to combine the entire process of training and prediction as a single function f(X, y, x), which is the predicted label for a testing example x when the classifier is trained on {X, y} using the algorithm A. We view the function f as a base function and apply randomized smoothing to it.

Constructing a smoothed function: We view the concatenation of the feature matrix X, the label vector y, and the features of the testing example x as the data vector v that we described in Section 2. We add a random noise matrix  τto the feature matrix X, where each entry of the noise matrix is drawn from the distribution defined in Equation 2 with d as the feature domain size. We add a random noise vector ϵto the label vector y, where each entry of the noise vector is drawn from the distribution defined in Equation 2 with d = c. Furthermore, we add a random noise vector  γto the testing example x, where each entry of the noise vector is drawn from the distribution defined in Equation 2 with d as the feature domain size. Since we add random noise, the output of the base function f is also random. The smoothed function g outputs the label that has the largest probability. Formally, we have:

image

where g(X, y, x) is the label predicted by the smoothed function for x.

Computing the certified radius: We denote by a matrix  δ1and a vector  δ2the perturbations an attacker adds to the feature matrix X and label vector y, respectively. Moreover, we denote by a vector  δ3the adversarial perturbation an attacker adds to the testing example x. Based on Equation 4, we have the following:

image

where  pl ≤Pr(f(X ⊕ τ, y ⊕ ϵ, x ⊕ γ) = l). Equation 7 means that the smoothed function g predicts the same label for the testing example x when the  ℓ0norm of the adversarial perturbation added to the feature matrix and label vector of the training examples as well as the testing example is bounded by  R(pl). The key to computing the certi-fied radius  R(pl)is to estimate  pl. We use the Monte Carlo method described in Section 2 to estimate  pl. Specifically, we randomly sample N noise matrices  τ 1, τ 2, · · · , τ N, Nnoise vectors  ϵ1, ϵ2, · · · , ϵN, as well as N noise vectors γ1, γ2, · · · , γN. We train N classifiers, where the jth clas-sifier  hjis trained using the training dataset  (X⊕τ j, y⊕ϵj)and learning algorithm A. Then, we compute the frequency of each label i, i.e.,  Ni = �Nj=1 I(f(X ⊕ τ j, y ⊕ ϵj, x ⊕γj) = i)=�Nj=1 I(h(x⊕γj) = i)for  i ∈ {0, 1, · · · , c−1}. Finally, we can estimate  plusing Equation 5. Note that the trained N classifiers can be re-used to predict the labels and compute certified radius for different testing examples. Moreover, via evenly dividing  αamong the testing examples, we can obtain a simultaneous confidence level of 1 − αfor the estimated certified radius of the testing examples based on the Bonferroni Correction.

image

Figure 1. Certified accuracy of our method against backdoor at- tacks on MNIST 1/7.

Experimental setup: We use a subset of the MNIST dataset [15] which only contains the handwritten digits ”1” and ”7” and is used for binary classification. Each digit image has a size  28∗28and has normalized pixel values within {0, 1/255, 2/255, · · · , 1}. For simplicity, we binarize the pixel values in our experiment. Specifically, if a pixel value is smaller than 0.5, we set it to be 0, otherwise we set it to be 1. Moreover, we randomly select 100 digits from the subset of the MNIST dataset to form the training dataset and randomly select another 1,000 digits as the testing examples. We use a two-layer neural network as the classifier.

Evaluation metric: We use certified accuracy as an evaluation metric. Specifically, for a given number of perturbed pixels/labels, certified accuracy is the fraction of testing examples, whose labels are correctly predicted by the smoothed function and whose certified radiuses are no smaller than the given number of perturbed pixels/labels.

Experimental results: Figure 1 shows the certified accuracy of our method against backdoor attacks with  β = 0.9, N = 10, 000, and  1 − α = 99.9%. Our method guarantees that 36% of testing images can be classified correctly when an attacker arbitrarily perturbs at most 2 pixels/labels of the training examples and pixels of a testing example.

In this work, we take the first step towards certified defenses against backdoor attacks. In particular, we study the feasibility and effectiveness of certifying robustness against backdoor attacks via randomized smoothing, which was originally developed to certify robustness against adversarial examples. Our method has two key steps. First, we treat the entire process of training a classifier and using the clas-sifier to predict the label of a testing example as a base function. Second, we add noise to the training data and a testing example to overwhelm the perturbation an attacker adds in a backdoor attack. Our results on a subset of MNIST demonstrate that it is theoretically feasible to certify robustness against backdoor attacks using randomized smoothing, but existing randomized smoothing methods have limited effectiveness. Our study highlights the needs of new techniques to certify robustness against backdoor attacks.

This work was supported by the National Science Foundation under grant No. 1937786. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

[1] Lawrence D Brown, T Tony Cai, and Anirban DasGupta. In- terval estimation for a binomial proportion. Statistical science, 2001. 3

[2] Xiaoyu Cao and Neil Zhenqiang Gong. Mitigating evasion attacks to deep neural networks via region-based classifica-tion. In ACSAC, 2017. 1, 2

[3] Nicholas Carlini and David Wagner. Towards evaluating the robustness of neural networks. In IEEE S & P, 2017. 1

[4] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Taesung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural networks by activation clustering. arXiv, 2018. 1

[5] Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz Koushan- far. Deepinspect: A black-box trojan detection and mitigation framework for deep neural networks. In IJCAI, 2019. 1

[6] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks on deep learning systems using data poisoning. arXiv, 2017. 1

[7] Edward Chou, Florian Tram`er, Giancarlo Pellegrino, and Dan Boneh. Sentinet: Detecting physical attacks against deep learning systems. arXiv, 2018. 1

[8] Jeremy M Cohen, Elan Rosenfeld, and J Zico Kolter. Cer- tified adversarial robustness via randomized smoothing. In ICML, 2019. 1, 2, 3

[9] Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya Nepal. Strip: A defence against trojan attacks on deep neural networks. In ACSAC, 2019. 1

[10] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2015. 1

[11] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Bad- nets: Identifying vulnerabilities in the machine learning model supply chain. arXiv, 2017. 1

[12] Wenbo Guo, Lun Wang, Xinyu Xing, Min Du, and Dawn Song. Tabor: A highly accurate approach to inspecting and restoring trojan backdoors in ai systems. arXiv, 2019. 1

[13] Jinyuan Jia, Xiaoyu Cao, Binghui Wang, and Neil Zhenqiang Gong. Certified robustness for top-k predictions against adversarial perturbations via randomized smoothing. In ICLR, 2020. 2

[14] Jinyuan Jia, Binghui Wang, Xiaoyu Cao, and Neil Zhenqiang Gong. Certified robustness of community detection against adversarial structural perturbation via randomized smoothing. In The Web Conference (WWW), 2020. 1, 2

[15] Yann LeCun, L´eon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 1998. 4

[16] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certified robustness to adversarial examples with differential privacy. In IEEE S & P, 2019. 2

[17] Guang-He Lee, Yang Yuan, Shiyu Chang, and Tommi S Jaakkola. Tight certificates of adversarial robustness for randomly smoothed classifiers. In NeurIPS, 2019. 1, 2, 3

[18] Bai Li, Changyou Chen, Wenlin Wang, and Lawrence Carin. Second-order adversarial attack and certifiable robustness. In NeurIPS, 2019. 2

[19] Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random selfensemble. In ECCV, 2018. 2

[20] Yingqi Liu, Wen-Chuan Lee, Guanhong Tao, Shiqing Ma, Yousra Aafer, and Xiangyu Zhang. Abs: Scanning neural networks for back-doors by artificial brain stimulation. In CCS, 2019. 1

[21] Yingqi Liu, Shiqing Ma, Yousra Aafer, Wen-Chuan Lee, Juan Zhai, Weihang Wang, and Xiangyu Zhang. Trojaning attack on neural networks. In NDSS, 2018. 1

[22] Jerzy Neyman and Egon Sharpe Pearson. Ix. on the problem of the most efficient tests of statistical hypotheses. Philosophical Transactions of the Royal Society of London. Series A, Containing Papers of a Mathematical or Physical Character, 231(694-706):289–337, 1933. 1, 2

[23] Ximing Qiao, Yukun Yang, and Hai Li. Defending neural backdoors via generative distribution modeling. In NeurIPS, 2019. 1

[24] Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed clas-sifiers. In NeurIPS, 2019. 2

[25] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. In ICLR, 2014. 1

[26] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bi- mal Viswanath, Haitao Zheng, and Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks. In IEEE S & P, 2019. 1

[27] Yuanshun Yao, Huiying Li, Haitao Zheng, and Ben Y Zhao. Latent backdoor attacks on deep neural networks. In CCS, 2019. 1

[28] Runtian Zhai, Chen Dan, Di He, Huan Zhang, Boqing Gong, Pradeep Ravikumar, Cho-Jui Hsieh, and Liwei Wang. Macer: Attack-free and scalable robust training via maximizing certified radius. In ICLR, 2020. 2


Designed for Accessibility and to further Open Science