b

DiscoverSearch
About
My stuff
Regularization Helps with Mitigating Poisoning Attacks: Distributionally-Robust Machine Learning Using the Wasserstein Distance
2020·arXiv
Abstract
Abstract

We use distributionally-robust optimization for machine learning to mitigate the effect of data poisoning attacks. We provide performance guarantees for the trained model on the original data (not including the poison records) by training the model for the worst-case distribution on a neighbourhood around the empirical distribution (extracted from the training dataset corrupted by a poisoning attack) defined using the Wasserstein distance. We relax the distributionally-robust machine learning problem by finding an upper bound for the worst-case fitness based on the empirical sampled-averaged fitness and the Lipschitz-constant of the fitness function (on the data for given model parameters) as regularizer. For regression models, we prove that this regularizer is equal to the dual norm of the model parameters. We use the Wine Quality dataset, the Boston Housing Market dataset, and the Adult dataset for demonstrating the results of this paper.

Data poisoning attacks refer to adversarial attacks on machine learning by adding malicious entries to the training dataset in order to manipulate the model [34]. The malicious data could be in the form of label modification or flipping (i.e., changing the outputs in supervised machine learning), data insertion attacks (i.e., adding a limited number of arbitrary data points), and data modification attacks (i.e., modifying some features or labels for an arbitrary subset of the data). These attacks have proved to be powerful in practice [4, 3].

Regularization has been shown to be effective in training machine learning models that are robust against data poisoning attacks [11]. This is motivated by that regularization can reduce the impact of contaminated data records during training. Regularization can make the models generalize better to data that is not directly in the training dataset. This is however only made for support vector machines. A general analysis of why regularization can be an effective method for mitigating the effect of data poisoning attacks is missing from the literature. This is investigated in this paper with the aid of distributionally-robust machine learning.

In this paper, we show that, by using distributionally-robust optimization [12] for machine learning, we can combat data poisoning attacks. We show that we can guarantee the performance of a trained model on the original data (not including the poison records) by training the model for the worst-case distribution on a ball around the empirical distribution (extracted using the adversariallymanipulated training data) defined using the Wasserstein distance. The Wasserstein distance can be seen as the optimal mass transportation plan between two distributions [20]. We find an upper bound for the worst-case expected fitness based on the empirical sampled-averaged fitness and the Lipschitz-constant of the fitness function on the data for given model parameters. This allows us to relax the distributionally-robust machine learning problem to optimizing the empirical sampled-averaged fitness plus the Lipschitz constant as a regularizer. For regression models, we prove that this regularizer is equal to the dual norm of the model parameters. We use three different datasets, i.e., the Wine Quality dataset [9], the Boston Housing Market dataset [18], and the Adult dataset [23] for demonstrating the results of this paper.

The rest of the paper is organized as follows. We discuss the related work on poisoning attacks in adversarial machine learning in Section 2. Background information on the Wasserstein distance is presented in Section 3. In Section 4, we present the distributionally-robust machine learning problem using the Wasserstein distance and transform the distributionally-robust optimization problem into a regularized machine learning problem. We discuss the results in the context of linear and logistic regression in Section 5. Finally, we present some numerical results in Section 6 and conclude the paper in Section 7.

Machine learning under malicious noise has been investigated for a long time [33, 21, 7, 19, 31, 32]. The idea of sub-sampling was proposed in [21]. This approach relies on taking several random sub-samples of the dataset and training machine learning models on them. Then, we use the model with the smallest training error. Sub-sampling is powerful when a sufficiently small fraction of the data is malicious [34]. Approaches based on outlier removal are proposed in [22, 10, 2]. They rely on identifying and removing a small fraction of the data that is anomalous (i.e., follows a different distribution than the rest). Another approach is to use robust learning by trimmed optimization, which relies on minimizing the empirical risk while pruning a fraction of the data with the largest error [26]. Finally, regularization has been used as a method for mitigating data poisoning attacks when training support vector machines [11]. Although different in nature, regularization has been also shown to be useful for mitigating decision-time (evasion) attacks [24, 25].

Despite the success of regularization in adversarial machine learning, a general analysis of why it is an effective apparatus for mitigating data poisoning attacks is still missing from the literature. In addition, although Wasserstein distance has been used in machine learning [15, 1, 5, 12], it has not been used for designing distributionally-robust machine learning models against adversarial data poisoning attacks.

An important aspect of our work is that we do not assume anything about the adversary, e.g., its cost function or motivation, except that there exists a bound on its ability in poisoning the training dataset, as opposed to studies relying on game-theoretic methods for adversarial machine learning [13, 35, 27] that need to model an adversary perfectly.

Let M(Ξ) denote the set of all probability distributions Q on Ξ  ⊆ Rmsuch that Eξ∼Q{∥ξ∥} < ∞for some norm  ∥·∥on  Rm. When it is evident from the context, we replace  Eξ∼Q{·}with  EQ{·}. The Wasserstein distance W : M(Ξ)×M(Ξ)  →R≥0:=  {x ∈ R|x ≥ 0}is defined as

image

for all  P1, P2 ∈ M(Ξ) [20]. The decision variable Π can be interpreted as a transportation plan for moving mass distribution described by  P1to mass distribution described by  P2. In this case,  W(P1, P2) can be seen as the optimal mass transportation plan. This metric is sometimes referred to as the Earth-Mover distance. Note that  W(P1, P2) =  W(P2, P1). We can compute the Wasserstein distance using an alternative approach due to [20] by

image

where L denotes the set of all Lipschitz functions with Lipschitz constant upper bounded by one, i.e., all functions f such that  |f(ξ1) − f(ξ2)| ≤ ∥ξ1 − ξ2∥for all ξ1, ξ2 ∈Ξ. Now, we are ready to pose the problem of training machine learning models under poisoning attacks using distributionally-robust optimization.

image

Consider a supervised learning problem with dataset  {(xi, yi)}ni=1, where  xi ∈Rpxdenotes the inputs and  yi ∈ Rpydenotes the outputs. For instance, in image classification,  xiis a vector of pixels from an image (which could be compressed by extracting important features from the image) and  yiis the class to which that image belongs (mapped to a set of integers). Our goal is to extract a machine learning model  M(·; θ) :  Rpx → Rpyto describe the relationship between inputs and outputs. This is often done by solving the stochastic program

image

where Θ is the set of feasible model parameters,  ℓ : Rpy × Rpy → Ris the loss function, and P is the distribution of the inputs and outputs (x, y) of the model. Since we do not know the distribution P, we rely on the training dataset {(xi, yi)}ni=1to solve the sample-averaged approximation problem in

image

Even if we knew P, we cannot solve (1) directly (in general) as integrating the loss function for arbitrary distributions is a computationally complicated task [17]. The approximation in (2) is the basis of machine learning literature. When n is large enough, solving (2) is a good proxy for (1) if the training dataset is clean. However, in practice, the dataset might be compromised due to adversarial inputs.

We assume that the training dataset is composed of two types of correct and malicious samples. The correct samples are independently and identically distributed according to probability density function P. The malicious samples are inserted into the training dataset by an adversary with the hope manipulating the trained model. These samples are independently and identically distributed according to probability density function Q. When training the machine learning model, we do not know either of these probability distributions. We only know that a sample is correct with probability 1  − βand malicious with probability  β. That is, the ratio of the corrupted data entries is  β. Hence, any given record (xi, yi) in the training dataset are independently and identically distributed according to D = (1  − β)P + βQ. We make the following standing assumption regarding the distribution of the training data.

image

Assumption 1 implies that D, which is composed of P and Q, is a lighttailed distribution. All probability distributions with compact (i.e., bounded and closed) support meet this condition. This assumption is often implicit in the machine learning literature as, for heavy-tailed distributions (i.e., distributions that do not meet Assumption 1), the sample average of the loss may not even converge to the expected loss in general [6, 8] and, as a result, (2) might not be a good proxy for (1).

Let us define the empirical probability distribution

image

where  δξis the Dirac distribution function with its mass concentrated at  ξ, i.e., �δξ(x)dx= 1 and  δξ(x) = 0 for all  x ̸= ξ. Evidently,

image

and, as a result, (2) can be rewritten as

image

By distributionally-robust machine learning, we mean that we would like to devise an algorithm for extracting a trained machine learning model ˆθn ∈Θ such that

image

where  β ∈(0, 1) is a significance parameter with respect to distribution  Dnand ˆJnis an upper bound that may depend on the training dataset (xi, yi)ni=1and the trained model ˆθn. Clearly, ˆJnand ˆθnboth depend on the choice of significance parameter  β ∈(0, 1).

In this paper, we extract the trained model by solving the distributionally-robust optimization problem in

image

for some constant  ρ >0. Note that ˆθn ∈Θ is the minimizer of (6). This problem formulation is motivated by that the distributions of the correct samples P is within a small neighbourhood of the empirical probability distribution �Dnin the sense of the the Wasserstein distance if n is large enough and  βis relatively small. Therefore, optimizing supG:W(G,�Dn)≤ρ EG{ℓ(M(x; θ), y)}can act as a surrogate for optimizing  EP{ℓ(M(x; θ), y)}. This is proved in the following lemma.

image

image

Proof. Following [14] and [12], we know that  Dn{W(�Dn, D) ≤ ζ(γ)} ≤ 1 − γ.Now, note that

image

where the first inequality follows from the convexity of the Wasserstein distance [28, Lemma 2.1] and the second inequality follows from that W(P, P) = 0. The triangle inequality for the Wasserstein distance [29, p. 170] implies that W(�Dn, P) ≤ W(�Dn, D) + W(D, P), and, using the inequality in (7),  W(�Dn, P) ≤W(�Dn, D) +  βW(Q, P).Therefore,  W(�Dn, P) ≤ ζ(γ) +  βW(D, P) if  W(�Dn, D) ≤ζ(γ), which implies that  Dn{W(�Dn, P) ≤ ζ(γ) +  βW(Q, P)} ≥ 1 − γ.This concludes the proof.

Now, that we know that the distributions of the correct samples P is within a small neighbourhood of the empirical probability distribution �Dn(defined by the Wasserstein distance) with a high probability, the solution of (6) should provide a bound for the expected loss function in the sense of (5). This is proved in the following theorem.

image

Proof. The proof follows from [12, Theorem 3.4].

We can solve (6) explicitly following the approach in [12]. However, in this paper, we show that we can simplify the aforementioned distributionally-robust optimization problem by finding an upper bound for the worst-case expected fitness supG:W(G,�Dn)≤ρ EG{ℓ(M(x; θ), y)}. This is done in the following lemma.

Lemma 2. Assume that  ℓ(M(x; θ), y) is  L(θ)-Lipschitz continuous in (x, y) for a fixed  θ ∈Θ. Then,

image

Proof. First, note that

image

We have

image

where Π is a joint disribution on (x, y) and (x′, y′) with marginals G and �Dn, respectively. Therefore,

image

This concludes the proof.

For continuously-differentiable  ℓ(M(x; θ), y), we can compute the Lipschitz constant based on the gradients. However, continuous differentiability is not required for Lipschitz continuity [30, p. 61]. We explore continuously-differentiable loss functions in the next lemma.

image

Proof. If  ℓ(M(x; θ), y) is continuously differentiable in (x, y), the mean value theorem [16] implies that, there exists (x′′, y′′) over the convex combination of (x, y) and (x′, y′) such that

image

where  ∥ · ∥∗is the dual norm of  ∥ · ∥. Hence, we can show that  ℓ(M(x; θ), y) is L(θ)-Lipschitz continuous in (x, y).

Note that we can select any function of the model parameters  L(θ) that is larger than the norm of the gradient sup(x,y)∈supp(D) ∥∇(x,y)ℓ(M(x; θ), y)∥∗in Lemma 3 as the Lipschitz constant, and thus as the regularizer. The choice in Lemma 3 is merely the tightest Lipschitz constant for continuously-differentiable loss functions.

Now, we can define the regularized sample-averaged optimization problem in

image

We can similarly define ˆθn ∈Θ as the minimizer of (8). This optimization problem no longer involves taking a supremum over the probability density function and is thus computationally much easier to solve in compassion to (6) (as it does not require solving an infinite-dimensional optimization problem). We can still prove a performance guarantee for the optimizer of (8) in the sense of (5). This is done in the next theorem.

image

Although the results of this paper hold for any machine learning problem, convex or non-convex, we simplify the regularizer for linear and logistic regression models for demonstration purposes. This is the topic of the next section.

We start by investigating linear regression and then continue to logistic regression.

5.1 Linear Regression

We start with linear regression. Without loss of generality, assume  py= 1. If py >1, we can treat each output independently. In this case,

image

and the fitness function is

image

We also assume that (x, y) belong to compact set  X × Y ⊆ Rpx × R. Following Lemma 3, we have

image

where

image

The regularizer in (10) is the tightest that we can use in (8). However, it requires solving an optimization problem for computing the regularizer which could lead to the use of additional computational resources. We can further simplify this Lipschitz constant by noting that

image

and

image

Therefor, we get a more conservative (i.e., larger) Lipschitz constant:

image

where X = maxx∈X ∥x∥and Y = maxy∈Y |y|. Therefore, for linear regression, the adversarially-robust optimization problem in (8) can be rewritten as

image

An alternative to the quadratic loss function in (9) is

image

In this case, we get  L(θ) =  ∥θ∥∗and the optimization problem in (8) can be rewritten as

image

Now, we move on to the logistic regression.

5.2 Logistic Regression

In this case, we have

image

and the fitness function is

image

Note the opposite sign with the log likelihood as here we minimize the expected fitness. We have

image

and

image

Therefore, we get

image

The robust optimization problem in (8) becomes

image

In the remainder of this paper, we demonstrate the results of this paper for regression models discussed above.

image

Figure 1: Test performance of the trained model for the Wine Quality dataset when using (11) to mitigate the effect of data modification attacks versus the percentage of the poisoned data  βfor various choices of regularization.

6.1 Dataset Description

We use three different datasets to demonstrate the results of this paper. The first dataset is the red Wine Quality dataset containing 1599 records of red wine tests [9]. There are 11 inputs capturing physicochemical and sensory measurements for the wine. The output is the quality scores ranging from zero to ten. The second dataset is the Boston Housing Market dataset containing 506 records from Boston neighbourhood house prices [18]. The inputs are features, such as per capita crime rate in the neighbourhood and average number of rooms. The output is the median home prices in the neighbourhood. The final dataset is the Adult dataset containing 48,842 records from the 1994 Census [23]. In this dataset, there are 14 individual attributes, such as age, race, occupation, and relationship status, as inputs and income level (i.e., above or below $50K per annum) as output. We translate all categorical attributes and outputs to integers.

6.2 Experiment Setup

We demonstrate the effect of data poisoning on regression. We train linear regression models for the Wine Quality and the Boston Housing Market datasets according to (11) and train logistic regression models for the Adult dataset according to (15). We split each dataset into two equal halves: one for training and the other for test. We consider two form of attacks: label flipping and data

image

Figure 2: Test performance of the trained model for the Wine Quality dataset when using (11) to mitigate the effect of label flipping attacks versus the percentage of the poisoned data  βfor various choices of regularization.

modification.

When using linear regression for the Wine Quality dataset, we use two attacks. The first attack is the data modification by changing the features of  βpercent of the dataset to a Gaussian variable with variance of 4. The second attack is a label flipping the output for  βpercent of the dataset to 10. When using linear regression for the Boston Housing Market dataset, we use the data modification by changing the features of  βpercent of the dataset to a Gaussian variable with variance of 100. Finally, when using logistic regression for the Adult dataset, we perform label flipping attack by negating the output for  βpercent of the dataset.

6.3 Experimental Results

We first consider the Wine Quality dataset. Figure 1 shows the test performance of the trained model using (11) for the data modification attack versus the percentage of the poisoned data  βfor various choices of regularization. In this graph,  c1denotes the weight behind  ∥θ∥2∗and  c2denotes the weight behind ∥θ∥∗in (11) . Here, we use the 2-norm as  ∥ · ∥. Therefore,  ∥ · ∥∗is also equal to the 2-norm. For large enough attack levels (i.e., large  β), the test performance for the regularized model is significantly better. However, this comes at the cost of a lower performance when there is no poisoning attack. This is because we are solving a distributionally-robust machine learning problem to mitigate the effect of poisoning attacks (with the conservative-ness dictated by  ρwhich is proportional to  c1and  c2). The robust model can act more conservative if there are no attacks as in those cases there was no need for robustness and regular-

image

Figure 3: Test performance of the trained model for the Boston Housing Market dataset when using (11) to mitigate the effect of data modification attacks versus the percentage of the poisoned data  βfor various choices of regularization.

ization. Figure 2 shows the test performance of the trained model using (11) for the label flipping attack versus the percentage of the poisoned data  βfor various choices of regularization. The same trend can also be observed in this case.

Now, we move to the Boston Housing Market dataset. Figure 3 shows the test performance of the trained model using (11) for the data modification attack versus the percentage of the poisoned data  βfor various choices of regularization. The regularized models beat the non-regularized one in terms of performance in the presence of poisoning attacks.

Finally, we consider the Adult dataset. Figure 4 illustrates the test performance of the trained model when using the logistic regression training in (15) to mitigate the effect of label flipping attacks versus the percentage of the poisoned data  βfor various choices of regularization. The regularization here also offers robustness against poisoning attacks.

We use distributionally-robust optimization for machine learning to mitigate the effect of data poisoning attacks. We show that this problem can be cast as a regularized machine learning problem. We provide performance guarantees for the trained model on the correct data by training the model for the worst-case distribution on a neighbourhood around the empirical distribution (extracted from the training dataset corrupted by a poisoning attack) defined using the Wasserstein distance. Future work can focus on validation of the results on

image

Figure 4: Test performance of the trained model for the Adult dataset when using (15) to mitigate the effect of label flipping attacks versus the percentage of the poisoned data  βfor various choices of regularization.

more general machine learning models, such as neural networks.

[1] Martin Arjovsky, Soumith Chintala, and L´eon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.

[2] Marco Barreno, Blaine Nelson, Anthony D Joseph, and J Doug Tygar. The security of machine learning. Machine Learning, 81(2):121–148, 2010.

[3] Battista Biggio, B Nelson, and P Laskov. Poisoning attacks against support vector machines. In 29th International Conference on Machine Learning, pages 1807–1814, 2012.

[4] Battista Biggio, Blaine Nelson, and Pavel Laskov. Support vector machines under adversarial label noise. In Asian Conference on Machine Learning, pages 97–112, 2011.

[5] Jose Blanchet, Yang Kang, and Karthyek Murthy. Robust wasserstein profile inference and applications to machine learning. Journal of Applied Probability, 56(3):830–857, 2019.

[6] Christian Brownlees, Emilien Joly, G´abor Lugosi, et al. Empirical risk minimization for heavy-tailed losses. The Annals of Statistics, 43(6):2507– 2536, 2015.

[7] Nader H Bshouty, Nadav Eiron, and Eyal Kushilevitz. Pac learning with nasty noise. Theoretical Computer Science, 288(2):255–275, 2002.

[8] Olivier Catoni. Challenging the empirical mean and empirical variance: A deviation study. In  Annales de l’Institut Henri Poincar´e, Probabilit´es etStatistiques, volume 48, pages 1148–1185, 2012.

[9] Paulo Cortez, Ant´onio Cerdeira, Fernando Almeida, Telmo Matos, and Jos´e Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, 47(4):547–553, 2009.

[10] Gabriela F Cretu, Angelos Stavrou, Michael E Locasto, Salvatore J Stolfo, and Angelos D Keromytis. Casting out demons: Sanitizing training data for anomaly sensors. In 2008 IEEE Symposium on Security and Privacy (sp 2008), pages 81–95. IEEE, 2008.

[11] Ambra Demontis, Battista Biggio, Giorgio Fumera, Giorgio Giacinto, and Fabio Roli. Infinity-norm support vector machines against adversarial label contamination. In ITASEC, pages 106–115, 2017.

[12] Peyman Mohajerin Esfahani and Daniel Kuhn. Data-driven distributionally robust optimization using the wasserstein metric: Performance guarantees and tractable reformulations. Mathematical Programming, 171(1-2):115– 166, 2018.

[13] Farhad Farokhi. A game-theoretic approach to adversarial linear support vector classification, 2019. Preprint: arXiv:1906.09721.

[14] Nicolas Fournier and Arnaud Guillin. On the rate of convergence in wasser- stein distance of the empirical measure. Probability Theory and Related Fields, 162(3-4):707–738, 2015.

[15] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A Poggio. Learning with a wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.

[16] William S. Hall and Martin L. Newell. The mean value theorem for vector valued functions: A simple proof. Mathematics Magazine, 52(3):157–158, 1979.

[17] Grani A Hanasusanto, Daniel Kuhn, and Wolfram Wiesemann. A com- ment on computational complexity of stochastic programming problems. Mathematical Programming, 159(1-2):557–569, 2016.

[18] David Harrison Jr and Daniel L Rubinfeld. Hedonic housing prices and the demand for clean air. Journal of environmental economics and management, 5(1):81–102, 1978.

[19] Adam Tauman Kalai, Adam R Klivans, Yishay Mansour, and Rocco A Servedio. Agnostically learning halfspaces. SIAM Journal on Computing, 37(6):1777–1805, 2008.

[20] L V Kantorovich and G Rubinshtein. On a space of totally additive func- tions. Vestn. Lening. Univ, 13:52–59, 1958.

[21] Michael Kearns and Ming Li. Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4):807–837, 1993.

[22] Adam R Klivans, Philip M Long, and Rocco A Servedio. Learning halfspaces with malicious noise. Journal of Machine Learning Research, 10(Dec):2715–2740, 2009.

[23] Ron Kohavi. Scaling up the accuracy of Naive-Bayes classifiers: A decision- tree hybrid. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pages 202–207, 1996.

[24] Bo Li and Yevgeniy Vorobeychik. Feature cross-substitution in adversarial classification. In Advances in neural information processing systems, pages 2087–2095, 2014.

[25] Bo Li and Yevgeniy Vorobeychik. Evasion-robust classification on binary domains. ACM Transactions on Knowledge Discovery from Data (TKDD), 12(4):50, 2018.

[26] Chang Liu, Bo Li, Yevgeniy Vorobeychik, and Alina Oprea. Robust linear regression against training data poisoning. In Proceedings of the 10th ACM Workshop on Artificial Intelligence and Security, pages 91–102. ACM, 2017.

[27] Yifan Ou and Reza Samavi. Mixed strategy game model against data poisoning attacks, 2019. Preprint: arXiv:1906.02872.

[28] G. C. Pflug and A. Pichler. Multistage Stochastic Optimization. Springer Series in Operations Research and Financial Engineering. Springer International Publishing, 2014.

[29] A. Pr¨ugel-Bennett. The Probability Companion for Engineering and Computer Science. Cambridge University Press, 2020.

[30] B. D. Reddy, D .D. Reddy, J. E. Marsden, L. Sirovich, M. Golubitsky, and W. Jager. Introductory Functional Analysis: With Applications to Boundary Value Problems and Finite Elements. Introductory Functional Analysis Series. Springer, 1998.

[31] Rocco A Servedio. Smooth boosting and learning with malicious noise. Journal of Machine Learning Research, 4(Sep):633–648, 2003.

[32] Jacob Steinhardt, Pang Wei W Koh, and Percy S Liang. Certified defenses for data poisoning attacks. In Advances in Neural Information Processing Systems, pages 3517–3529, 2017.

[33] Leslie G Valiant. Learning disjunction of conjunctions. In International Joint Conferences on Artificial Intelligence (IJCAI), pages 560–566, 1985.

[34] Yevgeniy Vorobeychik and Murat Kantarcioglu. Adversarial machine learn- ing. Synthesis Lectures on Artificial Intelligence and Machine Learning, 12(3):1–169, 2018.

[35] Rui Zhang and Quanyan Zhu. A game-theoretic defense against data poison- ing attacks in distributed support vector machines. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 4582–4587. IEEE, 2017.


Designed for Accessibility and to further Open Science