Adaptive versus Standard Descent Methods and Robustness Against Adversarial Examples

2019·Arxiv

Abstract

Abstract

Adversarial examples are a pervasive phenomenon of machine learning models where seemingly imperceptible perturbations to the input lead to misclassifications for otherwise statistically accurate models. In this paper we study how the choice of optimization algorithm influences the robustness of the resulting classifier to adversarial examples. Specifically we show an example of a learning problem for which the solution found by adaptive optimization algorithms exhibits qualitatively worse robustness properties against both -adversaries than the solution found by non-adaptive algorithms. Then we fully characterize the geometry of the loss landscape of -adversarial training in least-squares linear regression. The geometry of the loss landscape is subtle and has important consequences for optimization algorithms. Finally we provide experimental evidence which suggests that non-adaptive methods consistently produce more robust models than adaptive methods.

1 Introduction

Adversarial examples are a pervasive phenomenon of machine learning models where perturbations of the input that are imperceptible to humans reliably lead to confident incorrect classifications (Szegedy et al. (2013); Good- fellow et al. (2014)). Since this phenomenon was first observed, researchers have attempted to develop methods which produce models that are robust to adversarial perturbations under specific attack models (Wong and Kolter (2018); Sinha et al. (2018); Raghunathan et al. (2018); Mirman et al. (2018); Madry et al. (2018); Zhang et al. (2019)). As machine learning proliferates into society, including security-critical settings like health care (Esteva et al. (2017)) or autonomous vehicles (Codevilla et al. (2018)), it is crucial to develop methods that allow us to understand the vulnerability of our models and design appropriate counter-measures.

Additionally there is a growing literature on the theory of adversarial examples. Many of these results attempt to understand adversarial examples by constructing examples of learning problems for which it is difficult to construct a classifier that is robust to adversarial perturbations. This difficulty may arise due to sample complexity (Schmidt et al. (2018)), computational constraints (Bubeck et al. (2019); Degwekar et al. (2019)), or the high-dimensional geometry of the initial feature space (Shafahi et al. (2019); Khoury and Hadfield-Menell (2018)). We expand upon these results in Section 2.

Currently less well-understood, and to our knowledge not addressed by the theoretical literature on adversarial examples, is how our algorithmic choices effect the robustness of our models. With respect to optimization and generalization, but importantly not robustness, the success of standard (or non-adaptive) gradient descent methods, including stochastic gradient descent (SGD) and SGD with momentum, is starting to be better understood (Du et al. (2019); Allen-Zhu et al. (2019); Gunasekar et al. (2018a;b)). However, as an increasing amount of time has been spent training deep networks, researchers and practitioners have heavily employed adaptive gradient methods, such as Adam (Kingma and Ba (2015)), Adagrad (Duchi et al. (2011)), and RMSprop (Tieleman and Hinton (2012)), due to their rapid training times (Karparthy (2017)). Unfortunately the properties of adaptive optimization algorithms are less well-understood than those of their non-adaptive counterparts. Wilson et al. (2017) provide theoretical and empirical evidence which suggests that adaptive algorithms often produce solutions that generalize worse than those found by non-adaptive algorithms.

In this paper, we study the robustness of solutions found by adaptive and non-adaptive algorithms to adversarial examples. Furthermore we study the effect of adversarial training on the geometry of the loss landscape and, consequently, on the solutions found by adaptive and non-adaptive algorithms for the adversarial training objective. Our paper makes the following contributions.

• We show an example of a learning problem for which the solution found by adaptive optimization algorithms exhibits qualitatively worse robustness properties against both - and -adversaries than the solution found by non-adaptive algorithms. Furthermore the robustness of the adaptive solution decreases rapidly as the dimension of the problem increases, while the robustness of the non-adaptive solution is stable as the dimension increases.

• We fully characterize the geometry of the loss landscape of -adversarial training in least-squares linear regression. The -adversarial training objective is convex everywhere; moreover, it is strictly convex everywhere except along either 0, 1, or 2 line segments, depending on the value of . Furthermore for nearly all choices1 of , these line segments along which is convex, but not strictly convex, lie outside of the rowspace and the gradient along these line segments is nonzero. It follows that any reasonable optimization algorithm finds the unique global minimum of .

• We conduct an extensive empirical evaluation to explore the effect of different optimization algorithms on robustness. Our experimental results suggest that non-adaptive methods consistently produce more robust models than adaptive methods.

• We provide a dataset consisting of 190 pretrained models on MNIST and CIFAR10 with various hyperparameter settings. Of these 190 pretrained models, 150 were used to find the best hyperparamter settings for our experiments and evaluated on a validation set. The remaining 40 pretrained models were evaluated on the test set. Of the 150 validation models, 88 were trained using natural training and 62 were trained using adversarial training. Of the 40 test models, 20 were trained using natural training and 20 were trained using adversarial training. They can be downloaded at https://www.dropbox.com/s/edfcnb97lzxl19z/models.zip.

2 Related Work

There has been a long line of work on the theory of adversarial examples. Schmidt et al. (2018) explore the sample complexity required to produce robust models. They demonstrate a simple setting, a mixture of two Gaussians, in which a linear classifier with near perfect natural accuracy can be learned from a single sample, but any algorithm that produces any binary classifier requires samples to produce a robust classifier. Followup work by Bubeck et al. (2019) suggests that adversarial examples may arise from computational constraints. They exhibit pairs of distributions that differ only in a k-dimensional subspace, and are otherwise standard Gaussians, and show that while it is information-theoretically possible to distinguish these distributions, it requires exponentially many queries in the statistical query model of computation. We note that both of these constructions produce distributions whose support is the entirety of .

Bubeck et al. (2019) further characterize five mutually exclusive “worlds” of robustness, inspired by similar characterizations in complexity theory (Impagliazzo (1995)). A learning problem must fall into one of the following possibilities:

World 4: Computationally efficient robust classifiers exist and can be learned from few samples, but learning is inefficient.

While learning problems can be constructed that fall into each possible world, the question for researchers is into which world are problems from practice most likely to fall? Every theoretical construction, such as those by Schmidt et al. (2018) and Bubeck et al. (2019), can be thought of as providing evidence for the prevalence of one of the worlds. In the language of Bubeck et al. (2019), the sampling complexity result of Schmidt et al. (2018) provides evidence for World 3, by constructing an example of a problem that falls into world three. The learning problem constructed by Bubeck et al. (2019) provides evidence for World 4. Subsequent work by Degwekar et al. (2019) provides evidence for Worlds 2 and 4. Under standard cryptographic assumptions, Degwekar et al. (2019) construct an a learning problem for which a computationally efficient non-robust classifier exists, no efficient robust classifier exists, but an inefficient robust classifier exists. Similarly, assuming the existence of one-way functions, they construct a learning problem for which an efficient robust classifier exists, but it is computationally inefficient to learn a robust classifier. Finally, in an attempt to understand how likely World 4 is in practice, they show that any task where an efficient robust classifier exists but is hard to learn in polynomial time implies one-way functions.2

Additionally there is a line work that attempts to explain the pervasiveness of adversarial examples through the lens of high-dimensional geometry. Gilmer et al. (2018) experimentally evaluated the setting of two concentric under-sampled 499-spheres embedded in , and concluded that adversarial examples occur on the data manifold. Shafahi et al. (2019) suggest that adversarial examples may be an unavoidable consequence of the high-dimensional geometry of data. Their result depends upon the use of an isopermetric inequality. The main drawback of these results, as well as the constructions of Schmidt et al. (2018) and Bubeck et al. (2019), is that they assume that the support of the data distribution has full or nearly full dimension. We do not believe this to be the case in practice. Instead we believe that the data distribution is often supported on a very low-dimensional subset of . This case is addressed in Khoury and Hadfield-Menell (2018), where they consider the problem of constructing decision boundaries robust to adversarial examples when data is drawn from a low-dimensional manifold embedded in . They highlight the role of co-dimension, the difference between the dimension of the embedding space and the dimension of the data manifold, as a key source of the pervasiveness of adversarial vulnerability. Said differently, it is the low-dimensional structure of features embedded in high-dimensional space that contributes, at least in part, to adversarial examples. This idea is also explored in Nar et al. (2019), but with emphasis on the cross-entropy loss.

We believe that problems in practice are most likely to fall into World 5, the best of all worlds. Problems in this class have robust classifiers which are efficient to evaluate and can be learned efficiently from relatively few samples. We simply haven’t found the right algorithm for learning such classifiers. The goal of this paper is to explore the effect of our algorithms on robustness. Specifically we wish to understand the robustness properties of solutions found by common optimization algorithms. To our knowledge no other work has explored the robustness properties of solutions found by different optimization algorithms.

3 Adaptive Algorithms May Signiﬁcantly Reduce Robustness

Wilson et al. (2017) explore the effect of different optimization methods on generalization both in a simple theoretical setting and empirically. For their main theoretical result, they construct a learning problem for which the solution found by any adaptive method, denoted , has worse generalization properties than the solution found by non-adaptive methods, denoted . We recall their construction in Section 3.1. In Section 3.2 we describe the adaptive solution and in Section 3.3 we describe the non-adaptive solution .

Generalization and robustness are different properties of a classifier. A classifier can generalize well but have terrible robustness properties, as we often see in practice. On the other hand, a constant classifier generalizes poorly, but has perfect robustness (Zhang et al. (2019)). Wilson et al. (2017) study the generalization properties of and , but not their robustness properties. In Section 3.4 we study the robustness properties of and . Specifically, we show that exhibits superior robustness properties to against both - and -adversaries.

3.1 A Simple Learning Problem

Let be a design matrix representing a dataset with n sample points and d features and let be a vector of labels. Wilson et al. (2017) restrict their attention to binary classification problems of this type, and learn a classifier by minimizing the least-squares loss

They construct the following learning problem for which they can solve for both the adaptive and non-adaptive solutions in closed form. Their construction uses an infinite-dimensional feature space for simplicity, but they note that 6n dimensions suffice. For , sample with probability p, and with probability for some p > 0.5. Then set to be the infinite-dimensional vector

For example, a dataset with three sample points following Equation 3 is 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . . 1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 . . .

The first feature encodes the label, and is alone sufficient for classification. Note that this trick of encoding the label is also commonly used in the robustness literature to construct examples of hard-to-learn-robustly problems (Bubeck et al. (2019); Degwekar et al. (2019)). The second and third feature are identically 1 for every sample. Then there is a subset of five dimensions which are identified with and contain a set of features which are unique to . If then there is a single 1 in this subset of five dimensions and is the only sample with a 1 in this dimension. If then all five dimensions are set to 1 and again is the only sample with a 1 at these five positions.

While this problem may seem contrived, it contains several properties that are common in machine learning problems and that are particularly important for robustness. It contains a single robust feature that is strongly correlated with the label. However it may not be easy for an optimization algorithm to identify such a feature. Additionally there are many non-robust features which are weakly or not at all correlated with the label, but which may appear useful for generalization because they are uniquely identified with samples from specific classes. Wilson et al. (2017) show that both adaptive and non-adaptive methods find classifiers that place at least some weight on every nonzero feature.

3.2 The Adaptive Solution wada

Let (X, y) be generated by the generative model in Section 3.1. When initialized at the origin, Wilson et al. (2017) show that any adaptive optimization algorithm – such as RMSprop, Adam, and Adagrad – minimizing Equation 1 for (X, y) converges to where

Thus we can write for some positive constant . On a test example , that is distinct from all the training examples, . Thus labels every unseen example as a positive example.

3.3 The Non-adaptive Solution wSGD

For (X, y), let P, N denote the sets of positive and negative samples in X respectively. Let and note that . When the weight vector is initialized in the row space of X, Wilson et al. (2017) show that all non-adaptive methods – such as gradient descent, SGD, SGD with momentum, Nesterov’s method, and conjugate gradient – converge to , where denotes the pseudo-inverse. That is, among the infinitely many solutions of the underdetermined system Xw = y, non-adaptive methods converge to the solution which minimizes , and thus maximizes the -margin. Specifically where

Note that these values for differ slightly from those presented in Wilson et al. (2017). In Appendix B we discuss in detail two errors in their derivation that lead to this discrepancy. These errors do not qualitatively change their results. Furthermore, in Appendix A.2 we carefully discuss under what conditions is positive and negative for . For now, we simply state that for all correctly classifies every test example.

3.4 Analyzing the Robustness of wada and wSGD

In this section we analyze the robustness properties of and against - and -adversaries. We show that exhibits considerably more robustness against both - and -adversaries than . A priori this is surprising; one may have expected , which is a small -norm solution, to be more robust to -perturbations, while , which is a small -norm solution, to be robust to -perturbations. However this expectation is wrong. Interestingly the robustness of against both - and -adversaries decreases as the dimension increases, whereas the robustness of does not. Finally, neither method recovers the “obvious” solution , which generalizes well and is optimally robust against both - and -perturbations.

Theorems 1 and 3 are our main results of this section; the proofs are deferred to Appendix A. We start by computing the robustness of against - and -adversaries.

Theorem 1. Let be a test sample that is correctly classified by and let be a perturbation. The adaptive solution is robust against any -perturbation for which

and any -perturbation for which

Furthermore these bounds are tight, meaning that an - or -ball with these radii centered at intersects the decision boundary.

Corollary 2. Asymptotically, the - and -robustness of are, respectively,

In particular both the - and -robustness go to 0 as the number of samples .

Corollary 2 makes clear, qualitatively, the result in Theorem 1. The rate at which the - and -robustness of decrease reflects a dependence on dimension. The number of dimensions on which puts nonzero weight increases as we increase the number of samples, which reduces robustness. We also find it interesting that, despite classifying every test point as a positive example, ’s predictions on correctly classified test samples are brittle. In summary, exhibits nearly no robustness against - or -adversaries.

Next we show that exhibits significant robustness against both - and -adversaries.

Theorem 3. Let be a test sample that is correctly classified by and let be a perturbation. The SGD solution is robust against any -perturbations for which

and any -perturbation for which

Furthermore these bounds are tight, meaning that an - or -ball with these radii centered at intersects the decision boundary.

Corollary 4. Asymptotically, the - and -robustness of are both . In particular the -robustness approaches 1 and the -robustness approaches as the number of samples .

Unsurprisingly, , which maximizes the -margin, exhibits near optimal robustness against -adversaries. As the number of samples increases, the -robustness of approaches 1. Perhaps surprisingly, also exhibits moderate robustness to -perturbations. As the number of samples increases, the -robustness of approaches . Unlike , the amount of robustness exhibited by does not decrease as the dimension increases, instead asymptotically approaching a constant.

However the -robustness of is not exactly 1 for any finite sample. One class approaches 1 from above, while the other class approaches 1 from below. To maximize the margin, places a small amount of weight on every other nonzero feature, even though all but the first are useless for classification. This lack of sparsity is also what causes the -robustness to drop from a possible maximum of 1 to . In contrast, generalizes perfectly, has -robustness equal to 1 for both classes, and, as an added benefit, has -robustness equal to 1 for both classes. Thus we have an example of a problem for which the max -margin solution could reasonably be considered to not be the best classifier against -perturbations.

Furthermore, is not in the row space of X. (is the projection of onto the row space.) Thus non-adaptive methods, when restricted to the row space, are incapable of recovering , irrespective of sample complexity (Schmidt et al. (2018)) or computational considerations (Bubeck et al. (2019)). This is simply the wrong algorithm for the desired objective. In the next section we study the effect that adversarial training has on the loss landscape and on the solutions found by various optimization algorithms.

4 Adversarial Training (Almost) Always Helps

In the previous section we presented a learning problem for which adaptive optimization methods find a solution with significantly worse robustness properties against both - and -adversaries compared to non-adaptive methods. In this section we consider a different algorithm, adversarial training, for finding robust solutions to Equation 1. We are interested in two questions. First, does adversarial training sufficiently regularize the loss landscape so that adaptive and non-adaptive methods find solutions with identical or qualitatively similar robustness properties? Second, are the solutions to the robust objective qualitatively different than those found by natural training or does adversarial training simply choose a robust solution from the space of solutions to the natural problem? We address the first question in Section 4.2 for -adversarial training and the second in Section 4.3 for the learning problem defined in Section 3.

4.1 The Adversarial Training Objective

Madry et al. (2018) formalize adversarial training by introducing the robust objective

where D is the data distribution, is a perturbation set meant to enforce a desired constraint, and L is a loss function. The goal then is to find a setting of the parameters w of the model that minimize the expected loss against the worst-case perturbation in .

Take L as in Equation 1 and to be an -ball of radius . In the linear case, we can solve the inner maximization problem exactly.

The third identity follows from the definition of the dual norm, where denotes the norm dual to the norm that defines . As a technical note, it is important that sign(0) = 1 (or ) and not equal to 0. This choice represents the fact that the solution to the inner maximization problem for each individual squared term is nonzero even if .

At first glance the objective looks similar to ridge regression or Lasso, particularly when we consider -and -adversarial training for which the dual norms are and respectively. However the solutions to this objective are not, in general, identical to the ridge regression or Lasso solutions. In Section 4.2 we will show how the second term influences the geometry of the loss landscape when .

4.2 The Geometry of the Loss Landscape

For the remainder of the paper we will exclusively analyze the case where is an -ball, leaving the case of for future work. We define the loss of interest

To build intuition, suppose that Xw = y is an underdetermined system. (Our results will not depend on this assumption.) The set of solutions is given by the affine subspace , where is the pseudo-inverse. The first thing to notice about is that, on its own, it is non-convex, having local minima both at the origin and in S. Along any path starting at the origin and ending at a point in S, the loss landscape induced by is negatively curved.

The second thing to notice about is that it is non-smooth. To understand the loss landscape of Equation 12, it is crucial to understand where is non-smooth. The term is non-smooth at any point w where some . Geometrically, is the equation of a hyperplane with normal vector and bias . The hyperplane partitions into two halfspaces , such that every point has and has . The set of hyperplanes define a hyperplane arrangement H, a subdivision of into convex cells. See Figure 1. Let be a cell of the hyperplane arrangement3. Every point w in the interior Int C of C lies on the same side of every hyperplane as every other point in Int C. Thus we can identify each C with a signature for any .

Theorem 5, our main result of this section, fully characterizes the geometry of Equation 12.

Theorem 5. is always a convex function, whose optimal solution(s) always lie in rowspace(X). There are four possible cases, three of which depend on the value of .

1. If Xw = y is an inconsistent system, then is a strictly convex function.

2. If Xw = y is a consistent system and