Adversarial examples are a pervasive phenomenon of machine learning models where perturbations of the input that are imperceptible to humans reliably lead to confident incorrect classifications (Szegedy et al. (2013); Good- fellow et al. (2014)). Since this phenomenon was first observed, researchers have attempted to develop methods which produce models that are robust to adversarial perturbations under specific attack models (Wong and Kolter (2018); Sinha et al. (2018); Raghunathan et al. (2018); Mirman et al. (2018); Madry et al. (2018); Zhang et al. (2019)). As machine learning proliferates into society, including security-critical settings like health care (Esteva et al. (2017)) or autonomous vehicles (Codevilla et al. (2018)), it is crucial to develop methods that allow us to understand the vulnerability of our models and design appropriate counter-measures.
Additionally there is a growing literature on the theory of adversarial examples. Many of these results attempt to understand adversarial examples by constructing examples of learning problems for which it is difficult to construct a classifier that is robust to adversarial perturbations. This difficulty may arise due to sample complexity (Schmidt et al. (2018)), computational constraints (Bubeck et al. (2019); Degwekar et al. (2019)), or the high-dimensional geometry of the initial feature space (Shafahi et al. (2019); Khoury and Hadfield-Menell (2018)). We expand upon these results in Section 2.
Currently less well-understood, and to our knowledge not addressed by the theoretical literature on adversarial examples, is how our algorithmic choices effect the robustness of our models. With respect to optimization and generalization, but importantly not robustness, the success of standard (or non-adaptive) gradient descent methods, including stochastic gradient descent (SGD) and SGD with momentum, is starting to be better understood (Du et al. (2019); Allen-Zhu et al. (2019); Gunasekar et al. (2018a;b)). However, as an increasing amount of time has been spent training deep networks, researchers and practitioners have heavily employed adaptive gradient methods, such as Adam (Kingma and Ba (2015)), Adagrad (Duchi et al. (2011)), and RMSprop (Tieleman and Hinton (2012)), due to their rapid training times (Karparthy (2017)). Unfortunately the properties of adaptive optimization algorithms are less well-understood than those of their non-adaptive counterparts. Wilson et al. (2017) provide theoretical and empirical evidence which suggests that adaptive algorithms often produce solutions that generalize worse than those found by non-adaptive algorithms.
In this paper, we study the robustness of solutions found by adaptive and non-adaptive algorithms to adversarial examples. Furthermore we study the effect of adversarial training on the geometry of the loss landscape and, consequently, on the solutions found by adaptive and non-adaptive algorithms for the adversarial training objective. Our paper makes the following contributions.
• We show an example of a learning problem for which the solution found by adaptive optimization algorithms exhibits qualitatively worse robustness properties against both - and
-adversaries than the solution found by non-adaptive algorithms. Furthermore the robustness of the adaptive solution decreases rapidly as the dimension of the problem increases, while the robustness of the non-adaptive solution is stable as the dimension increases.
• We fully characterize the geometry of the loss landscape of -adversarial training in least-squares linear regression. The
-adversarial training objective
is convex everywhere; moreover, it is strictly convex everywhere except along either 0, 1, or 2 line segments, depending on the value of
. Furthermore for nearly all choices1 of
, these line segments along which
is convex, but not strictly convex, lie outside of the rowspace and the gradient along these line segments is nonzero. It follows that any reasonable optimization algorithm finds the unique global minimum of
.
• We conduct an extensive empirical evaluation to explore the effect of different optimization algorithms on robustness. Our experimental results suggest that non-adaptive methods consistently produce more robust models than adaptive methods.
• We provide a dataset consisting of 190 pretrained models on MNIST and CIFAR10 with various hyperparameter settings. Of these 190 pretrained models, 150 were used to find the best hyperparamter settings for our experiments and evaluated on a validation set. The remaining 40 pretrained models were evaluated on the test set. Of the 150 validation models, 88 were trained using natural training and 62 were trained using adversarial training. Of the 40 test models, 20 were trained using natural training and 20 were trained using adversarial training. They can be downloaded at https://www.dropbox.com/s/edfcnb97lzxl19z/models.zip.
There has been a long line of work on the theory of adversarial examples. Schmidt et al. (2018) explore the sample complexity required to produce robust models. They demonstrate a simple setting, a mixture of two Gaussians, in which a linear classifier with near perfect natural accuracy can be learned from a single sample, but any algorithm that produces any binary classifier requires samples to produce a robust classifier. Followup work by Bubeck et al. (2019) suggests that adversarial examples may arise from computational constraints. They exhibit pairs of distributions that differ only in a k-dimensional subspace, and are otherwise standard Gaussians, and show that while it is information-theoretically possible to distinguish these distributions, it requires exponentially many queries in the statistical query model of computation. We note that both of these constructions produce distributions whose support is the entirety of
.
Bubeck et al. (2019) further characterize five mutually exclusive “worlds” of robustness, inspired by similar characterizations in complexity theory (Impagliazzo (1995)). A learning problem must fall into one of the following possibilities:
World 4: Computationally efficient robust classifiers exist and can be learned from few samples, but learning is inefficient.
While learning problems can be constructed that fall into each possible world, the question for researchers is into which world are problems from practice most likely to fall? Every theoretical construction, such as those by Schmidt et al. (2018) and Bubeck et al. (2019), can be thought of as providing evidence for the prevalence of one of the worlds. In the language of Bubeck et al. (2019), the sampling complexity result of Schmidt et al. (2018) provides evidence for World 3, by constructing an example of a problem that falls into world three. The learning problem constructed by Bubeck et al. (2019) provides evidence for World 4. Subsequent work by Degwekar et al. (2019) provides evidence for Worlds 2 and 4. Under standard cryptographic assumptions, Degwekar et al. (2019) construct an a learning problem for which a computationally efficient non-robust classifier exists, no efficient robust classifier exists, but an inefficient robust classifier exists. Similarly, assuming the existence of one-way functions, they construct a learning problem for which an efficient robust classifier exists, but it is computationally inefficient to learn a robust classifier. Finally, in an attempt to understand how likely World 4 is in practice, they show that any task where an efficient robust classifier exists but is hard to learn in polynomial time implies one-way functions.2
Additionally there is a line work that attempts to explain the pervasiveness of adversarial examples through the lens of high-dimensional geometry. Gilmer et al. (2018) experimentally evaluated the setting of two concentric under-sampled 499-spheres embedded in , and concluded that adversarial examples occur on the data manifold. Shafahi et al. (2019) suggest that adversarial examples may be an unavoidable consequence of the high-dimensional geometry of data. Their result depends upon the use of an isopermetric inequality. The main drawback of these results, as well as the constructions of Schmidt et al. (2018) and Bubeck et al. (2019), is that they assume that the support of the data distribution has full or nearly full dimension. We do not believe this to be the case in practice. Instead we believe that the data distribution is often supported on a very low-dimensional subset of
. This case is addressed in Khoury and Hadfield-Menell (2018), where they consider the problem of constructing decision boundaries robust to adversarial examples when data is drawn from a low-dimensional manifold embedded in
. They highlight the role of co-dimension, the difference between the dimension of the embedding space and the dimension of the data manifold, as a key source of the pervasiveness of adversarial vulnerability. Said differently, it is the low-dimensional structure of features embedded in high-dimensional space that contributes, at least in part, to adversarial examples. This idea is also explored in Nar et al. (2019), but with emphasis on the cross-entropy loss.
We believe that problems in practice are most likely to fall into World 5, the best of all worlds. Problems in this class have robust classifiers which are efficient to evaluate and can be learned efficiently from relatively few samples. We simply haven’t found the right algorithm for learning such classifiers. The goal of this paper is to explore the effect of our algorithms on robustness. Specifically we wish to understand the robustness properties of solutions found by common optimization algorithms. To our knowledge no other work has explored the robustness properties of solutions found by different optimization algorithms.
Wilson et al. (2017) explore the effect of different optimization methods on generalization both in a simple theoretical setting and empirically. For their main theoretical result, they construct a learning problem for which the solution found by any adaptive method, denoted , has worse generalization properties than the solution found by non-adaptive methods, denoted
. We recall their construction in Section 3.1. In Section 3.2 we describe the adaptive solution
and in Section 3.3 we describe the non-adaptive solution
.
Generalization and robustness are different properties of a classifier. A classifier can generalize well but have terrible robustness properties, as we often see in practice. On the other hand, a constant classifier generalizes poorly, but has perfect robustness (Zhang et al. (2019)). Wilson et al. (2017) study the generalization properties of and
, but not their robustness properties. In Section 3.4 we study the robustness properties of
and
. Specifically, we show that
exhibits superior robustness properties to
against both
- and
-adversaries.
3.1 A Simple Learning Problem
Let be a design matrix representing a dataset with n sample points and d features and let
be a vector of labels. Wilson et al. (2017) restrict their attention to binary classification problems of this type, and learn a classifier by minimizing the least-squares loss
They construct the following learning problem for which they can solve for both the adaptive and non-adaptive solutions in closed form. Their construction uses an infinite-dimensional feature space for simplicity, but they note that 6n dimensions suffice. For , sample
with probability p, and
with probability
for some p > 0.5. Then set
to be the infinite-dimensional vector
For example, a dataset with three sample points following Equation 3 is 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 . . .
1 1 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 . . .
The first feature encodes the label, and is alone sufficient for classification. Note that this trick of encoding the label is also commonly used in the robustness literature to construct examples of hard-to-learn-robustly problems (Bubeck et al. (2019); Degwekar et al. (2019)). The second and third feature are identically 1 for every sample. Then there is a subset of five dimensions which are identified with and contain a set of features which are unique to
. If
then there is a single 1 in this subset of five dimensions and
is the only sample with a 1 in this dimension. If
then all five dimensions are set to 1 and again
is the only sample with a 1 at these five positions.
While this problem may seem contrived, it contains several properties that are common in machine learning problems and that are particularly important for robustness. It contains a single robust feature that is strongly correlated with the label. However it may not be easy for an optimization algorithm to identify such a feature. Additionally there are many non-robust features which are weakly or not at all correlated with the label, but which may appear useful for generalization because they are uniquely identified with samples from specific classes. Wilson et al. (2017) show that both adaptive and non-adaptive methods find classifiers that place at least some weight on every nonzero feature.
3.2 The Adaptive Solution wada
Let (X, y) be generated by the generative model in Section 3.1. When initialized at the origin, Wilson et al. (2017) show that any adaptive optimization algorithm – such as RMSprop, Adam, and Adagrad – minimizing Equation 1 for (X, y) converges to where
Thus we can write for some positive constant
. On a test example
, that is distinct from all the training examples,
. Thus
labels every unseen example as a positive example.
3.3 The Non-adaptive Solution wSGD
For (X, y), let P, N denote the sets of positive and negative samples in X respectively. Let and note that
. When the weight vector is initialized in the row space of X, Wilson et al. (2017) show that all non-adaptive methods – such as gradient descent, SGD, SGD with momentum, Nesterov’s method, and conjugate gradient – converge to
, where
denotes the pseudo-inverse. That is, among the infinitely many solutions of the underdetermined system Xw = y, non-adaptive methods converge to the solution which minimizes
, and thus maximizes the
-margin. Specifically
where
Note that these values for differ slightly from those presented in Wilson et al. (2017). In Appendix B we discuss in detail two errors in their derivation that lead to this discrepancy. These errors do not qualitatively change their results. Furthermore, in Appendix A.2 we carefully discuss under what conditions
is positive and negative for
. For now, we simply state that for all
correctly classifies every test example.
3.4 Analyzing the Robustness of wada and wSGD
In this section we analyze the robustness properties of and
against
- and
-adversaries. We show that
exhibits considerably more robustness against both
- and
-adversaries than
. A priori this is surprising; one may have expected
, which is a small
-norm solution, to be more robust to
-perturbations, while
, which is a small
-norm solution, to be robust to
-perturbations. However this expectation is wrong. Interestingly the robustness of
against both
- and
-adversaries decreases as the dimension increases, whereas the robustness of
does not. Finally, neither method recovers the “obvious” solution
, which generalizes well and is optimally robust against both
- and
-perturbations.
Theorems 1 and 3 are our main results of this section; the proofs are deferred to Appendix A. We start by computing the robustness of against
- and
-adversaries.
Theorem 1. Let be a test sample that is correctly classified by
and let
be a perturbation. The adaptive solution
is robust against any
-perturbation for which
and any -perturbation for which
Furthermore these bounds are tight, meaning that an - or
-ball with these radii centered at
intersects the decision boundary.
Corollary 2. Asymptotically, the - and
-robustness of
are, respectively,
In particular both the - and
-robustness go to 0 as the number of samples
.
Corollary 2 makes clear, qualitatively, the result in Theorem 1. The rate at which the - and
-robustness of
decrease reflects a dependence on dimension. The number of dimensions on which
puts nonzero weight increases as we increase the number of samples, which reduces robustness. We also find it interesting that, despite classifying every test point as a positive example,
’s predictions on correctly classified test samples are brittle. In summary,
exhibits nearly no robustness against
- or
-adversaries.
Next we show that exhibits significant robustness against both
- and
-adversaries.
Theorem 3. Let be a test sample that is correctly classified by
and let
be a perturbation. The SGD solution
is robust against any
-perturbations for which
and any -perturbation for which
Furthermore these bounds are tight, meaning that an - or
-ball with these radii centered at
intersects the decision boundary.
Corollary 4. Asymptotically, the - and
-robustness of
are both
. In particular the
-robustness approaches 1 and the
-robustness approaches
as the number of samples
.
Unsurprisingly, , which maximizes the
-margin, exhibits near optimal robustness against
-adversaries. As the number of samples increases, the
-robustness of
approaches 1. Perhaps surprisingly,
also exhibits moderate robustness to
-perturbations. As the number of samples increases, the
-robustness of
approaches
. Unlike
, the amount of robustness exhibited by
does not decrease as the dimension increases, instead asymptotically approaching a constant.
However the -robustness of
is not exactly 1 for any finite sample. One class
approaches 1 from above, while the other class
approaches 1 from below. To maximize the margin,
places a small amount of weight on every other nonzero feature, even though all but the first are useless for classification. This lack of sparsity is also what causes the
-robustness to drop from a possible maximum of 1 to
. In contrast,
generalizes perfectly, has
-robustness equal to 1 for both classes, and, as an added benefit, has
-robustness equal to 1 for both classes. Thus we have an example of a problem for which the max
-margin solution could reasonably be considered to not be the best classifier against
-perturbations.
Furthermore, is not in the row space of X. (
is the projection of
onto the row space.) Thus non-adaptive methods, when restricted to the row space, are incapable of recovering
, irrespective of sample complexity (Schmidt et al. (2018)) or computational considerations (Bubeck et al. (2019)). This is simply the wrong algorithm for the desired objective. In the next section we study the effect that adversarial training has on the loss landscape and on the solutions found by various optimization algorithms.
In the previous section we presented a learning problem for which adaptive optimization methods find a solution with significantly worse robustness properties against both - and
-adversaries compared to non-adaptive methods. In this section we consider a different algorithm, adversarial training, for finding robust solutions to Equation 1. We are interested in two questions. First, does adversarial training sufficiently regularize the loss landscape so that adaptive and non-adaptive methods find solutions with identical or qualitatively similar robustness properties? Second, are the solutions to the robust objective qualitatively different than those found by natural training or does adversarial training simply choose a robust solution from the space of solutions to the natural problem? We address the first question in Section 4.2 for
-adversarial training and the second in Section 4.3 for the learning problem defined in Section 3.
4.1 The Adversarial Training Objective
Madry et al. (2018) formalize adversarial training by introducing the robust objective
where D is the data distribution, is a perturbation set meant to enforce a desired constraint, and L is a loss function. The goal then is to find a setting of the parameters w of the model that minimize the expected loss against the worst-case perturbation in
.
Take L as in Equation 1 and to be an
-ball of radius
. In the linear case, we can solve the inner maximization problem exactly.
The third identity follows from the definition of the dual norm, where denotes the norm dual to the
norm that defines
. As a technical note, it is important that sign(0) = 1 (or
) and not equal to 0. This choice represents the fact that the solution to the inner maximization problem for each individual squared term
is nonzero even if
.
At first glance the objective looks similar to ridge regression or Lasso, particularly when we consider -and
-adversarial training for which the dual norms are
and
respectively. However the solutions to this objective are not, in general, identical to the ridge regression or Lasso solutions. In Section 4.2 we will show how the second term
influences the geometry of the loss landscape when
.
4.2 The Geometry of the Loss Landscape
For the remainder of the paper we will exclusively analyze the case where is an
-ball, leaving the case of
for future work. We define the loss of interest
To build intuition, suppose that Xw = y is an underdetermined system. (Our results will not depend on this assumption.) The set of solutions is given by the affine subspace , where
is the pseudo-inverse. The first thing to notice about
is that, on its own, it is non-convex, having local minima both at the origin and in S. Along any path starting at the origin and ending at a point in S, the loss landscape induced by
is negatively curved.
The second thing to notice about is that it is non-smooth. To understand the loss landscape of Equation 12, it is crucial to understand where
is non-smooth. The term
is non-smooth at any point w where some
. Geometrically,
is the equation of a hyperplane
with normal vector
and bias
. The hyperplane
partitions
into two halfspaces
,
such that every point
has
and
has
. The set of hyperplanes
define a hyperplane arrangement H, a subdivision of
into convex cells. See Figure 1. Let
be a cell of the hyperplane arrangement3. Every point w in the interior Int C of C lies on the same side of every hyperplane
as every other point in Int C. Thus we can identify each C with a signature
for any
.
Theorem 5, our main result of this section, fully characterizes the geometry of Equation 12.
Theorem 5. is always a convex function, whose optimal solution(s) always lie in rowspace(X). There are four possible cases, three of which depend on the value of
.
1. If Xw = y is an inconsistent system, then is a strictly convex function.
2. If Xw = y is a consistent system and then
is a convex function. Moreover,
is strictly convex everywhere except along two line segments, both of which have one endpoint at the origin and terminate at
respectively, for some
. The gradient at every point on these line segments is nonzero, and so the optimal solution is found in rowspace(X) at a point of strict convexity.
Figure 1: The top leftmost figure shows a hyperplane arrangement with two lines that subdivides into four convex cells. The top center left figure shows the isocontours of
. Within each convex cell, the isocontours behave as the linear function
, where s is the signature of the cell, and are non-smooth along the two black lines. The top center right figure shows the isocontours of
, which are clearly non-convex. The top rightmost figure shows the isocontours of the function
. Notice how these isocontours are non-smooth along the two black lines and the asymmetry of the isocontours caused by the
term. The bottom row shows the graphs of the functions in the top row.
3. If Xw = y is a consistent system and , then
is a convex function. Moreover,
is strictly convex everywhere except along a single line segment with one endpoint at the origin and the other endpoint at
. The optimal solution(s) may or may not lie along this line.
4. If Xw = y is a consistent system and , then
is a strictly convex function.
The proof of Theorem 5 first characterizes the geometry of restricted to the interior of each convex cell
. We denote this function by
. If the signature s of C is not equal to
, meaning that C does not contain the origin, then
is always strongly convex. The cell C with signature
is the only cell in which
might not be strongly convex. The cases in Theorem 5 correspond to the cases that characterize the geometry of
for
. The transitions between cells are strictly convex and the subdifferential is non-empty at these transitions.
We find it very interesting that is strictly or strongly convex almost everywhere. For
,
is convex, but not strictly convex, only along two line segments which lie outside of the rowspace of X. The gradient along these line segments is nonzero, and so this particular type of convexity does not prevent an optimization algorithm from finding the unique solution in the rowspace of X. For
is convex, but not strictly convex, only along a single line segment in the rowspace of X. However the condition
can be ensured by an infinitesimal perturbation. The following remark is immediate.
Corollary 6. Suppose . Then any optimization algorithm which is guaranteed to converge to the global minimum for a strictly convex subdifferentiable function and which does not prematurely terminate at a point with nonzero gradient finds the unique global minimum of
.
Corollary 6 states that, in the linear case, any reasonable optimization algorithm finds the unique global optimum of , almost always.4 We conclude that, in the linear case,
-adversarial training does indeed sufficiently
regularize the loss landscape so that any optimization algorithm finds the same solution.
Unfortunately Theorem 5 does not give a closed-form expression for the solution(s) of . In Lemma 11 we characterize the subdifferential
at every point. However solving for w where
is similar to solving a linear program, and so we suspect that no closed-form solution exists. In Section 4.3 we discuss the solution for
in the particular case of the learning problem defined in Section 3 and show that the max-margin solution is often not the solution recovered by adversarial training.
4.3 The Solutions to the Learning Problem of Section 3
While we know of no technique to characterize the set of solutions for in general, we can still make some statements about the solution in specific instances, such as the learning problem described in Section 3. First, since the solution(s) of
must lie in the rowspace, the “obvious” solution
to the learning problem in Section 3 is not recovered by
-adversarial training. A priori, one might guess that the minimum
-norm solution
is the solution to
. However this is only true under specific conditions which depend on the class imbalance.
Theorem 7. Let (X, y) be the learning problem defined in Section 3. is a solution to
if and only if
Let c > 0 be a constant such that . If
then
is a solution to
.
While we need the first condition, which is both necessary and sufficient, to draw our forthcoming conclusion, the second, merely sufficient, condition provides greater intuition. The first condition states that is a solution if and only if
is sufficiently small, as a function of
. For the learning problem in Section 3, we know that
is achievable, so it is natural to ask how large an
is allowable by Equation 13. This relationship depends on the class imbalance, and so we set
and derive the condition in the second part of the proof of Theorem 7, which is a lower bound on the right-hand side of Equation 13. The term
is at most 1, when c = 1, but can be arbitrarily less than 1 depending on c; see Figure 2. We note also that the gap between the lower bound
and the right-hand side of Equation 13 is already small for
and vanishes
as . Thus we conclude that if
is close to 1 and the dataset is even moderately imbalanced,
, which maximizes the
-margin, is not a solution for
.
Figure 2: The bound as a function of the class imbalance c. If the classes are perfectly balanced
(c = 1), then we can take up to 1 and recover the minimum
-norm solution
. As the class imbalance increases the maximum allowable
for which we recover
decreases rapidly.
In this section we experimentally explore the effect of different optimization algorithms on robustness for deep networks. We are interested in the following questions. (1) Do different optimization algorithms give qualitatively different robustness results? (2) Does adversarial training reduce or eliminate the influence of the optimization algorithm? (3) Do adaptive or non-adaptive methods consistently outperform the other for both - and
-adversarial attacks, even if the difference is small?
Models For MNIST our model consist of two convolutional layers with 32 and 64 filters respectively, each followed by max pooling. After the two convolutional layers, there are two fully connected layers each with 1024 hidden units. For CIFAR10 we use a ResNet18 model (He et al. (2016)). We use the same model architectures for both natural and adversarial training. These models were chosen because they are small enough for us to run a large hyperparameter search.
Parameters for Adversarial Training For adversarial training we use the approach of Madry et al. (2018), and train against a projected gradient descent (PGD) adversary. For MNIST with -adversarial training, we train against a 40-step PGD adversary with step size 0.01 and maximum perturbation size of
. For
-adversarial training we train against a 40-step PGD adversary with step size 0.05 and maximum perturbation size of
. For CIFAR10 with
-adversarial training, we train against against a 10-step PGD adversary with step size 0.007(= 2/255) and maximum perturbation size of
. For
-adversarial training we train against a 10-step PGD adversary with step size 0.039(= 10/255) and a maximum perturbation size of
.
Attacks for Evaluation On MNIST we apply 100-step PGD with 10 random restarts. For we apply PGD with step sizes {0.01, 0.05, 0.1, 0.2}, and for
we apply PGD with step sizes
. On CIFAR10 we apply 20-step PGD with 5 random restarts. For
we apply PGD with step sizes {2/255, 3/255, 4/255} and for
we apply PGD with step sizes
. We also apply the gradient-free BoundaryAttack++ (Chen et al. (2019)). We evaluate these attacks per sample, meaning that if any attack successfully constructs an adversarial example for a sample x at a specific
, it reduces the robust accuracy of the model at that
.
Metrics We plot the robust classification accuracy for each attack as a function of . We are interested in both natural and adversarial training. Usually when heuristic methods for adversarial training are evaluated, they are compared at the specific
for which the model was adversarially trained. Such a comparison is arbitrary for naturally trained models and is also unsatisfying for adversarially trained models. To compare the robustness of different optimization algorithms we instead consider the area under the robustness curve. Following Khoury and Hadfield-Menell (2019), we report the normalized area under the curve (NAUC) defined as
where measures the classification accuracy. Note that NAUC
with higher values corresponding to more robust models.
Hyperparameter Selection We perform an extensive hyperparameter search over the learning rate (and if applicable momentum) parameter(s) of a each optimization algorithm to identify parameter settings which produce the most robust models. For each dataset we set aside a validation set of size 5000 from the training set. We then train models, with the architecture described above, for each of the hyperparameter settings described below for 100 epochs. All optimization algorithms are started from the same initialization. We evaluate the robustness of each model as described above on the validation set. The hyperparameter settings for each optimization algorithm with achieve the largest NAUC are then used to train new models and then evaluated on the full test set. These final results are the ones we report in this section. The hyperparameters we explored were influenced hyperparamter search of Wilson et al. (2017).
The following search space is defined for MNIST. For SGD we consider the learning rates {2, 1, 0.5, 0.1, 0.01, 0.001, 0.003, 0.0001}. For SGD with momentum we consider the set of learning rates for SGD for each of the momentum settings {0.9, 0.8, 0.7}. For Adam, Adagrad, and RMSprop we consider initial learning rates {0.1, 0.01, 0.001, 0.003, 0.0001}.
The following search space is defined for CIFAR10. For SGD we consider the learning rates {2, 1, 0.5, 0.25, 0.1}. For SGD with momentum we consider the set of learning rates for SGD for each of the momentum settings {0.9, 0.8, 0.7}. For both SGD and SGD with momentum we reduce the learning rate using the REDUCELRONPLATEAU scheduler in PyTorch. For Adam and RMSprop we consider the initial learning rates {0.005, 0.001, 0.0005, 0.0003, 0.0001, 0.00005}. For Adagrad we consider initial learning rates {0.1, 0.05, 0.01, 0.0075, 0.0005}.
Unfortunately adversarially training takes an order of magnitude longer than natural training, since in the innermost loop we must perform an iterative PGD attack to construct adversarial examples. Due to our limited resources, we consider only a subset of the hyperparameters above for adversarial training.
5.1 MNIST
Figure 3 (Top) shows the robustness of naturally trained models to - and
-adversarial attacks on MNIST. Against
-adversarial attacks, RMSprop produce the most robust model with NAUC 0.33, followed by Adam (0.27), SGD with momentum (0.26), and SGD and Adagrad (0.22). Against
-adversarial attacks, SGD produces the most robust model with NAUC 0.49, followed by SGD with momentum (0.48), Adam (0.43), and Adagrad and RMSprop (0.41).
Against -adversarial attacks, RMSprop produces a model that appears qualitatively more robust than the next best performing model. This difference can be large at specific
; for example at
, the RMSprop model maintains a robust accuracy of 22%, while the Adam model has robust accuracy 7%. This is the only instance across all of our experiment in which we observe a notable qualitative difference between different algorithms.
Figure 3 (Bottom) shows the robustness of adversarially trained models. Training adversarially improves the robustness over naturally trained models regardless of the optimization algorithm and all optimization algorithms give qualitatively similar results. For -adversarial training, SGD produces the most robust model with NAUC 0.66, followed by SGD with momentum (0.65), Adam (0.64), RMSprop (0.63), and Adagrad (0.61). For
-adversarial training SGD with momentum and RMSprop produce models with NAUC 0.56, followed by SGD (0.54), Adam (0.53), and Adagrad (0.51). For adversarial training on MNIST, either SGD or SGD with momentum were among the top performers, with Adagrad always producing the worst performing model.
Adversarial training does seem to reduce the dependence on the choice of optimization algorithm, though does not completely remove it. Against -adversarial attacks at
, the SGD model maintains robust accuracy of 91.5%, while the Adagrad model maintains robust accuracy 83.5%. We consider this difference noteworthy for MNIST. The difference is less pronounced for
-adversarial training.
SGD and SGD with momentum consistently outperform other optimization algorithms, yielding the best models in three out of four experiments. (Or one of the most robust models in the case of ties.) While the difference is qualitatively small, we believe that the consistency with which SGD or SGD with momentum produces the most robust model is noteworthy. Furthermore Adagrad seems to consistently under-perform other optimization algorithms. (Though this may depend on the domain (Tifrea et al. (2019))).
5.2 CIFAR10
Figure 4 (Top) shows the robustness of naturally trained models to - and
-adversarial attacks on CIFAR10. Against
-adversarial attacks, SGD with momentum produces the most robust model with NAUC 0.06, followed by RMSprop (0.05), Adam and Adagrad (0.04), and SGD (0.02). Against
-adversarial attacks Adagrad produces the most robust models with NAUC 0.19, followed by SGD, Adam, and RMSprop (0.17), and SGD with momentum (0.15).
Figure 4 (Bottom) shows the robustness of adversarially trained models. For -adversarial training, SGD with momentum produced the most robust model with NAUC 0.26, followed by SGD (0.24), Adam (0.23), and Adagrad and RMSprop (0.22). For
-adversarial training, SGD produces the most robust model with NAUC 0.51, followed by SGD with momentum (0.50), Adam and RMSprop (0.49), and Adagrad (0.47).
Adversarial training lessens the dependence on the choice of optimization algorithm. Against -adversarial attacks at
, the SGD with momentum model maintains robust accuracy of 45%, while the RMSprop model maintains robust accuracy 34%. The difference is less pronounced for
-adversarial training.
SGD or SGD with momentum consistently outperform other optimization algorithms, yielding the best models in three out of four experiments. Again while the difference is qualitatively small, we believe that the consistency with which SGD or SGD with momentum produces the most robust model is noteworthy.
Figure 3: Top: Robust accuracy for naturally trained MNIST models against - and
-adversarial attacks. Bottom: Robust accuracy for adversarially trained MNIST models. Left,
-adversarially trained models evaluated against
-adversarial attacks; right
.
We have presented evidence that adaptive algorithms may produce less robust solutions by giving undue influence to irrelevant features which can be exploited by an adversary. While the geometry of the loss-landscape of adversarial training in the linear regression setting is surprisingly complicated, we are still able to demonstrate the uniqueness of the optimum. In the future, it would be awesome to prove similar results for deep networks.
The Geometry of -Adversarial Training Many of the statements that we proved about
-adversarial training are not true for
-adversarial training. While we believe it’s possible to show that
-adversarial training gives a convex optimization problem using an argument nearly identical to ours, the geometry of the
-objective is much less strongly convex than the
-objective. It is unlikely that the optimal solution is unique or lies in the rowspace, for similar reasons as in Lasso regression. It would be interesting to see what, if any, statements can be made about the solutions to the
-adversarial training objective in the least-squares linear regression case.
Deep Networks An obvious future direction is to extend our theoretical analysis to deep networks. Even the case of deep linear networks is interesting and will likely require new techniques. Consider a deep linear function with l trainable layers. In this case, we can solve the inner maximization problem of Equation 12 for adversarial training, which gives the loss function
We know of no technique for analyzing the geometry of this loss function. Even when is an
-ball and there are only l = 2 trainable layers, the geometry of
is highly non-convex.
Figure 4: Top: Robust accuracy for naturally trained CIFAR10 models against - and
-adversarial attacks. Bottom: Robust accuracy for adversarially trained CIFAR10 models. Left,
-adversarially trained models evaluated against
-adversarial attacks; right
.
We believe that the primary bottleneck for understanding the generalization and robustness properties of solutions found by optimization algorithms for deep networks is an adequate set of tools for reasoning about the geometry of high-dimensional non-convex loss landscapes .Future work should attempt to fully characterize the effect of depth on the curvature of loss landscape using tools from differential geometry.
The author would like to thank Jonathan Shewchuk for providing comments on earlier drafts of this manuscript and Dylan Hadfield-Menell for providing the compute resources necessary for the experiments. This work was partially funded by NSF award CCF-1909235.
Z. Allen-Zhu, Y. Li, and Z. Song. A convergence theory for deep learning via over-parameterization. In ICML, 2019.
S. Bubeck, Y. T. Lee, E. Price, and I. P. Razenshteyn. Adversarial examples from computational constraints. In ICML, 2019.
J. Chen, M. I. Jordan, and M. J. Wainwright. Hopskipjumpattack: Query-efficient decision-based adversarial attack. CoRR, abs/1904.02144, 2019.
F. Codevilla, M. M¨uller, A. Dosovitskiy, A. L´opez, and V. Koltun. End-to-end driving via conditional imitation learning. In ICRA, 2018.
A. Degwekar, P. Nakkiran, and V. Vaikuntanathan. Computational limitations in robust classification and win-win results. In COLT, 2019.
S. S. Du, J. D. Lee, H. Li, L. Wang, and X. Zhai. Gradient descent finds global minima of deep neural networks. In ICML, 2019.
J. C. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and stochastic optimiza- tion. Journal of Machine Learning Research, 2011.
A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun. Dermatologist-level classifi- cation of skin cancer with deep neural networks. Nature, 2017.
J. Gilmer, L. Metz, F. Faghri, S. S. Schoenholz, M. Raghu, M. Wattenberg, and I. J. Goodfellow. Adversarial spheres. CoRR, abs/1801.02774, 2018. URL http://arxiv.org/abs/1801.02774.
I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In ICLR, 2014.
S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Implicit bias of gradient descent on linear convolutional networks. In NeurIPS, 2018a.
S. Gunasekar, J. D. Lee, D. Soudry, and N. Srebro. Characterizing implicit bias in terms of optimization geometry. In ICML, 2018b.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
R. Impagliazzo. A personal view of average-case complexity. In Proceedings of the Tenth Annual Structure in Complexity Theory Conference, 1995.
A. Karparthy. A peak at trends in machine learning. https://medium.com/@karpathy/ a-peek-at-trends-in-machine-learning-ab8a1085a106, 2017.
M. Khoury and D. Hadfield-Menell. On the geometry of adversarial examples. CoRR, abs/1811.00525, 2018.
M. Khoury and D. Hadfield-Menell. Adversarial training with voronoi constraints. CoRR, abs/1905.01019, 2019.
D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adver- sarial attacks. In ICLR, 2018.
M. Mirman, T. Gehr, and M. T. Vechev. Differentiable abstract interpretation for provably robust neural networks. In ICML, 2018.
K. Nar, O. Ocal, S. S. Sastry, and K. Ramchandran. Cross-entropy loss and low-rank features have responsibility for adversarial examples. CoRR, abs/1901.08360, 2019.
A. Raghunathan, J. Steinhardt, and P. Liang. Certified defenses against adversarial examples. In ICLR, 2018.
L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization requires more data. In NIPS, 2018.
A. Shafahi, W. R. Huang, C. Studer, S. Feizi, and T. Goldstein. Are adversarial examples inevitable? In ICLR, 2019.
A. Sinha, H. Namkoong, and J. Duchi. Certifying some distributional robustness with principled adversarial training. In ICLR, 2018.
C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013. URL http://arxiv.org/abs/1312.6199.
T. Tieleman and G. Hinton. RMSprop: Divide the gradient by a running average of its recent magnitude, 2012.
A. Tifrea, G. B´ecigneul, and O. Ganea. Poincare glove: Hyperbolic word embeddings. In ICLR, 2019.
A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht. The marginal value of adaptive gradient methods in machine learning. In NIPS, 2017.
E. Wong and J. Z. Kolter. Provable defenses against adversarial examples via the convex outer adversarial poly- tope. In ICML, 2018.
H. Zhang, Y. Yu, J. Jiao, E. P. Xing, L. E. Ghaoui, and M. I. Jordan. Theoretically principled trade-off between robustness and accuracy. In ICML, 2019.
A.1 Proof of Theorem 1
Proof. Let be an adversarial perturbation and let
be a test sample. Then
The second last equality follows from the fact that is correctly classified by
, and so
. Notice that to flip the sign of the classifier using the smallest
-perturbation, it is optimal to distribute the magnitude of the perturbation equally to each
, where the signs of each
are
for
and +1 for
. It follows that to flip the sign of the classifier requires
To find the smallest -perturbation we must instead solve the following constrained optimization problem
where is the squared-radius of the smallest
-ball that crosses the decision boundary. The La- grangian for this problem is
The partial derivatives are given by
Setting the first set of partial derivatives to 0 gives
which can then be used to solve the last equation yielding
Substituting the expression for back into Equation 16 gives
Then the minimum -perturbation R is
A.2 Proof of Theorem 3
In this case, correctly classifies
so long as
, which is true for any
. Thus
correctly classifies every test example so long as there at least one training example from each class.
We will also be interested in the signs of the individual terms in . Note that
0 for any
, and so
is positive. Lastly
so long as
, and so
if and only if
. For convenience we will assume that
from here onward which will allow us to consider fewer cases.
Proof. Let be an adversarial perturbation and let
be a test sample. Then
and
The constraint C < 0 is equivalent to . Clearly we can ensure the sign of
is negative by choosing each
opposite in sign to the term by which it is multiplied in C. Note that, by our assumptions on
and
. Thus we choose
for all
and
otherwise. Furthermore the optimal solution sets each
to the same magnitude, and so to change the sign the perturbation
must be at least
Now suppose that . In this case we need
, (equivalently C > 0). Note that in this case
is negative, and so we choose the signs of each
to match the signs of the terms by which
is multiplied. We choose
and
otherwise. Thus to change the sign the perturbation
must be at least
To find the smallest -perturbation, in the case where
, we must instead solve the following constrained optimization problem
where is the squared-radius of the smallest
-ball that crosses the decision boundary. The La- grangian for this problem is
The partial derivatives are given by
Setting the first set of partial derivatives to 0 gives
which can then be used to solve the last equation yielding
Substituting the expression for back into Equation 19 and solving for R gives
The case with is similar, but with the constraint
, which yields a similar solution for
, except that the numerator is
. Subsequently
A.3 Proof of Theorem 5
The proof of Theorem 5 is the combination of the following lemmas.
Lemma 8. Let be any vector and let
be the orthogonal projection of w onto rowspace(X). Then, for the objective function
we have that , with equality if and only if
. Hence for any optimal solution
of
.
Proof. Let be any vector
where
and
.
with equality if and only if . The third equality follows from the fact that
is in nullspace(X), the fourth from the fact that
. This proves the first statement. The second statement regarding
follows immediately.
Lemma 9. Let be a convex cell with signature s. The restriction of
to the interior of C, denoted
, is a convex function. Furthermore, if
then
is a strongly convex function.
Suppose that , meaning that C contains the origin. There are four possible cases, three of which depend on the value of
.
1. If Xw = y is an inconsistent system, then is a strongly convex function.
2. If Xw = y is a consistent system and then
is a convex function. Specifically,
is convex but not strongly convex along two line segments both of which have one endpoint at the origin and terminate at
for some
respectively. The gradient at every point on these line segments is nonzero, and so the optimal solution is found in the rowspace(X) at a point of strong convexity.
3. If Xw = y is a consistent system and , then
is a convex function. Specifically
is convex but not strongly convex along a single line segment with one endpoint at the origin and the other endpoint at
. The optimal solution may lie along this line.
4. If Xw = y is a consistent system and , then
is a strongly convex function.
Proof. Let C be any cell in the hyperplane arrangement induced by and let
denote the signature of C. We will show that the Hessian matrix within C is positive semi-definite. The Hessian matrix H(w) at a point
is
This form of the Hessian comes from twice differentiating Equation 10 and is equivalent to twice differentiating Equation 11. Note that it is crucial in the third term that the sign function is always and not defined as 0 when the input is 0; this is from where the factor of n is derived. At a high level, we examine the curvature induced by H(w) in each unit direction
at w and show that it is everywhere non-negative. It is worth taking a moment to examine how each term of H(w) affects the curvature of the objective at w.
The term is a positive semi-definite matrix and induces a quadratic form with positive curvature in each eigen-direction whose corresponding eigenvalue is positive, and zero curvature in every eigen-direction corresponding to a zero eigenvalue.
The term is a sum of outer-product matrices. Note that this matrix is symmetric, since
. This matrix has a
-dimensional nullspace, corresponding to the intersection
. On the 2-dimensional subspace spanned by
, and with respect to that basis, the outer-product has the matrix
The eigenvalues, within this subspace, are
By triangle inequality, one of these eigenvalues is always positive while the other is always negative. Thus there is one direction of positive curvature and one direction of negative curvature. The eigenvectors are
curvature induced by the term which induces positive curvature in every direction with eigenvalues all equal to
. The result of the sum of these two terms is a quadratic form which induces 0 curvature in the direction w and positive curvature in every direction orthogonal to w with eigenvalue
. Note that the value
is positive by definition, since w is in the convex cell with signature s, and so
Let be a unit vector. The curvature in the direction v is proportional (with positive constant of proportionality) to
where and
. It’s easy to see that term 2 is always greater than or equal to 0, since
, with equality when
. By the quadratic formula, term 1 is also always greater than or equal to 0, with equality when
and
; otherwise the zeros given by the quadratic formula have an imaginary component that depends on
. Thus, at this point, we see that H(w) is at least positive semi-definite in Int C.
We wish to derive under which conditions this inequality is strict, implying that H(w) is positive-definite in C. First we will show that if w is in a cell of the hyperplane arrangement whose signature is , then H(w) is positive definite. The conditions which must be true for
imply that
must be in the cell that contains the origin.
For we need both term 1 and term 2 to be equal to 0. Term 2 is equal to 0 if and only if
, which implies that
. Since v is a unit vector, we have
. For term 1 to be equal to 0 we need
and
. The first of these two conditions implies that
. Suppose that
; then
for some
. So we have that
Now, , which implies that
. If
, then
which implies that . The case where
is similar, as is the case where
. All together, we have that
.
Thus the necessary (not sufficient) conditions for can only be satisfied if
. If
then H(w) is defined and positive-definite everywhere in Int C.
We now turn our attention toward a necessary condition, which when combined with our other necessary conditions, give a set of sufficient conditions for H(w) to be positive semi-definite but not positive-definite. Suppose that and
. By the above discussion
. Suppose
and, thus,
. Under these conditions we have that
The relationship
differently. For the equation has two solutions
in the nullspace(X) with
. Since
this ray of 0 curvature lies outside of rowspace(X), and, by Lemma 8, the gradient cannot be 0 along this ray. For
, the solution is given by u = 0 and so there is a single ray in the direction of
in the rowspace(X). This ray is parameterized by
for
. The gradient may or may not be zero along this ray. Finally for
there is no solution to the relationship and
is strongly convex within C with signature
.
Before concluding we must address the fact that the Hessian H is not defined at w = 0. Let be the set of
closed orthants of
. We further subdivide C with signature
as
. Within the relative interiors of each
is twice differentiable everywhere with Hessian as described above. Thus
is convex for all i.
Let and
. Then the subdifferential
is nonempty and, in particular, contains the gradient
, which is defined at w since
and
. The intersection
for
, since
is actually differentiable at w.Now let
and
for
and such that the line segment
does not intersect the origin in its relative interior. Further suppose that
and
are adjacent along the line segment
, meaning that there is a single point
at which the line segment
leaves
and enters
. Note that
are
collinear. Then
The second equality follows from collinearity and the first inequality follows from convexity of from which we can derive
. The remaining steps are straightforward. This argument can be extended to a line segment
for w and
in two cells that only intersect at the origin in a straightforward manner using induction. Thus it follows that
is convex along
. All that remains is the case where
intersects the origin.
Suppose that , parameterized by
for
, intersects the origin. Choose any unit vector v such that v is not parallel to
. Then consider the perturbed line segment
for
. Let
be such that
. As
,
and, in particular,
. Since v is not parallel with
,
does not intersect the origin for
, and so
. Taking
convexity follows from continuity of
. Note that this approach only applies when
; however the d = 1 for
case is straightforward.
Lemma 10. is a convex function. If
for C with signature
is a strongly convex function, then
is a strictly convex function. Furthermore transitions between two cells are strictly convex.
Proof. Let be any two points. The line segment
with endpoints w and
is parameterized by
for
. If
for some C then Lemma 9 gives the results. Suppose that
and
are in distinct cells of the hyperplane arrangement and that
intersect the boundaries of these cells at
. This partitions the interval [0, 1] into m + 1 subintervals
, in each of which the function
is convex along
by Lemma 9.
Consider the base case where m = 1. The point , where the line segment
leaves C and enters
. The facet
is a
-dimensional facet, where k is the number of hyperplanes that intersect at
. Said differently, at
the signs of k hyperplane equations
flip.Imagine removing these k hyperplanes, then w and
lie in the same cell of the induced hyperplane arrangement, and, by Lemma 9, the objective function
with these k hyperplanes removed is convex. (Simply repeat the argument for
samples.) Thus we have
The second equality follows from the crucial fact that, at , each hyperplane constraint
. The first inequality follows from the convexity of
and
. The second inequality follows from adding strictly positive terms. The final equality follows by definition. With this fact we are ready to show the convexity of
along the entire segment
.
The first inequality follows from the fact that is convex along each sub-segment. The functions
are the reparameterization functions defined as
. The second inequality follows from the statement we proved about
. The final equality follows from the definitions of
. Thus
is convex along the line segment
when m = 1.
Repeating the argument inductively gives that L is convex along for any m. We have proven that the transitions between cells are strictly convex. When
restricted to each cell C is strongly convex, then the whole function
is strictly convex, otherwise
is convex.
Lemma 11. is subdifferentiable everywhere. Let
. If
for some
and
, then
is differentiable at w with
If w = 0, then the subdifferential is parameterized by replacing
in Equation 20 with any g such that
.
Otherwise , meaning that
for some
-dimensional face f of C. Let
be the k indices for which
). The subdifferential
is non-empty and is parameterized by every setting of
in Equation 20.
Proof. By Lemma 10, the epigraph of is a convex set. The Separating Hyperplane Theorem implies the existence of a supporting hyperplane at every point
. If
for some cell C in the hyperplane arrangement, then
is differentiable at w and there is a single supporting hyperplane at
. Otherwise, w is on the boundary
of some C, and the existence of a supporting hyperplane implies the existence of a subgradient of
at w.
The gradient of , where defined, is
Suppose that w = 0. Let be a unit vector and
sufficiently small. Then convexity of
for C with signature
(Lemma 9) and a standard limit argument gives
and so g induces a subgradient at w = 0 as well.
To find the subdifferential for
, we consider
for some
-dimensional facet f of C, and proceed by induction over k.
In the base case, k = 1. Since f is -dimensional, there is only one tight hyerplane equation
at w. Let
denote the hyperplane and let
denote the halfspaces in which
respectively. The limit of the gradient as approach w by a sequence in
is
where
; similarly approaching w by a sequence in
gives
where
. These vectors define two supporting hyperplanes of the epigraph at
.
Note that only and
in
depend upon s, and when
is identical regardless of the setting of
, so we need only consider
. Let
, then
where . Then we can express
, where
, as a convex combination of the terms
.
where the inequality follows from the fact that are subgradients. Thus
is a subgradient for any
at w.
Now suppose that is a
-dimensional facet and the statement holds for all
. Let
index the hyperplane equations
at w. Consider the subset of hyperplane equations
along which subgradients exist for any setting of
by the inductive hypothesis. An identical limit argument as above implies the existence of two subgradients at w with
. Then an identical calculation to those above imply that
is a subgradient for any
. Thus at
there exists a subdifferential parameterized by
for every
.
Proof of Theorem 5. Lemma 10 states that is convex and that transitions between cells are strictly convex. The cases in the theorem statement correspond to the cases in Lemma 9 which describe the geometry of the cell containing the origin. Finally Lemma 8 states that any optimal solution must be in the rowspace of X and Lemma 11 states that
is subdifferentiable everywhere.
A.4 Proof of Theorem 7
Proof. The gradient at the minimum norm solution
is
Setting the gives
Lemma 11 states that, at , there exists a subgradient for every setting of
. To prove the result we must show that there exists some setting of s that satisfies Equation 21, which we will do by showing that, under the conditions on
, the coefficient of each
on the right hand side of Equation 21 is in the range
.
Since the negative sign on the left hand side of Equation 21 is inconsequential. It is sufficient to show that
and
are in the range
. Necessity follows from the fact that the rows of X are linearly independent.
where the last inequality follows from the condition on . Note also that
by definition. The case for
is similar.
The maximum evaluates as
Within each of these ranges it can be checked, using Mathematica, that the gradient of Equation 22 is negative. Thus we can consider the limit as , which gives the lower bound
Taking is a sufficient condition, but not necessary due to the gap in the lower bound.
Wilson et al. (2017) derive the minimum norm solution using the kernel trick. The optimal solutions where
for
. They compute
and positing, correctly, that if
and
if
they derive the system of equations
which gives
Wilson et al. (2017) mistakenly dropped the negative in . Unfortunately there is an additional minor mistake in the linear system. The system is derived by computing
Subtracting equations we reach the conclusion that if
and
if
. Then it’s clear that there are really only two equations in this system
which gives