Improved Adversarial Robustness via Logit Regularization Methods

2019·arXiv

Abstract

1. Introduction

Neural networks, despite their high performance on a variety of tasks, can be brittle. Given data intentionally chosen to trick them, many deep learning models suffer extremely low performance. This type of data, commonly referred to as adversarial examples, represents a security threat to any machine learning system where an attacker has the ability to choose data input to a model, potentially allowing the attacker to control a model’s behavior.

At the same time, applications of computer vision are pervasive, with future applications including autonomous driving and medical diagnostics. While these use cases of vision are exciting in their potential for societal good, they also have the potential to be grave threats when behaving erroneously, with undesired behavior able to cause harm to both their users and creators. It is therefore of critical importance to understand how to defend against such adversarial attacks, both to prevent these systems from failing and to prevent malicious actors from exploiting any vulnerabilities they may have. Though challenging, this is nonetheless an urgent need, as it is already known that attacks targeting systems for autonomous driving and medical diagnostics are possible [6, 7].

Today, adversarial examples are typically created by small, carefully chosen transformations of data on which models otherwise perform well. While this is primarily due to the ease of experimentation with existing datasets [8], the full threat of adversarial examples is limited only by the ability and creativity of an attacker’s example generation process – for example, even relatively basic research has shown the potential for adversarial attacks in the physical world [15], with more attacks being found on a regular basis.

Even with the limited threat models considered in current research, performance on adversarially chosen examples can be dramatically worse than on unperturbed data. One canonical example is the CIFAR-10 image classification task [14], where white-box accuracy on adversarially chosen examples is lower than 50%, even for the most robust defenses known today [16, 13], while unperturbed accuracy can be as high as 98.5% [3], a more than 30-fold difference in misclassification rate (over 50% versus 1.5%). On larger tasks, such as ImageNet [20], the difference is even bigger, as no model is known to be robust to any but the weakest of all adversarial attacks.

Current defenses against adversarial examples generally come in one of a few flavors. Perhaps the most common approach is to generate adversarial examples as part of the training procedure and explicitly train on them, known as “adversarial training”. Another approach is to transform the model’s input representation in a way that thwarts an attacker’s adversarial example construction mechanism. While these methods can be effective, care must be taken to make sure that they are not merely obfuscating gradients [2]. Last, generative models can be built to model the original data distribution, recognizing when the input data is out of sample and potentially correcting it [23, 21]. Of these, arguably the most robust defenses today follow the adversarial training paradigm, of which adversarial logit pairing [13] is the most recent incarnation, extending the adversarial training work of Madry et al. [16] by incorporating an additional term to make the logits (pre-softmax values) of an unperturbed and adversarial example more similar.

In this work, we show that adversarial logit pairing derives a large fraction of its benefits from regularizing the model’s logits toward zero, which we demonstrate through simple and easy to understand theoretical arguments in addition to empirical demonstration. Investigating this phenomenon further, we examine two alternatives for logit regularization, finding that both result in improved robustness to adversarial examples, sometimes surprisingly so – for example, using the right amount of label smoothing [25] can result in greater than 40% robustness to a 10-step projected gradient descent (PGD) attack [16] on CIFAR-10 while training only on the original, unperturbed training examples, and is also a compelling black-box defense. We then present an alternative formulation of adversarial logit pairing that separates the logit pairing and logit regularization effects, improving the defense. The end result of these investigations is a defense that outperforms state-of-the-art approaches for PGD-based adversaries on CIFAR-10 for both white-box and black-box attacks, while requiring little to no computational overhead on top of adversarial training.

2. Overview of Adversarial Training

Before proceeding with our analysis, we review existing work on adversarial training for context. While adversarial examples have been examined in the machine learning community in some capacity for many years [4], their study has drawn a sharp focus in the current renaissance of deep learning, starting with Szegedy et al. [26] and Goodfellow et al. [9], particularly in the context of computer vision. In Goodfellow et al. [9], adversarial training is presented as training with a weighted loss between an original and adversarial example, i.e. with a loss of

$$\tilde{J}(\theta) = \frac{1}{m} \sum_{i=1}^{m} \left[ \alpha\, J(x^{(i)}, y^{(i)}; \theta) + (1 - \alpha)\, J(g(x^{(i)}), y^{(i)}; \theta) \right] \qquad (1)$$

where $g(x)$ is a function representing the adversarial example generation process, originally presented as $g(x) = x + \epsilon\, \mathrm{sign}(\nabla_x J(x, y; \theta))$, $\alpha$ is a weighting term between the original and adversarial examples typically set to 0.5, $\theta$ are the model parameters to learn, $J$ is a cross-entropy loss, $m$ is the dataset size, $x^{(i)}$ is the ith input example, and $y^{(i)}$ is its label. Due to the use of a single signed gradient with respect to the input example, this method was termed the “fast gradient sign method” (FGSM), requiring a single additional forward and backward pass of the network to create. Kurakin et al. [15] extended FGSM into a multistep attack, iteratively adjusting the perturbation applied to the input example through several rounds of FGSM. This was also the first attack that could be described as a variant of projected gradient descent (PGD), where the adversarial perturbation is initialized to zero. Both of these approaches primarily target an $L^\infty$ threat model, where the $L^\infty$ norm between the original and adversarial example is constrained to a small value $\epsilon$. By keeping the norm small, it is assumed that the adversarial example will have the same correct label as the original example, i.e. that the perturbation is small enough to still be easily recognizable as the original category, an assumption that allows for research in the field without requiring the manual annotation of every new adversarial example.
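The weighted loss and the single signed-gradient step described above can be illustrated with a minimal sketch. It assumes a toy linear binary classifier with labels in {+1, -1} so that the input gradient has a closed form; the function names (`logistic_loss`, `fgsm`, `adversarial_training_loss`) and the toy data are hypothetical and are not taken from the paper's implementation.

```python
import math

def logistic_loss(w, x, y):
    """Cross-entropy loss of a linear binary classifier; y is +1 or -1."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    return math.log(1.0 + math.exp(-margin))

def input_gradient(w, x, y):
    """Gradient of logistic_loss with respect to the input x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y / (1.0 + math.exp(margin))  # derivative of the loss w.r.t. w.x
    return [scale * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def fgsm(w, x, y, eps):
    """One signed-gradient step of size eps (the fast gradient sign method)."""
    grad = input_gradient(w, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, grad)]

def adversarial_training_loss(w, data, eps, alpha=0.5):
    """Dataset average of alpha * J(x, y) + (1 - alpha) * J(g(x), y)."""
    total = 0.0
    for x, y in data:
        total += alpha * logistic_loss(w, x, y)
        total += (1.0 - alpha) * logistic_loss(w, fgsm(w, x, y, eps), y)
    return total / len(data)

# Toy, made-up data purely for illustration.
w = [1.0, -2.0]
data = [([0.5, -0.25], 1), ([-0.3, 0.4], -1)]
clean = adversarial_training_loss(w, data, eps=0.1, alpha=1.0)
mixed = adversarial_training_loss(w, data, eps=0.1, alpha=0.5)
print(clean, mixed)
```

With `alpha=1.0` only the original examples contribute, so the mixed loss is larger whenever the signed step raises the loss on the perturbed inputs.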

Madry et al. [16] built upon these works by initializing the search process for the adversarial perturbation randomly, yielding an attack that is among the strongest currently available. Although only a slight modification of Kurakin et al. [15], this detail is critical – with a zero initialization it is easy to become robust only at existing training points, thus causing a “gradient masking” effect [2]. Through extensive experiments, they showed that even performing PGD with a single random initialization is able to approximate the strongest adversary found with current first-order methods, and that adversarial training with this attack resulted in the most robust model yet. However, as with multistep FGSM, performing adversarial training with this approach can be rather expensive, taking an order of magnitude longer than standard training. Specifically, PGD-based adversarial training requires N + 1 forward and backward passes of the model, where N is the number of PGD iterations, typically on the order of 5 to 20 [16].
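As a rough illustration of the randomly initialized PGD attack discussed above, the following sketch uses the same kind of toy linear classifier with a closed-form input gradient. The names are hypothetical, clipping to a valid pixel range is omitted, and a real attack would obtain the gradient from a backward pass through the network (one per iteration, which is where the N additional passes come from).

```python
import math
import random

def input_gradient(w, x, y):
    """Gradient of the logistic loss log(1 + exp(-y * w.x)) with respect to x."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    scale = -y / (1.0 + math.exp(margin))
    return [scale * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def pgd_attack(w, x, y, eps, step_size, num_steps, rng):
    """Signed-gradient ascent on the loss, projected onto an L-infinity ball."""
    # Random start inside the ball of radius eps around x.
    adv = [xi + rng.uniform(-eps, eps) for xi in x]
    for _ in range(num_steps):
        grad = input_gradient(w, adv, y)
        adv = [ai + step_size * sign(gi) for ai, gi in zip(adv, grad)]
        # Projection: clip each coordinate back into [x - eps, x + eps].
        adv = [min(max(ai, xi - eps), xi + eps) for ai, xi in zip(adv, x)]
    return adv

rng = random.Random(0)
w, x, y = [1.0, -2.0], [0.5, -0.25], 1
adv = pgd_attack(w, x, y, eps=0.03, step_size=0.0078, num_steps=10, rng=rng)
print(adv)
```

For a linear model the loss is maximized at a corner of the ball, so the random start only matters for non-linear models, where it changes which local maximum the search reaches.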

Improving on PGD-based adversarial training, Kannan et al. [13] introduced adversarial logit pairing (ALP), which adds a term to the adversarial training loss function that encourages the model to have similar logits for original and adversarial examples:

$$\tilde{J}(\theta) + \lambda \frac{1}{m} \sum_{i=1}^{m} L\left( f(x^{(i)}; \theta),\, f(g(x^{(i)}); \theta) \right)$$

where $\tilde{J}(\theta)$ is the adversarial training loss, $\lambda$ is the pairing coefficient, $L$ was set to an $L^2$ loss, and $f(x; \theta)$ returns the logits of the model corresponding to example $x$. Adversarial logit pairing has the motivation of increasing the amount of structure given to the model in the learning process by encouraging the model to have similar prediction patterns on the original and adversarial examples, a process reminiscent of distillation [11].
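A minimal sketch of how the pairing term combines with an already computed adversarial training loss is given below. It assumes a squared L2 distance between logit vectors averaged over the batch; the function names and the default `pairing_weight` are illustrative, not the reference implementation of [13].

```python
def squared_l2(a, b):
    """Squared L2 distance between two logit vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def alp_loss(adv_training_loss, clean_logits, adv_logits, pairing_weight=0.5):
    """Adversarial training loss plus a weighted logit pairing term.

    clean_logits and adv_logits are lists of per-example logit vectors.
    """
    pairing = sum(
        squared_l2(z, z_adv) for z, z_adv in zip(clean_logits, adv_logits)
    ) / len(clean_logits)
    return adv_training_loss + pairing_weight * pairing

# One example whose adversarial logits differ from the clean ones.
loss = alp_loss(1.0, [[1.0, 0.0]], [[0.0, 0.0]], pairing_weight=0.5)
print(loss)
```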

Kannan et al. [13] also studied a baseline version of ALP, called “clean logit pairing”, which paired randomly chosen unperturbed examples together. Surprisingly, this worked reasonably well, inspiring them to experiment with a similar idea they call “clean logit squeezing”, regularizing the norm of the model’s logits, which worked even more effectively, though this idea itself was not combined with adversarial training. It is this aspect of the work that is most related to what we study in this paper.

Last, it is worth noting work examining the reproducibility of adversarial logit pairing [5]. While it is true that ALP was found to not actually be robust on ImageNet to multistep white-box attacks with a large number of iterations, continuing the trend of no models being robust on ImageNet, the improved robustness of ALP on smaller datasets that are more commonly used was not refuted, a fact which we also find, even with attacks of up to 1,000 steps. Thus, we believe that ALP shows some promise in advancing our understanding and effectiveness in defending against adversarial examples.

3. Adversarial Logit Pairing and Logit Regularization

We now show how adversarial logit pairing [13] acts as a logit regularizer. For notational convenience, denote $z_c^{(i)}$ as the logit of the model for class $c$ on example $i$ in its original, unperturbed form, and $\tilde{z}_c^{(i)}$ as the logit for the corresponding adversarial example. The logit pairing term in adversarial logit pairing is a simple $L^2$ loss:

$$L^{(i)} = \sum_{c} \left( z_c^{(i)} - \tilde{z}_c^{(i)} \right)^2$$

While it is obvious that minimizing this term will have the effect of making the original and adversarial logits more similar in some capacity, what precise effect does it have on the model during training? To examine the effect of such a loss in gradient-based training, the dominant training paradigm for almost all computer vision models today, we can look at the gradient of this loss term with respect to the logits themselves:

$$\frac{\partial L^{(i)}}{\partial z_c^{(i)}} = 2 \left( z_c^{(i)} - \tilde{z}_c^{(i)} \right), \qquad \frac{\partial L^{(i)}}{\partial \tilde{z}_c^{(i)}} = -2 \left( z_c^{(i)} - \tilde{z}_c^{(i)} \right)$$

Under the assumption that the adversarial example moves the model’s predictions away from the correct label (as should be the case with any reasonable adversarial example, such as an untargeted PGD-based attack), we will have that $z_c^{(i)} > \tilde{z}_c^{(i)}$ when $c$ is the correct category, and $z_c^{(i)} < \tilde{z}_c^{(i)}$ otherwise. Keeping in mind that model updates move in the direction opposite of the gradient, the update to the model’s weights will attempt to make the original logits smaller and the adversarial logits larger when $z_c^{(i)} > \tilde{z}_c^{(i)}$, and will otherwise attempt to make the original logits larger and the adversarial logits smaller.

However, it is not sufficient to examine this in isolation, as logit pairing is only one component of a typical adversarial loss: it must be considered in the context of the adversarial training loss. In particular, the cross-entropy loss applied to the adversarial example already encourages the adversarial logits to be higher for the correct category and smaller for all incorrect categories, and furthermore the scale of that loss is typically an order of magnitude larger than the adversarial pairing loss. Thus, we argue that the main effect of adversarial logit pairing is actually in the remaining two types of updates, encouraging the logits of the original example to be smaller for the correct category and larger for all incorrect categories – an effect which is essentially regularizing model logits in a manner similar to “logit squeezing” [13] or label smoothing [25].
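The sign argument above can be checked numerically. The sketch below assumes a squared-difference pairing loss and applies one gradient step directly to the logits (a simplification; in training the step is taken on the weights). The logit values are made up, with class 0 treated as the correct class.

```python
def pairing_gradients(z, z_adv):
    """Gradients of sum_c (z_c - z_adv_c)^2 w.r.t. clean and adversarial logits."""
    d_clean = [2.0 * (a - b) for a, b in zip(z, z_adv)]
    d_adv = [-g for g in d_clean]
    return d_clean, d_adv

# Hypothetical logits: the attack lowers the correct-class logit (index 0)
# and raises the incorrect ones.
z = [4.0, 1.0, -1.0]
z_adv = [2.0, 2.5, 0.0]

d_clean, d_adv = pairing_gradients(z, z_adv)
learning_rate = 0.1
new_z = [a - learning_rate * g for a, g in zip(z, d_clean)]
print(d_clean)  # positive for the correct class, negative for the others
print(new_z)    # correct-class logit shrinks, incorrect-class logits grow
```

The step on the original logits shrinks the gap between the correct and incorrect classes, which is the regularizing effect the text attributes to the pairing term.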

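Examining this further, we can also take a different perspective by explicitly incorporating the scale of the logits in the logit pairing term. If we factor out a shared scale factor $\gamma$ from each logit, writing the original and adversarial logits as $z_c^{(i)} = \gamma \hat{z}_c^{(i)}$ and $\tilde{z}_c^{(i)} = \gamma \hat{\tilde{z}}_c^{(i)}$, the logit pairing term has the form

$$L^{(i)} = \gamma^2 \sum_{c} \left( \hat{z}_c^{(i)} - \hat{\tilde{z}}_c^{(i)} \right)^2$$

implying that

$$\frac{\partial L^{(i)}}{\partial \gamma} = 2 \gamma \sum_{c} \left( \hat{z}_c^{(i)} - \hat{\tilde{z}}_c^{(i)} \right)^2$$

Since $\sum_{c} ( \hat{z}_c^{(i)} - \hat{\tilde{z}}_c^{(i)} )^2$ is non-negative, this means that the model will always attempt to update the scale of the logits in the opposite direction of its sign, which is necessarily an update moving toward zero so long as the logits were not identical – in fact, if this were the only term in the loss, then it is easy to see that $\gamma = 0$ would be a global minimizer of the loss. However, in practice this effect is partially counterbalanced by the adversarial training term, which requires that logits across different categories be different in order to minimize its cross-entropy loss.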
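The claim about the scale of the logits can be verified with a short numerical check. The sketch assumes a squared-difference pairing loss on logits that share a scale factor `gamma`; the unit-scale logit values are hypothetical.

```python
def scaled_pairing_loss(gamma, z_hat, z_adv_hat):
    """Pairing loss when both logit vectors share the scale factor gamma."""
    return sum((gamma * a - gamma * b) ** 2 for a, b in zip(z_hat, z_adv_hat))

def scale_gradient(gamma, z_hat, z_adv_hat):
    """Analytic derivative of scaled_pairing_loss with respect to gamma."""
    return 2.0 * gamma * sum((a - b) ** 2 for a, b in zip(z_hat, z_adv_hat))

z_hat = [1.0, 0.25, -0.5]
z_adv_hat = [0.5, 0.75, 0.0]

for gamma in (-3.0, 0.5, 4.0):
    h = 1e-5
    numeric = (
        scaled_pairing_loss(gamma + h, z_hat, z_adv_hat)
        - scaled_pairing_loss(gamma - h, z_hat, z_adv_hat)
    ) / (2.0 * h)
    print(gamma, scale_gradient(gamma, z_hat, z_adv_hat), numeric)
```

The gradient always has the same sign as `gamma`, so a descent step moves the scale toward zero, where the pairing loss reaches its minimum of zero.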

Given this interpretation, in this work we now explore four key questions: 1) Experimental verification of our analysis. In practice, how much of a logit regularization effect does ALP have? 2) Do other forms of logit regularization have similar effects on adversarial robustness? If so, then an entire family of methods for improving adversarial robustness will have been found. 3) Is it possible to decouple adversarial logit pairing explicitly into a form where the effect of logit regularization and pairing can be disentangled? 4) Finally, using these insights, can we discover even more robust models?

3.1. Experimental Evidence

Perhaps the most straightforward way to test our hypothesis is to examine the logits of a model trained with ALP vs one trained with standard adversarial training. If our hypothesis is true, then the model trained with ALP will have logits that are generally smaller in magnitude. We present the results of this experiment in Figure 1(left), using an 18-layer ResNet [10] classifier trained on CIFAR-10 [14] as our experimental testbed (see Sec. 5.1 for details).

As shown in Figure 1, it is indeed the case that the logits for a model trained with ALP are of smaller magnitude than those of a model trained with PGD, with a variance reduction of the logits from 8.31 to 4.02 on clean test data. Though this provides evidence that ALP does have the effect of regularizing logits, this data alone is not sufficient to determine if logit regularization is a key mechanism in ALP’s improved adversarial robustness.

To answer this, we examine if standard adversarial training can be improved by explicitly regularizing the logits. If adversarial robustness can be improved, but similar improvements cannot be made to ALP, then at least some of the benefits of ALP can be attributed strictly to logit regularization. We present the results of this experiment in Figure 1(right), implemented using the “logit squeezing” form of regularization ($L^2$-regularization on the logits).

As shown, we find that incorporating regularization on model logits is able to recover slightly more than half of the total improvement from logit pairing, with a unimodal distribution – too little regularization has only a small effect, and too much regularization approaches the point of being harmful. In contrast, when added to a model already trained with ALP, regularizing the logits does not lead to any improvement at all, and in fact hurts performance at all levels of regularization strength, likely due to the combination of explicit logit regularization and the implicit regularization happening from ALP overpowering the cross-entropy loss of adversarial training. This evidence makes clear that one of the key improvements from logit pairing is due to a logit regularization effect.

We would like to emphasize that these results are not meant to diminish ALP in any sense – our goals are to investigate the mechanism by which it works and explore if it can be generalized or improved. Thus, given these results, it is worth examining other methods that have an effect of regularizing logits in order to tell whether this is a more general phenomenon.

Figure 1. Left: Distribution of logits on clean test data for models trained with and without logit pairing. Right: Performance against a 10-step PGD attack for models trained with varying amounts of logit regularization, with and without logit pairing.

4. Other forms of logit regularization

Label Smoothing. Label smoothing is the process of replacing the one-hot training distribution of labels with a softer distribution, where the probability of the correct class has been smoothed out onto the incorrect classes [25]. Concretely, label smoothing uses the target distribution:

$$y_c^{(i)} = \begin{cases} 1 - s & \text{if } c \text{ is the correct class} \\ \dfrac{s}{C - 1} & \text{otherwise} \end{cases}$$

where $y_c^{(i)}$ is the target probability for class $c$ for example $i$, the number of categories is denoted by $C$, and $s$ is the smoothing strength. Label smoothing was originally introduced as a form of regularization, designed to prevent models from being too confident about training examples, and had the goal of improved generalization. Furthermore, it can be easily implemented as a preprocessing step on the labels, and does not affect model training time in any significant way. Interestingly, Kurakin et al. [15] found that incorporating a small amount of label smoothing in a model trained on ImageNet actually decreased adversarial robustness by roughly 1%. Here we find a different effect.
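Because label smoothing is only a preprocessing step on the labels, it can be sketched in a few lines. The version below assumes that the smoothed mass `s` is spread evenly over the C - 1 incorrect classes, which is one reading of the description above; the function name is hypothetical.

```python
def smooth_labels(label, num_classes, s):
    """Target distribution: 1 - s on the correct class, s / (C - 1) elsewhere."""
    off_value = s / (num_classes - 1)
    return [1.0 - s if c == label else off_value for c in range(num_classes)]

targets = smooth_labels(label=2, num_classes=10, s=0.1)
print(targets)
```

Setting `s=0.0` recovers the usual one-hot target, and the targets sum to one for any `s` between 0 and 1.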

In Figure 2(left) we show the effect label smoothing has on the performance of a model trained only on clean (i.e. non-adversarial) training data. Very surprisingly, using only label smoothing was able to produce a model that is nearly as robust to this 10-step PGD attack as models trained with PGD-based adversarial training or adversarial logit pairing, both of which take an order of magnitude more time to train – though in fairness we note that when PGD and ALP-based models are trained only on adversarial examples rather than a mixture of clean and adversarial data, their robustness exceeds this performance by around 5%. Furthermore, this benefit of label smoothing comes at no significant loss in accuracy on unperturbed test data, while generally adversarial training tends to trade off original vs. adversarial performance. Another curiosity is that adding in any label smoothing at all dramatically improves robustness to FGSM-based adversaries (adding label smoothing of s = .01 brought accuracy up from 6.1% to 38.3%), while PGD-based attacks saw much more gradual improvement with label smoothing strength. While remarkable, this effect of label smoothing on the loss surface eludes understanding, warranting further research.

Figure 2. Left: Clean and adversarial accuracy on CIFAR-10 as a function of the label smoothing strength. Note that models for this figure were trained exclusively on the original training data – no adversarial examples were involved in the training procedure. Top-Right: Logit distribution of model trained with no label smoothing (middle) and with label smoothing of s = .75 (right), evaluated on the original test images. Bottom-Right: Logit distributions when evaluated on PGD-based adversarial examples.

Examining the logits themselves (Figure 2, right), we see a striking difference between the models – the model trained with label smoothing both has a dramatically smaller dynamic range of logits – roughly 1.2 vs. 20, a 16-fold decrease – and also presents a much more bimodal logit distribution than the model trained without label smoothing. In other words, it has learned to predict extremely consistent values for logits, a property that may contribute to its adversarial robustness. Anecdotally, we observed that this behavior held for all positive values of s, with a stronger effect the higher s was.

This behavior can be explained: when trained with no label smoothing, the cross-entropy loss used in most models encourages model output to be as close to a one-hot distribution as possible, predicting a probability of 1 for the correct category and 0 for all other categories. When viewed as logits instead of probabilities, this corresponds to pushing the logits for the correct and incorrect categories as far apart as possible. However, models trained with label smoothing are instead encouraged to produce a soft distribution, with no probabilities too close to either 0 or 1, corresponding to a bounded target logit difference which gets smaller with increasing s.
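The bounded target logit difference can be made concrete with a small calculation. The sketch assumes smoothed targets of 1 - s for the correct class and s / (C - 1) for each incorrect class, in which case a softmax output matches the target exactly when the correct logit exceeds every incorrect logit by log((1 - s)(C - 1) / s). The function name is hypothetical.

```python
import math

def target_logit_gap(s, num_classes):
    """Gap between correct and incorrect logits at which softmax hits the target."""
    return math.log((1.0 - s) * (num_classes - 1) / s)

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

for s in (0.01, 0.1, 0.5, 0.75):
    gap = target_logit_gap(s, num_classes=10)
    probs = softmax([gap] + [0.0] * 9)
    print(s, round(gap, 3), round(probs[0], 3))
```

Under this assumption the gap for s = 0.75 and ten classes is log 3, a little above 1, which is of the same order as the dynamic range of roughly 1.2 reported for Figure 2, and the gap shrinks as s grows.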

Additional experiments involving label smoothing are given in Section 5.

Paired-Example Data Augmentation. Recently, a new form of data augmentation was found that stands in contrast to standard label-preserving data augmentation. Paired-example data augmentation consists of combining different training examples together, dramatically altering both the appearance of the training examples and their labels. Introduced concurrently by multiple groups [29, 12, 27], these types of data augmentation typically have the form of element-wise weighted averaging of two input examples (typically images), with the training label also determined as a weighted average of the original two training labels (represented as one-hot vectors). Besides making target labels soft (i.e. not 1-of-K) during training time, these methods also encourage models to behave linearly between examples, which may improve robustness to out of sample data. Interestingly, Zhang et al. [29] found that this type of data augmentation improved robustness to FGSM-based attacks on ImageNet [20], but Kannan et al. [13] found that the method did not improve robustness against a targeted attack with a stronger PGD-based adversary.
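The element-wise weighted averaging described above can be sketched as follows. This is plain mixup in the spirit of [29] on flattened inputs with a fixed mixing weight `lam`; in practice the weight is sampled per example, and the VH-mixup variant used in the experiments differs in how the images are combined. All names and values are illustrative.

```python
def mixup(x1, y1, x2, y2, lam):
    """Blend two examples and their one-hot labels with weight lam in [0, 1]."""
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# Two toy "images" (flattened) with one-hot labels for a 3-class problem.
x_a, y_a = [0.0, 1.0, 0.5, 0.25], [1.0, 0.0, 0.0]
x_b, y_b = [1.0, 0.0, 0.5, 0.75], [0.0, 0.0, 1.0]

x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, lam=0.75)
print(x_mix, y_mix)
```

The mixed label remains a valid probability distribution but is no longer one-hot, which is why this augmentation also acts on the training targets.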

Experimentally, we found evidence agreeing with both conclusions – when applying mixup [29], we found a sizeable increase in robustness to FGSM adversaries, going from 6.1% on CIFAR-10 by training without mixup to 30.8% with mixup, but did not observe a significant change when evaluated against a PGD-based adversary. While robustness to a PGD adversary with only 5 steps increased by a tiny amount (from 0.0% to 0.5%), robustness to a 10-step PGD adversary remained at 0%. In our experiments, we use VH-mixup, the slightly improved version of mixup introduced by Summers and Dinneen [24].

4.1. Decoupling Adversarial Logit Pairing

While we have now considered alternate methods by which logits can be regularized, at this point it is still not clear exactly how they might be used with or interact with the logit regularization effect of adversarial logit pairing. Doing so requires decoupling the logit pairing and logit regularization effects of ALP.

In adversarial logit pairing [13], the logit pairing term is implemented as an $L^2$ loss:

$$\left\| z^{(i)} - \tilde{z}^{(i)} \right\|_2^2$$

where $z^{(i)}$ and $\tilde{z}^{(i)}$ are the vectors of original and adversarial logits for example $i$, though other losses such as an $L^1$ or Huber loss are also possible. Expanding this creates a form that makes the pairing and regularization terms evident:

$$\left\| z^{(i)} \right\|_2^2 - 2\, z^{(i)\top} \tilde{z}^{(i)} + \left\| \tilde{z}^{(i)} \right\|_2^2$$

where the first and third terms are explicit logit regularization terms on $z^{(i)}$ and $\tilde{z}^{(i)}$, and the logit pairing effect is only determined by the middle inner product. While an $L^2$ loss is a natural choice for regularization purposes, the pairing term can be improved by considering a more general form:

$$\lambda\, h\left( z^{(i)}, \tilde{z}^{(i)} \right) + \beta \left( \left\| z^{(i)} \right\|_2^2 + \left\| \tilde{z}^{(i)} \right\|_2^2 \right)$$

where $h$ has the express goal of making the logits more similar (with as little logit regularization as possible), and the regularization terms have been grouped with a controllable weighting factor $\beta$. There are several natural choices for $h$, such as the Jensen-Shannon divergence, a cosine similarity, or any similarity metric that does not have a significant regularization effect. We have found that simply taking the cross entropy between the distributions induced by the logits was effective – depending on the actual values of the logits, this can either still have a mild squeezing effect (if the logits are very different), a mild expanding effect (if the logits are very similar), or something in between.
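The decomposition above can be sketched for a single example. The code first checks the expansion of the squared L2 distance into two norm terms and an inner product, then evaluates a decoupled loss that uses a softmax cross entropy as the pairing function. The direction of the cross entropy (clean distribution as the target) and the names `pairing_weight` and `reg_weight` are assumptions made for illustration.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, q):
    """Cross entropy of distribution q relative to target distribution p."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def squared_norm(z):
    return sum(v * v for v in z)

def decoupled_pairing_loss(z, z_adv, pairing_weight, reg_weight):
    """Pairing term h plus an explicit, separately weighted logit penalty."""
    h = cross_entropy(softmax(z), softmax(z_adv))
    reg = squared_norm(z) + squared_norm(z_adv)
    return pairing_weight * h + reg_weight * reg

z, z_adv = [2.0, -1.0], [1.0, 0.0]

# The squared L2 pairing loss expands into two norms and an inner product.
l2 = sum((a - b) ** 2 for a, b in zip(z, z_adv))
inner = sum(a * b for a, b in zip(z, z_adv))
expanded = squared_norm(z) - 2.0 * inner + squared_norm(z_adv)
print(l2, expanded)
print(decoupled_pairing_loss(z, z_adv, pairing_weight=1.0, reg_weight=0.01))
```

Separating the two weights lets the strength of the logit penalty be tuned without changing how strongly the two sets of logits are pulled together.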

One implementation detail worth noting is that it can be difficult to reason about and set the relative strengths of the pairing loss and adversarial training loss. To that end, we set the strength of the pairing loss h as a constant fraction of the adversarial loss, implemented by setting the coefficient of the loss as a constant multiplied by a non-differentiable version of the ratio between the losses.
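One possible reading of this implementation detail is sketched below: the coefficient is computed from the current loss values and then treated as a constant, so that the weighted pairing term always equals a fixed fraction of the adversarial loss in value. In an automatic differentiation framework the ratio would be wrapped in a stop-gradient operation; here plain floats stand in for loss values, and the function name, the `fraction` argument, and the `tiny` guard are hypothetical.

```python
def pairing_coefficient(adv_loss_value, pairing_loss_value, fraction, tiny=1e-12):
    """Coefficient that scales the pairing loss to fraction * adversarial loss.

    Both inputs are plain numbers, i.e. no gradient flows through the ratio.
    """
    return fraction * adv_loss_value / max(pairing_loss_value, tiny)

adv_loss_value = 1.6      # made-up current adversarial training loss
pairing_loss_value = 0.2  # made-up current pairing loss
coeff = pairing_coefficient(adv_loss_value, pairing_loss_value, fraction=0.125)
weighted_pairing = coeff * pairing_loss_value
print(coeff, weighted_pairing)
```

Because the coefficient is recomputed at every step, the relative strength of the two losses stays fixed even as their raw magnitudes drift during training.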

By decomposing adversarial logit pairing explicitly into logit pairing and logit regularization terms in this way, adversarial robustness to a 10-step PGD attack improves by an absolute 1.9% over ALP, or 5.6% over PGD-based adversarial training.

5. Additional experiments

In this section we present additional experiments on three datasets: The primary two datasets are CIFAR-10 and CIFAR-100, which are datasets for 10-way and 100-way classification, respectively, each with 50,000 examples, on which we evaluate both white-box and black-box adversarial attacks. Additionally, we evaluate on SVHN [17], which has significantly greater scale in training examples (604,388) and whose 10 classes have an uneven distribution, with the most common class roughly 3 times more common than the least common class.

5.1. Implementation Details

In the experiments throughout this paper on CIFAR-10/100, we used an 18-layer ResNet [10], equivalent to the “simple” model of Madry et al. [16], with a weight decay of and a momentum optimizer with strength of 0.9. Standard data augmentation of random crops and horizontal flips was used. After a warm up period of 5 epochs, the learning rate peaked at 0.1 and decayed by a factor of 10 at 100 and 150 epochs, training for a total of 200 epochs for models not trained on adversarial examples and 101 epochs for models using adversarial training – adversarial accuracy tends to increase for a brief period of time after a learning rate decay, then quickly drop by a small amount, an empirical finding also echoed by Schmidt et al. [22]. The minibatch size was 128.
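The learning rate schedule described above can be written as a small function of the epoch index. The shape of the warm-up is not specified in the text, so the linear ramp below is an assumption, as are the function and argument names; epochs are counted from zero.

```python
def learning_rate(epoch, peak=0.1, warmup_epochs=5,
                  decay_epochs=(100, 150), decay_factor=10.0):
    """Step schedule: warm up to the peak, then divide by 10 at each boundary."""
    if epoch < warmup_epochs:
        # Assumed linear warm-up that reaches the peak right after warm-up ends.
        return peak * (epoch + 1) / (warmup_epochs + 1)
    rate = peak
    for boundary in decay_epochs:
        if epoch >= boundary:
            rate /= decay_factor
    return rate

for epoch in (0, 4, 5, 99, 100, 150, 199):
    print(epoch, learning_rate(epoch))
```

Stopping adversarially trained models at epoch 101 then corresponds to ending just after the first decay boundary.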

Adversarial examples were constrained to a maximum $L^\infty$ norm of 0.03, and all PGD-based attacks used a step size of 0.0078. For our implementation of adversarial logit pairing, on CIFAR-10 we used a pairing coefficient of 0.5 as recommended by [13], and found a larger coefficient of 5.0 more effective on CIFAR-100.

On SVHN, a smaller 8-layer ResNet was used for computational efficiency due to the scale of the dataset, which is a considerable challenge to overcome with adversarial training. Models were trained for 101 epochs, with a learning rate of .001 for the first 5 epochs, .01 until epoch 80, and .001 afterward. Data augmentation consisted of random crops after an initial 4 pixel padding. Adversarial attacks on SVHN were performed with an $L^\infty$ constraint on the perturbation of 12/255 with a step size of 3/255. All adversarial attacks were constructed using the CleverHans library [18], implemented in TensorFlow [1], and all experiments were done on two Nvidia Geforce GTX 1080 Ti GPUs.

Table 1. White-box accuracy of models on CIFAR-10.

5.2. Towards a more robust model

Given these forms of logit regularization, perhaps the most natural question is whether they can be combined to create an even more robust model. Thus, in this section we focus exclusively on making a model (and comparable baselines) as robust as possible to PGD-based attacks. In particular, for baseline methods (PGD-based adversarial training [16] and adversarial logit pairing [13]), we opt to train exclusively on adversarial examples, effectively setting the weight on the original examples to zero in Equation 1, which roughly trades off accuracy on clean test examples for a similar gain in adversarial performance.

To combine the logit regularization methods together, on CIFAR-10 and CIFAR-100 we use a modest amount of label smoothing (s = 0.1) and use VH-mixup [24] on the input examples. For the logit pairing formulation presented in Section 4.1, we found different hyperparameters worked best on our two evaluation datasets. On CIFAR-10, we set , and set the ratio between the adversarial training loss and the pairing loss to 0.125, which focuses the loss on keeping adversarial and original examples similar. On CIFAR-100, the more challenging dataset, we use and a ratio of 0.95, which allows the network to focus more on fitting the data while still maintaining a balance with defending against adversarial examples. On SVHN, we set s = 0.2, use , employ regular mixup [29] (which is more suitable for the imagery of SVHN), and use a ratio of 0.95, similar to CIFAR-100. We note that these parameters were not tuned much due to resource constraints. We refer to this combination simply as LRM (“Logit Regularization Methods”).

CIFAR-10. White-box performance on CIFAR-10 is shown in Table 1. LRM achieves the highest level of adversarial robustness of the methods considered for all PGD-based attacks, and to the best of our knowledge represents the most robust method on CIFAR-10 to date. However, like other adversarial defenses, this comes at the cost of performance on the original test set, which makes sense – from the perspective of adversarial training, a clean test image is simply the center of the set of feasible adversarial examples. Nonetheless, it is interesting that the tradeoff between adversarial and non-adversarial performance can continue to be pushed further, with the optimal value of that tradeoff dependent on application, i.e. whether worst-case performance is more important than performance on the original unperturbed examples.

Table 2. Black-box accuracy of models on CIFAR-10.

Table 3. White-box accuracy of models on CIFAR-100.

Next, black-box performance is shown in Table 2. As is standard in most black-box evaluations of adversarial defenses, this is performed by generating adversarial examples with one model (the “Source”) and evaluating them on a separate independently trained model (the “Target”). In this experiment, we use a 10-step PGD attack to generate adversarial examples. As found in other works [16], the success of a black-box attack depends both on how similar the training procedure was between the source and target models and on the strength of the source model – for example, LRM uniformly results in a stronger black-box attack than ALP [13], which itself is a uniformly stronger black-box attack than adversarial training with PGD [16]. As such, using LRM as the source mildly damages the black-box defenses of PGD and ALP.
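The source and target protocol described above can be sketched with two toy linear classifiers. Adversarial examples are crafted with a multi-step signed-gradient attack that only queries the source model and are then scored on an independently specified target model. The weights, data, and step sizes are made up for illustration, and the attack starts from the clean input for determinism rather than from a random point.

```python
import math

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def predict(w, x):
    return 1 if dot(w, x) >= 0 else -1

def sign(v):
    return (v > 0) - (v < 0)

def craft_on_source(source_w, x, y, eps, step_size=0.05, num_steps=10):
    """Multi-step signed-gradient attack that only uses the source model."""
    adv = list(x)
    for _ in range(num_steps):
        # Derivative of the logistic loss w.r.t. the source model's score.
        scale = -y / (1.0 + math.exp(y * dot(source_w, adv)))
        adv = [ai + step_size * sign(scale * wi) for ai, wi in zip(adv, source_w)]
        adv = [min(max(ai, xi - eps), xi + eps) for ai, xi in zip(adv, x)]
    return adv

def accuracy(w, examples):
    return sum(predict(w, x) == y for x, y in examples) / len(examples)

def black_box_accuracy(source_w, target_w, data, eps):
    adversarial = [(craft_on_source(source_w, x, y, eps), y) for x, y in data]
    return accuracy(target_w, adversarial)

source_w, target_w = [1.0, 1.0], [1.0, 0.2]
data = [([0.2, 0.1], 1), ([-0.3, -0.2], -1), ([0.05, 0.3], 1)]
print(accuracy(target_w, data))
print(black_box_accuracy(source_w, target_w, data, eps=0.15))
```

How much accuracy the target loses depends on how closely the source model's gradients align with its own, which mirrors the dependence on source and target similarity noted in the text.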

Interestingly, label smoothing was fairly effective as a black-box defense, being among the most robust models across all different sources. In particular, label smoothing had the highest minimum performance across sources (over 10% higher than any other method), which is particularly surprising given its near-zero cost compared to the adversarially-trained models.

CIFAR-100. White-box performance on CIFAR-100 is presented in Table 3. Again, we find that LRM achieves the highest level of adversarial robustness to all adversarial attacks, with ALP also strictly better than PGD-based adversarial training. Interestingly, though, we find that label smoothing fails completely against all attacks on CIFAR-100, behaving almost completely differently than on CIFAR-10. Although examining this was not the goal of our work, this does highlight the importance of evaluating proposed adversarial defenses on multiple datasets.

The corresponding results for black-box attacks on CIFAR-100 are shown in Table 4, where we again find clear differences across datasets. This time, the difference is that all methods perform much more similarly to one another, with the exception of clear differences of transferring from PGD-trained models to non-adversarially trained models.

SVHN. We demonstrate performance against white-box attacks on SVHN in Table 5. Despite the large differences in dataset scale, image type, and imbalance in class frequencies, we again find that our method, LRM, is the most robust of all approaches on every adversarial attack, with patterns in the performance of each defense very similar to the other datasets, providing additional evidence that our methods and insights are generalizable.

5.3. Evaluating Stronger Attacks

Evaluating adversarial defenses is difficult to do correctly – since evaluating against any attack merely provides an upper bound on adversarial robustness, it is critical to evaluate on the strongest attacks available to make the bound as tight as possible. Furthermore, care must be taken to avoid gradient masking or obfuscated gradients [2], which can lead to a false sense of security.

Here we evaluate white-box performance on CIFAR-10 with two very strong attacks: a 1,000-step PGD adversary (the same attack that ALP succumbed to on ImageNet), and SPSA [28], a gradient-free attack that is effective at uncovering gradient masking. Results are given in Table 6. Note that SPSA is evaluated against a representative 1,000-image sample of the evaluation set for efficiency, since a full evaluation would take roughly 90 hours, and that we use the same evaluation settings as provided in [28].

We find almost no difference when going from a 20-step to a 1,000-step PGD attack for any method except label smoothing, which loses most of its robustness. This suggests that label smoothing, while providing only mild worst-case adversarial robustness, can make the adversarial optimization problem much more challenging, which we believe is also the underlying reason for its effectiveness against black-box attacks. Based on this conjecture, we also evaluated label smoothing as a black-box defense against a 1,000-step PGD attack and found a much smaller drop in performance, from 67.3% to 60.8%, confirming that label smoothing still has a place among black-box defenses. The exact mechanism by which label smoothing makes the search for adversarial examples more difficult, however, remains elusive, and we think it is an interesting avenue for further research.
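
As a rough illustration of what the step count controls, the sketch below implements an L-infinity PGD attack in NumPy against a linear softmax classifier, which stands in for a network so that the input gradient has a closed form. It is not the evaluation code used for Table 6; the names `W`, `b`, `eps`, `step_size`, and `num_steps`, and all numeric values, are assumptions chosen for the example.

```python
import numpy as np


def softmax(z):
    z = z - z.max()  # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def loss_and_grad(W, b, x, y):
    """Cross-entropy loss of a linear softmax model and its gradient w.r.t. x."""
    p = softmax(W @ x + b)
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    return -np.log(p[y]), W.T @ (p - onehot)


def pgd_attack(W, b, x, y, eps, step_size, num_steps, rng):
    """Signed-gradient ascent on the loss, projected onto the eps-ball around x."""
    # Random start inside the allowed perturbation set.
    x_adv = np.clip(x + rng.uniform(-eps, eps, size=x.shape), 0.0, 1.0)
    for _ in range(num_steps):
        _, grad = loss_and_grad(W, b, x_adv, y)
        x_adv = x_adv + step_size * np.sign(grad)
        x_adv = np.clip(x_adv, x - eps, x + eps)  # stay within eps of x
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay a valid input
    return x_adv


# Toy two-class problem (hypothetical values).
W = np.array([[1.0, -1.0], [-1.0, 1.0]])
b = np.zeros(2)
x = np.array([0.6, 0.4])
y = 0
rng = np.random.default_rng(0)

x_adv = pgd_attack(W, b, x, y, eps=0.1, step_size=0.025, num_steps=20, rng=rng)
clean_loss, _ = loss_and_grad(W, b, x, y)
adv_loss, _ = loss_and_grad(W, b, x_adv, y)
print(clean_loss, adv_loss)
```

On a simple loss surface like this one, a few steps already reach the worst point in the ball, so extra steps change nothing; a defense whose accuracy drops sharply between 20 and 1,000 steps is one where the attack's optimization, rather than the model's decision boundary, was the limiting factor.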

On the other hand, an SPSA attack removes some of the difference in robustness between PGD, ALP, and LRM. While this suggests that ALP and LRM are likely performing some form of gradient masking that PGD cannot detect, even with 1,000 iterations, it also shows that there is a real gain in adversarial robustness even under strong gradient-free attacks.
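
To show why a gradient-free attack is not affected by masked gradients, the sketch below gives the core SPSA gradient estimator: it queries only function values at randomly perturbed inputs and never calls backpropagation. This is a minimal NumPy sketch rather than the attack configuration of [28]; the outer optimization loop is omitted, and the function name `spsa_gradient`, the test function `f`, and all parameter values are assumptions.

```python
import numpy as np


def spsa_gradient(f, x, delta, num_samples, rng):
    """Estimate the gradient of a scalar function f at x from function values only.

    Each sample draws a random +/-1 direction v and uses the two-sided
    difference (f(x + delta*v) - f(x - delta*v)) / (2*delta) along v.
    """
    grad = np.zeros_like(x, dtype=float)
    for _ in range(num_samples):
        v = rng.choice([-1.0, 1.0], size=x.shape)
        diff = f(x + delta * v) - f(x - delta * v)
        # For +/-1 entries, dividing by v equals multiplying by v.
        grad += diff / (2.0 * delta) * v
    return grad / num_samples


# Hypothetical check on a function whose true gradient is known.
a = np.array([1.0, -2.0])


def f(z):
    return float(a @ z)


rng = np.random.default_rng(0)
x0 = np.array([0.3, 0.7])
estimate = spsa_gradient(f, x0, delta=0.01, num_samples=4000, rng=rng)
print(estimate)  # should be close to a, up to sampling noise
```

Because the estimate depends only on the model's outputs, it remains informative when the analytic gradient is uninformative, at the cost of many more model evaluations per step.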

Table 4. Black-box accuracy of models on CIFAR-100.

Table 5. White-box accuracy of models on SVHN.

Table 6. Evaluating models against the strongest white-box attacks on CIFAR-10. SPSA is evaluated on a 1,000-image (10%) subsample, and a 20-step PGD attack is provided for context.

6. Discussion

In this work, we have shown the usefulness of logit regularization for improving the robustness of neural models of computer vision to adversarial examples. We first presented an analysis of adversarial logit pairing, showing that roughly half of its improvement over adversarial training can be attributed to a non-obvious logit regularization effect. Based on this, we investigated two other forms of logit regularization, demonstrating the benefits of both, and then presented an alternative method for adversarial logit pairing that more cleanly decouples the logit pairing and logit regularization effects while also improving performance.

By combining these logit regularization techniques together, we were able to create both a stronger defense against white-box PGD-based attacks and also a stronger attack against PGD-based defenses, both of which come at almost no additional cost to PGD-based adversarial training. We also demonstrated the surprising strength of label smoothing as a black-box defense and its paradoxical weakness to highly-optimized white-box attacks.

We anticipate that future work will push the limits of logit regularization even further to improve defenses against adversarial examples, possibly drawing on techniques originally devised for other purposes [19]. We also hope that these investigations will yield insights into training adversarially robust models without the overhead of multistep adversarial training, an obstacle that has made it challenging to scale up adversarial defenses to larger datasets without very large computational budgets.

References

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In OSDI, volume 16, pages 265–283, 2016.

[2] A. Athalye, N. Carlini, and D. Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. In International Conference on Machine Learning, 2018.

[3] E. D. Cubuk, B. Zoph, D. Mane, V. Vasudevan, and Q. V. Le. Autoaugment: Learning augmentation policies from data. arXiv preprint arXiv:1805.09501, 2018.

[4] N. Dalvi, P. Domingos, S. Sanghai, D. Verma, et al. Adversarial classification. In Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 99–108. ACM, 2004.

[5] L. Engstrom, A. Ilyas, and A. Athalye. Evaluating and understanding the robustness of adversarial logit pairing. arXiv preprint arXiv:1807.10272, 2018.

[6] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, and D. Song. Robust physical-world attacks on deep learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1625–1634, 2018.

[7] S. G. Finlayson, J. D. Bowers, J. Ito, J. L. Zittrain, A. L. Beam, and I. S. Kohane. Adversarial attacks on medical machine learning. Science, 363(6433):1287–1289, 2019.

[8] J. Gilmer, R. P. Adams, I. Goodfellow, D. Andersen, and G. E. Dahl. Motivating the rules of the game for adversarial example research. arXiv preprint arXiv:1807.06732, 2018.

[9] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

[10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016.

[11] G. Hinton, O. Vinyals, and J. Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

[12] H. Inoue. Data augmentation by pairing samples for images classification. arXiv preprint arXiv:1801.02929, 2018.

[13] H. Kannan, A. Kurakin, and I. Goodfellow. Adversarial logit pairing. arXiv preprint arXiv:1803.06373, 2018.

[14] A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.

[15] A. Kurakin, I. Goodfellow, and S. Bengio. Adversarial examples in the physical world. arXiv preprint arXiv:1607.02533, 2016.

[16] A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu. Towards deep learning models resistant to adversarial attacks. In International Conference on Learning Representations, 2018.

[17] Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. 2011.

[18] N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, A. Matyasko, V. Behzadan, K. Hambardzumyan, Z. Zhang, Y.-L. Juang, Z. Li, R. Sheatsley, A. Garg, J. Uesato, W. Gierke, Y. Dong, D. Berthelot, P. Hendricks, J. Rauber, and R. Long. Technical report on the cleverhans v2.1.0 adversarial examples library. arXiv preprint arXiv:1610.00768, 2018.

[19] G. Pereyra, G. Tucker, J. Chorowski, Ł. Kaiser, and G. Hinton. Regularizing neural networks by penalizing confident output distributions. In International Conference on Learning Representations, 2017.

[20] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

[21] P. Samangouei, M. Kabkab, and R. Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. In International Conference on Learning Representations, 2018.

[22] L. Schmidt, S. Santurkar, D. Tsipras, K. Talwar, and A. Madry. Adversarially robust generalization requires more data. In Advances in Neural Information Processing Systems, 2018.

[23] Y. Song, T. Kim, S. Nowozin, S. Ermon, and N. Kushman. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. In International Conference on Learning Representations, 2018.

[24] C. Summers and M. J. Dinneen. Improved mixed-example data augmentation. In IEEE Winter Conference on Applications of Computer Vision, 2019.

[25] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016.

[26] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.

[27] Y. Tokozume, Y. Ushiku, and T. Harada. Between-class learning for image classification. In Computer Vision and Pattern Recognition, 2018.

[28] J. Uesato, B. O’Donoghue, A. v. d. Oord, and P. Kohli. Adversarial risk and the dangers of evaluating against weak attacks. arXiv preprint arXiv:1802.05666, 2018.

[29] H. Zhang, M. Cisse, Y. N. Dauphin, and D. Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018.
