Robustness to Programmable String Transformations via Augmented Abstract Training

2020·Arxiv

Abstract

Abstract

Deep neural networks for natural language processing tasks are vulnerable to adversarial input perturbations. In this paper, we present a versatile language for programmatically specifying string transformations—e.g., insertions, deletions, substitutions, swaps, etc.—that are relevant to the task at hand. We then present an approach to adversarially training models that are robust to such user-defined string transformations. Our approach combines the advantages of search-based techniques for adversarial training with abstraction-based techniques. Specifically, we show how to decompose a set of user-defined string transformations into two component specifications, one that benefits from search and another from abstraction. We use our technique to train models on the AG and SST2 datasets and show that the resulting models are robust to combinations of user-defined transformations mimicking spelling mistakes and other meaning-preserving transformations.

1. Introduction

Deep neural networks have proven incredibly powerful in a huge range of machine-learning tasks. However, deep neural networks are highly sensitive to small input perturbations that cause the network’s accuracy to plummet (Carlini & Wagner, 2017; Szegedy et al., 2013). In the context of natural language processing, these adversarial examples come in the form of spelling mistakes, use of synonyms, etc.—essentially, meaning-preserving transformations that cause the network to change its prediction (Ebrahimi et al., 2018; Zhang et al., 2019a; Michel et al., 2019).

In this paper, we are interested in the problem of training models over natural language—or, generally, sequences over

1Department of Computer Science, University of WisconsinMadison, Madison, USA. Correspondence to: Yuhao Zhang <yuhaoz@cs.wisc.edu>.

Proceedings of the International Conference on Machine Learning, Online, PMLR 119, 2020. Copyright 2020 by the author(s).

a finite alphabet—that are robust to adversarial examples. Sequences over finite alphabets are unique in that the space of adversarial examples is discrete and therefore hard to explore efficiently using gradient-based optimization as in the computer-vision setting. The common approach to achieving robustness is adversarial training (Goodfellow et al., 2015; Madry et al., 2018), which has seen a great deal of research in computer vision and, more recently, in natural language processing (Ebrahimi et al., 2018; Zhang et al., 2019a; Michel et al., 2019). Suppose we have defined a space of perturbations R(x) of a sample x—e.g., if x is a sentence, R(x) contains every possible misspelling of words in x, up to some bound on the number of misspellings. The idea of adversarial training is to model an adversary within the training objective function: Instead of computing the loss for a sample (x, y) from the dataset, we compute the loss for the worst-case perturbed sample . Formally, the adversarial loss for

The question we ask in this paper is:

Can we train models that are robust against rich perturbation spaces over strings?

The practical challenge in answering this question is computing the worst-case loss. This is because the perturbation space R(x) can be enormous and therefore impractical to enumerate. This is particularly true for NLP tasks, where the perturbation space R(x) should contain inputs that are semantically equivalent to x—e.g., variations of the sentence x with typos or words replaced by synonyms. Therefore, we need to approximate the adversarial loss. There are two such classes of approximation techniques:

Augmentation The first class of techniques computes a lower bound on the adversarial loss by exploring a finite number of points in R(x). This is usually done by applying a gradient-based attack, like HotFlip (Ebrahimi et al., 2018) for natural-language tasks or PGD (Madry et al., 2018) for computer-vision tasks. We call this class of techniques augmentation-based, as they essentially search for a perturbed sample with which to augment the training set.

Abstraction The second class of techniques computes an

Figure 1. Illustration of augmentation, abstraction, and A3T

upper bound on the adversarial loss by overapproximating, or abstracting, the perturbation space R(x) into a set of symbolic constraints that can be efficiently propagated through the network. For example, the interval abstraction has been used in numerous works (Mirman et al., 2018; Gowal et al., 2019; Huang et al., 2019). We call this class of techniques abstraction-based.

Both classes of techniques can produce suboptimal results: augmentation can severely underapproximate the worst-case loss and abstraction can severely overapproximate the loss. Particularly, we observe that the two techniques have complementary utility, working well on some perturbation spaces but not others—for example, Huang et al. (2019) have shown that abstraction works better for token substitutions, while augmentation-based techniques like HotFlip (Ebrahimi et al., 2018) and MHA (Zhang et al., 2019a) are general—e.g., apply to token deletions and insertions.

1.1. Our Approach

A hybrid approach We propose augmented abstract adversarial training (A3T), an adversarial training technique that combines the strengths of augmentation and abstraction techniques. The key idea underlying A3T is to decompose the perturbation space into two subsets, one that can be explored using augmentation and one that can be abstracted— e.g., using augmentation to explore word duplication typos and abstraction to explore replacing words with synonyms. From an algorithmic perspective, our computation of adversarial loss switches from a concrete, e.g., gradient-based, search through the perturbation space to a symbolic search. As such, for every training sample (x, y), our technique may end up with a lower bound or an upper bound on its adversarial loss (see Fig. 1).

A language for specifying string transformations The challenge of applying A3T is how to exactly decompose the perturbation space. Our key enabling idea is to de-fine the perturbation space programmatically, in a way that can be easily and cleanly decomposed. Specifically, we define a general language in which we can specify a perturbation space by a specification S in the form of , containing a set of string transformations . The specification S defines a perturbation space of all possible strings by applying each transformation times. For example, given a string x, S(x) could define the set of all strings that are like x but with some words replaced by one of its synonyms and with some stop words removed.

Given a perturbation space defined by a set of transformations, A3T decomposes the set of transformations into two disjoint subsets, one that is explored concretely (augmentation) and one that is explored symbolically (abstraction).

Results We have implemented A3T and used it to train NLP models for sentiment analysis that are robust to a range of string transformations—e.g., character swaps modeling spelling mistakes, substituting of a word with a synonym, removing stop words, duplicating words, etc. Our results show that A3T can train models that are more robust to adversarial string transformations than those produced using existing techniques.

1.2. Summary of Contributions

• We present A3T, a technique for training models that are robust to string transformations. A3T combines search-based attacks and abstraction-based techniques to explore the perturbation space and compute good approximations of adversarial loss.

• To enable A3T, we define a general language of string transformations with which we can specify the perturbation space. A3T exploits the specification to decompose and search the perturbation space.

• We implement A3T1 and evaluate it on two datasets and a variety of string transformations. Our results demonstrate the increase in robustness achieved by A3T in comparison with state-of-the-art techniques.

2. Related Work

Adversarial text generation Zhang et al. (2019b) presented a comprehensive overview of adversarial attacks on neural networks over natural language. In this paper, we focus on the word- and character-level. HotFlip (Ebrahimi et al., 2018) is a gradient-based approach that can generate the adversarial text in the perturbation space described by word- and character-level transformations. MHA (Zhang et al., 2019a) uses Metropolis-Hastings sampling guided by gradients to generate word-level adversarial text via word substitution. Karpukhin et al. (2019) designed character- level noise for training robust machine translation models. Other work focuses on generating adversarial text on the sentence-level (Liang et al., 2018) or paraphrase-level (Iyyer et al., 2018; Ribeiro et al., 2018). Also, some works (Zhao et al., 2018; Wang et al., 2019) try to generate natural or grammatically correct adversarial text.

Abstract training Mirman et al. (2018) and Gowal et al. (2019) first proposed DiffAI and interval bound propagation (IBP) to train image classification models that are provably robust to norm-bounded adversarial perturbations. They performed abstract training by optimizing the abstract loss obtained by Interval or Zonotope propagation. Mirman et al. (2019) used DiffAI to defend deep residual networks provably. Jia et al. (2019) used interval domain to capture the perturbation space of substitution and train robust models for CNN and LSTM. Huang et al. (2019) proposed a simplex space to capture the perturbation space of substitution. They converted the simplex into intervals after the first layer of the neural network and obtained the abstract loss by IBP. We adopt their abstract training approach for some of our transformations. We will show the limitation of abstract training for more complex perturbations like the combination of swap and substitution.

Other robustness techniques Other techniques to ensure robustness involve placing a spelling-mistake-detection model that identifies possible adversaries before the underlying model (Pruthi et al., 2019; Sakaguchi et al., 2017).

Formal verification for neural networks In the verifica-tion for NLP tasks, Shi et al. (2020) combined forward propagation and a tighter backward bounding process to achieve the formal verification of Transformers. Welbl et al. (2020) proposed the formal verification under text deletion for models based on the popular decomposable attention mechanism by interval bound propagation. POPQORN (Ko et al., 2019) is a general algorithm to quantify the robustness of recurrent neural networks, including RNNs, LSTMs, and GRUs. COLT (Balunovic & Vechev, 2020) combines formal verification and adversarial training to train neural networks. In this paper, we mix verification techniques, namely, interval propagation, with search-based techniques.

3. The Perturbation-Robustness Problem

In this section, we (1) formalize the perturbation-robustness problem and (2) define a string transformation language for specifying the perturbation space.

3.1. Perturbation Robustness

Classification setting We consider a standard classification setting with samples from some domain X and labels from Y. Given a distribution D over samples and labels, our goal is to find the optimal parameters of some neural-network

architecture that minimize the expected loss

We are interested in the setting where the sample space X defines strings over some finite alphabet . The alphabet can be, for example, English characters (in a character-level model) or entire words (in a world-level model). Therefore, the domain X in our setting is , i.e., the set of all strings of elements of . We will use to denote a string and to denote the ith element of the string.

Perturbation space We define a perturbation space R as a function in takes a string x and returns a finite set of possible perturbed strings of x.

We will use a perturbation space to denote a set of strings that should receive the same classification by our network. For example, R(x) could define a set of sentences paraphrasing x. We can thus modify our training objective into a robust-optimization problem, following Madry et al. (2018):

This inner objective is usually hard to solve; in our setting, the perturbation space can be very large and we cannot afford to consider every single point in that space during training. Therefore, as we discussed in Section 1, typically approximations are made.

Exhaustive accuracy Once we have trained a model using the robust optimization objective, we will use exhaustive accuracy to quantify its classification accuracy in the face of perturbations. Specifically, given a dataset and a perturbation space R, we define exhaustive accuracy as follows:

Intuitively, for each sample , its classification is considered correct iff predicts for every single point in . We use exhaustive accuracy instead of the commonly used adversarial accuracy because (1) exhaustive accuracy provides the ground truth accuracy of the discrete perturbation spaces and does not depend on an underlying adversarial attack, and (2) the discrete spaces make it easy for us to compute exhaustive accuracy by enumeration and at the same time hard for the gradient-based adversarial attacks to explore the space.

3.2. A Language for Specifying Perturbations

We have thus far assumed that the perturbation space is provided to us. We now describe a language for modularly specifying a perturbation space.

A specification S is defined as follows:

where each denotes a string transformation that can be applied up to times. Formally, a string transformation predicate (in practice, a regular expression) describing the substrings of the inputs to which the transformation can be applied, and transformer describing how the substrings matched by can be replaced.

Single transformations Before defining the semantics of our specification language, we illustrate a few example spec-ifications involving single transformations:

Example 1 (Suppose we want to de-fine a transformation that deletes a stop word—and, the, is, etc.—mimicking a typo. The predicate will be a regular expression matching all stop words. The transformer will be simply the function that takes a string and returns the set containing the empty string, . Consider a specification that applies the transformation up to one time. On the following string, They are at school, the predicate matches the substrings are and at. In both cases, we apply the predicate to the matched word and insert the output of in its position. This results in the set containing the original string (0 transformations are applied) and the two strings They at school and They are school. Applying a specifica-tion , which is allowed to apply at most twice, to the same input would result in a set of strings containing the strings above as well as the string They school.

Example 2 (). Say we want to transform occurrences of nice into one of its synonyms, enjoyable and pleasant. We define the predicate that is true iff x = nice, and we define . Given the string This is nice!, it will be transformed into the set {This is enjoyable!, This is pleasant!}.

Applying a specification to the same input would result in the same set of strings above. Because the predicate only matches the word nice.

Example 3 (Now consider the case where we would like to swap adjacent vowels. will be defined as the regular expression that matches two adjacent characters that are vowels. Next, since can only have length 2, the transformer will be the swap function

Multiple transformations As discussed above, a spec-ification S in our language is a set of transformations where each is a pair . The formal semantics of our language can be found in the supplementary Appendix. Informally, a string z is in the perturbation space S(x) if it can be obtained by (1) finding a set of non-overlapping substrings of x that match the various predicates and such that at most substrings in are matches of , and (2) replacing each substring matched by with a string in . The complexity of the formalization is due to the requirement that matched substrings should not overlap—this requirement guarantees that each character in the input is only involved in a single transformation and will be useful when formalizing our abstract training approach in Section 4.2.

Example 4 (Multiple Transformations). Using the transformations we can define the specification

Then, This house is nice) results in the set of strings:

This house is nice This house is enjoyable This house is pleasant This huose is nice This huose is enjoyable This huose is pleasant

The transformed portions are shown in bold. Note that we apply up to 1 of each transformation, thus we also get the original string. Also, note that the two transformations cannot modify overlapping substrings of the input; for example, did not swap the ea in pleasant.

4. Augmented Abstract Adversarial Training

In this section, we describe our abstract training technique, A3T, which combines augmentation and abstraction.

Recall the adversarial training objective function, Eq. (2). The difficulty in solving this objective is the inner maximization objective: where the perturbation space R(x) can be intractably large to efficiently enumerate, and we therefore have to resort to approximation. We begin by describing two approximation techniques and then discuss how our approach combines and extends them.

Augmentation (search-based) techniques We call the first class of techniques augmentation techniques, since they search for a worst-case sample in the perturbation space R(x) with which to augment the dataset. The na¨ıve way is to simply enumerate all points in R(x)—our specifications induce a finite perturbation space, by construction. Unfortunately, this can drastically slow down the training. For example, suppose T defines a transformation that swaps two adjacent characters. On a string of length N, the specifica-tion (T, 2) results in transformations.

An efficient alternative, HotFlip, was proposed by Ebrahimi et al. (2018). HotFlip efficiently encodes a transformation T as an operation over the embedding vector and approximates the worst-case loss using a single forward and backward pass through the network. To search through a set of transformations, HotFlip employs a beam search of some size k

to get the top-k perturbed samples. This technique yields a point in R(x) that may not have the worst-case loss. Alternatives like MHA (Zhang et al., 2019a) can also be used as augmentation techniques.

Abstraction techniques Abstraction techniques compute an over-approximation of the perturbation space, as a symbolic set of constraints. This set of constraints is then propagated through the network, resulting in an upper bound on the worst-case loss. Specifically, given a transformation T, we define a corresponding abstract transformation such that for all x, the constraint

Our use of abstraction builds upon the work of Huang et al. (2019), which uses an interval domain to define — i.e., is a conjunction of constraints on each character. We will describe how we generalize their approach in Section 4.2; for now, we assume that we can efficiently overapproximate the worst-case loss for by propagating it through the network.

4.1. A3T: A High-Level View

The key idea of A3T is to decompose a specification S into two sets of transformations, one containing transformations that can be effectively explored with augmentation and one containing transformations that can be precisely abstracted.

Algorithm 1 shows how A3T works. First, we decompose the specification S into two subsets of transformations, resulting in two specifications, and . For , we apply an augmentation technique, e.g., HotFlip or MHA, to come up with a list of top-k perturbed samples in the set —this is denoted as the set augment

Then, for each point z in the top-k results, we compute an abstraction , which is a set of constraints over-approximating the set of points in . Recall our overview in Fig. 1 for a visual depiction of this process.

Finally, we return the worst-case loss.

4.2. Computing Abstractions

We now show how to define the abstraction of a perturbation space S(x) defined by a specification . Our approach generalizes that of Huang et al. (2019) to length-preserving transformations, i.e., ones

Figure 2. Illustration of an abstraction of a single (left) and multiple (right) transformations. See Example 5 and Example 6 for details.

where the length of every string in S(x) is the same as the length of the original string x. The approach of Huang et al. (2019) targeted the special case of single-character substitutions.

Single transformation case We first demonstrate the case of a single length-preserving transformation, Henceforth we assume that each element of a string x is a real value, e.g., the embedding of a character or word. At a high level, our abstraction computes the convex hull that contains all the points in T(x) (we use T(x) as a short hand for the perturbation space obtained by applying T to x exactly once) and then scales this convex hull by to account for the cases in which T is applied up to times. We begin by computing all points in T(x). Let this set be . Next, for , we define the set of points

We then construct the abstraction as the convex hull of the points and x. Observe that we only need to enumerate the space T(x) obtained by applying T once; multiplying by the convex hull to include all strings that involve up to applications of T, i.e., S(x). To propagate this convex hull through the network, we typically overapproximate it as a set of interval constraints, where each dimension of a string is represented by a lower and an upper bound. Interval constraints are easier to propagate through the network—requiring several forward passes linear in the length of the string—compared to arbitrary convex polyhedra, whose operations can be exponential in the number of dimensions (Cousot & Halbwachs, 1978).

Example 5. Consider the left side of Fig. 2. Say we have the string ab and the transformation T that can replace character a with z, or b with g—mimicking spelling mistakes, as these characters are adjacent on a QWERTY keyboard. The large shaded region is the result of dilating T with , i.e., contains all strings that can be produced by applying T twice to the string ab, namely, the string zg.

General case We generalize the above abstraction process to the perturbation space . First, we enumerate all the strings in . (Notice that we need only consider each transformation independently.) Let this set be . Next, for , we define the following set of points:

As with the single-transformation case, we can now construct an abstraction of the convex hull induced by as a set of intervals and propagate it through the network.

Example 6. We illustrate the process of abstracting an input string cc on the right of Figure 2 for the specification , where maps one character to its preceding character in the alphabet order, e.g., c with b, and maps one character to its succeeding character in the alphabet order, e.g., c with d. We enumerate all the points in and compute their convex hull, shown as the inner shaded region. This region includes all strings resulting from exclusively one application of or . Next, we dilate the convex hull by 2 = 1 + 1 times to include all the points in the perturbation space of . This is shown as the larger shaded region. Notice how this region includes bd and db, which result from an application of to the first character and to the second character of the original string cc.

The following theorem states that this process is sound: produces an overapproximation of a perturbation space S(x).

Theorem 1. For every specification S and input x, the abstracted perturbation space abstract(S, x) is an over-approximation of

We prove Theorem 1 in the Appendix.

5. Experiments

In this section, we evaluate A3T by answering the following research questions:

• RQ1: Does A3T improve robustness in rich perturbation spaces for character-level and word-level models?

• RQ2: How does the complexity of the perturbation space affect the effectiveness of A3T?

5.1. Experimental Setup

5.1.1. DATASETS AND MODELS

We use two datasets:

• AG News (Zhang et al., 2015) dataset consists of a corpus of news articles collected by Gulli (2005) about the four largest news topics. We used the online-available dataset2, which contains 30,000 training and 1,900 test-

ing examples for each class. We split the first 4,000 training examples for validation.

• SST2 (Socher et al., 2013) is the Stanford Sentiment Treebank dataset that consists of sentences from movie reviews and human annotations of their sentiment. The task is to predict the sentiment (positive/negative) of a given sentence. We used the dataset provided by TensorFlow3. The dataset contains 67,349 training, 872 validation, and 1,821 testing examples for each class.

For the AG dataset, we trained a smaller character-level model than the one used in Huang et al. (2019) but kept the number of layers and the data preprocessing the same. For the SST2 dataset, we trained a word-level model and a character-level model. We used the same models in Huang et al. (2019), also following their setup. We provide the details of the setups in the Appendix.

5.1.2. PERTURBATIONS

Our choice of models allows us to experiment with both character-level and word-level perturbations. We evaluated A3T on six perturbation spaces constructed using the seven individual string transformations in Table 1.

For the character-level model on dataset AG, we used the following specifications: , , and . For example, the first specification mimics the combination of two spelling mistakes: swap two characters up to twice and/or substitute a character with an adjacent one on the keyboard up to twice.

For the word-level model on dataset SST2, we used the following specifications: , , and . For example, the first specification removes stop words up to twice and substitutes up to twice words with synonyms.

For the character-level model on dataset SST2, we used the following specifications: , , and . For example, the first specification mimics the combination of two spelling mistakes: swap two characters and/or substitute a character with an adjacent one on the keyboard.

For the character-level model on AG dataset, we considered the perturbations to be applied to a prefix of an input string, namely, a prefix length of 35 for , a prefix length of 30 for , and . For the character-level model on SST2 dataset, we considered

Table 1. String transformations to construct the perturbation spaces for evaluation.

Table 2. Qualitative examples. The vanilla models correctly classify the original samples but fail to classify the perturbed samples.

perturbations with but allow the perturbations to be applied to the whole input string. We made these restrictions because one cannot efficiently evaluate the exhaustive accuracy with larger , due to the combinatorial explosion of the size of the perturbation space.

5.1.3. TRAINING METHODS

We implement and compare the following training methods.

Normal training is the vanilla training method (Eq. (1)) that minimizes the cross entropy between predictions and target labels. This method does not use the perturbation space and does not attempt to train a robust model.

Random augmentation performs adversarial training (Eq. (2)) using a weak adversary that simply picks a random perturbed sample from the perturbation space.

HotFlip augmentation performs adversarial training (Eq. (2)) using the HotFlip (Ebrahimi et al., 2018) attack to solve the inner maximization problem.

A3T(HotFlip) uses HotFlip to find the worst-case samples for augmentation, while (2) A3T(search) performs an explicit search through the perturbation space to find the worst-case samples for augmentation. Finally, we used DiffAI (Mirman et al., 2018) to perform abstract training for the transformations in , using the intervals abstraction.

In all augmentation training baselines, and A3T, we also adopt a curriculum-based training method (Huang et al., 2019; Gowal et al., 2019) which uses a hyperparameter weigh between normal loss and maximization objective in Eq. (2).

5.1.4. EVALUATION METRICS

Normal accuracy is the vanilla accuracy of the model on the test set.

HotFlip accuracy is the adversarial accuracy of the model with respect to the HotFlip attack, i.e., for each point in the test set, we apply the HotFlip attack and test if the classification is still correct.

Exhaustive accuracy (Eq 3) is the worst-case accuracy of the model: a prediction on (x, y) is considered correct if and only if all points lead to the correct prediction.

By definition, HotFlip accuracy is an upper bound on exhaustive accuracy.

Table 3. Experiment results for the three perturbations on the character-level model on AG dataset. We show the normal accuracy (Acc.), HotFlip accuracy (HF Acc.), and exhaustive accuracy (Exhaustive) of five different training methods.

Table 4. Experiment results for the three perturbations on the word-level model on SST dataset.

5.2. Evaluation Results

RQ1: Increase in robustness We show the results for the selected perturbation spaces on character-level and word-level models in Tables 3, 4, and 5, respectively.

Compared to normal training, the results show that both A3T(HotFlip) and A3T(search) increase the exhaustive accuracy and can improve the robustness of the model. A3T(HotFlip) and A3T(search) also outperform random augmentation and HotFlip augmentation. In particular, A3T(search) has exhaustive accuracy that is on average 20.3 higher than normal training, 14.6 higher than random augmentation, and 6.7 higher than HotFlip augmentation.

We also compared A3T to training using only abstraction (i.e., all transformations in S are also in ) for the specification on AG dataset and on SST2 dataset (not shown in Tables 3, 4, and 5); this is the only specification that can be fully trained abstractly since it only uses length-preserving transformations. Training using only abstraction yields an exhaustive accuracy of on AG dataset, which is similar to the exhaustive accuracy of A3T(HotFlip) (86.4) and A3T(search) (86.8). However, training using only abstraction yields an exhaustive accuracy of 47.0 for on SST2 dataset, which is better than the one obtained using normal training, but much lower than the exhaustive accuracy of A3T(HotFlip) and A3T(search). Furthermore, the normal accuracy of the abstraction technique on SST2 dataset drops to 58.8 due to the over-approximation of the perturbation space while A3T(HotFlip) (73.6) and A3T(search) (70.2) retain high normal accuracy.

To answer RQ1, A3T yields models that are more robust to complex perturbation spaces than those produced by augmentation and abstraction techniques. This result holds for both character-level and word-level models.

RQ2: Effects of size of the perturbation space In this section, we evaluate whether A3T can produce models that are robust to complex perturbation spaces. We fix the word-level model A3T (search) trained on . Then, we test this model’s exhaustive accuracy on (Figure 3(a)) and (Figure 3(b)), where we vary the parameters and between 1 and 4, increasing the size of the perturbation space. (The Appendix contains a more detailed evaluation with different types of transformations.) We only consider word-level models because computing the exhaustive accuracy requires us to enumerate all the elements in the perturbation space. While enumeration is feasible for word-level transformations (e.g., the perturbation space of for a string with 56 tokens contains at most 68,002 perturbed samples), enumeration is infeasible for character level transformations (e.g., the perturbation space of prefix of length 30 of string with 300 characters contains 7,499,469 perturbed samples, and the perturbation space of for a string with 300 characters

Table 5. Experiment results for the three perturbations on the character-level model on SST2 dataset.

Figure 3. The exhaustive accuracy of , varying the parameters (left) and (right) between 1 and 4.

contains 20,252,321,116 perturbed samples!).

The exhaustive accuracy of A3T(HotFlip) and A3T(search) decreases by 1.6% and 1.1%, respectively, when increasing from 1 to 4, and decreases by 2.1% and 1.4%, respectively, when increasing from 1 to 4. All other techniques result in larger decreases in exhaustive accuracy (in and in

To answer RQ2, even in the presence of large perturbation spaces A3T yields models that are more robust than those produced by augmentation techniques.

6. Conclusion, Limitations, and Future Work

We presented an adversarial training technique, A3T, combining augmentation and abstraction techniques to achieve robustness against programmable string transformations in neural networks for NLP tasks. In the experiments, we showed that A3T yields more robust models than augmentation and abstraction techniques.

We foresee many future improvements to A3T. First, A3T cannot currently generalize to RNNs because its abstraction technique can only be applied to models where the first layer is an affine transformation (e.g., linear or convolutional layer). Applying A3T to RNNs will require designing new abstraction techniques for RNNs. Second, we manually split S into and . Performing the split automatically is left as future work. Third, A3T(search) achieves the best performance by looking for the worst-case perturbed sample in the perturbation space of enumeration. In some practical settings, might induce a large perturbation space, and it might be best to use A3T(HotFlip) instead. Fourth, we choose HotFlip and interval abstraction to approximate the worst-case loss in our experiments, but our approach is general and can benefit from new augmentation and abstraction techniques.

Acknowledgements

We would like to thank Samuel Drews for his feedback, as well as the reviewers. This work is supported by the National Science Foundation grants CCF-1704117, CCF-1918211, and Facebook research award Probability and Programming.

References

Balunovic, M. and Vechev, M. T. Adversarial training and provable defenses: Bridging the gap. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020, 2020. URL https://openreview.net/forum ?id=SJxSDxrKDr.

Carlini, N. and Wagner, D. A. Adversarial examples are not easily detected: Bypassing ten detection methods. In Proceedings of the 10th ACM Workshop on Artifi-cial Intelligence and Security, AISec@CCS 2017, Dallas, TX, USA, November 3, 2017, pp. 3–14, 2017. doi: 10.1145/3128572.3140444. URL https://doi.or g/10.1145/3128572.3140444.

Cousot, P. and Halbwachs, N. Automatic discovery of linear restraints among variables of a program. In Proceedings of the 5th ACM SIGACT-SIGPLAN symposium on Principles of programming languages, pp. 84–96, 1978.

Ebrahimi, J., Rao, A., Lowd, D., and Dou, D. Hot-flip: White-box adversarial examples for text classifi-cation. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 2: Short Papers, pp. 31–36, 2018. doi: 10.18653/v1/ P18-2006. URL https://www.aclweb.org/ant hology/P18-2006/.

Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/ 1412.6572.

Gowal, S., Dvijotham, K., Stanforth, R., Bunel, R., Qin, C., Uesato, J., Arandjelovic, R., Mann, T. A., and Kohli, P. Scalable verified training for provably robust image classification. In 2019 IEEE/CVF International Conference on Computer Vision, ICCV 2019, Seoul, Korea (South), October 27 - November 2, 2019, pp. 4841–4850, 2019. doi: 10.1109/ICCV.2019.00494. URL https: //doi.org/10.1109/ICCV.2019.00494.

Gulli, A. The anatomy of a news search engine. In Proceedings of the 14th international conference on World

Wide Web, WWW 2005, Chiba, Japan, May 10-14, 2005 - Special interest tracks and posters, pp. 880–881, 2005. doi: 10.1145/1062745.1062778.

Huang, P., Stanforth, R., Welbl, J., Dyer, C., Yogatama, D., Gowal, S., Dvijotham, K., and Kohli, P. Achieving verified robustness to symbol substitutions via interval bound propagation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLP-IJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp. 4081– 4091, 2019. doi: 10.18653/v1/D19-1419. URL https: //doi.org/10.18653/v1/D19-1419.

Iyyer, M., Wieting, J., Gimpel, K., and Zettlemoyer, L. Ad- versarial example generation with syntactically controlled paraphrase networks. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2018, New Orleans, Louisiana, USA, June 1-6, 2018, Volume 1 (Long Papers), pp. 1875– 1885, 2018. URL https://www.aclweb.org/a nthology/N18-1170/.

Jia, R., Raghunathan, A., G¨oksel, K., and Liang, P. Certified robustness to adversarial word substitutions. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, EMNLPIJCNLP 2019, Hong Kong, China, November 3-7, 2019, pp. 4127–4140, 2019. doi: 10.18653/v1/D19-1423. URL https://doi.org/10.18653/v1/D19-1423.

Karpukhin, V., Levy, O., Eisenstein, J., and Ghazvinine- jad, M. Training on synthetic noise improves robustness to natural noise in machine translation. In Proceedings of the 5th Workshop on Noisy User-generated Text, W-NUT@EMNLP 2019, Hong Kong, China, November 4, 2019, pp. 42–47, 2019. doi: 10.18653/v1/ D19-5506. URL https://doi.org/10.18653/ v1/D19-5506.

Kingma, D. P. and Ba, J. Adam: A method for stochastic op- timization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980.

Ko, C., Lyu, Z., Weng, L., Daniel, L., Wong, N., and Lin, D. POPQORN: quantifying robustness of recurrent neural networks. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pp. 3468–3477, 2019. URL http://proceedings.mlr.press/ v97/ko19a.html.

Liang, B., Li, H., Su, M., Bian, P., Li, X., and Shi, W. Deep text classification can be fooled. In Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, IJCAI 2018, July 13-19, 2018, Stockholm, Sweden, pp. 4208–4215, 2018. doi: 10.24963/ijcai.2018/585. URL https://doi.org/ 10.24963/ijcai.2018/585.

Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. Towards deep learning models resistant to adversarial attacks. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net /forum?id=rJzIBfZAb.

Michel, P., Li, X., Neubig, G., and Pino, J. M. On evaluation of adversarial perturbations for sequence-to-sequence models. CoRR, abs/1903.06620, 2019. URL http://arxiv.org/abs/1903.06620.

Mirman, M., Gehr, T., and Vechev, M. Differentiable abstract interpretation for provably robust neural networks. In International Conference on Machine Learning (ICML), 2018. URL https://www.icml.cc/Conf erences/2018/Schedule?showEvent=2477.

Mirman, M., Singh, G., and Vechev, M. T. A provable defense for deep residual networks. CoRR, abs/1903.12519, 2019. URL http://arxiv.org/ abs/1903.12519.

Pavlick, E., Rastogi, P., Ganitkevitch, J., Van Durme, B., and Callison-Burch, C. PPDB 2.0: Better paraphrase ranking, fine-grained entailment relations, word embeddings, and style classification. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pp. 425–430, Beijing, China, July 2015. Association for Computational Linguistics. doi: 10.3115/v1/P15-2070. URL https://www.aclweb .org/anthology/P15-2070.

Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Empirical Methods in Natural Language Processing (EMNLP), pp. 1532–1543, 2014. URL http://www.aclweb.org /anthology/D14-1162.

Pruthi, D., Dhingra, B., and Lipton, Z. C. Combating ad- versarial misspellings with robust word recognition. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 5582–5591, 2019. doi: 10.18653/v1/p19-1561. URL https://doi.org/10.18653/v1/p19-1561.

Ribeiro, M. T., Singh, S., and Guestrin, C. Semantically equivalent adversarial rules for debugging NLP models. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, ACL 2018, Melbourne, Australia, July 15-20, 2018, Volume 1: Long Papers, pp. 856–865, 2018. doi: 10.18653/v1/ P18-1079. URL https://www.aclweb.org/ant hology/P18-1079/.

Sakaguchi, K., Duh, K., Post, M., and Durme, B. V. Robsut wrod reocginiton via semi-character recurrent neural network. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA, pp. 3281–3287, 2017. URL http://aaai.org/ocs/index.php /AAAI/AAAI17/paper/view/14332.

Shi, Z., Zhang, H., Hsieh, C.-J., Chang, K.-W., and Huang, M. Robustness verification for transformers. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum ?id=BJxwPJHFwS.

Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL, pp. 1631–1642, 2013. URL https: //www.aclweb.org/anthology/D13-1170/.

Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013.

Wang, B., Pei, H., Liu, H., and Li, B. Advcodec: Towards A unified framework for adversarial text generation. CoRR, abs/1912.10375, 2019. URL http://arxiv.org/ abs/1912.10375.

Welbl, J., Huang, P.-S., Stanforth, R., Gowal, S., Dvi- jotham, K. D., Szummer, M., and Kohli, P. Towards verified robustness under text deletion interventions. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum ?id=SyxhVkrYvr.

Zhang, H., Zhou, H., Miao, N., and Li, L. Generating fluent adversarial examples for natural languages. In Proceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Papers, pp. 5564–5569, 2019a. URL https://www.aclweb.o rg/anthology/P19-1559/.

Zhang, W. E., Sheng, Q. Z., Alhazmi, A., and LI, C. Adver- sarial attacks on deep learning models in natural language processing: A survey. arXiv preprint arXiv:1901.06796, 2019b.

Zhang, X., Zhao, J. J., and LeCun, Y. Character-level convo- lutional networks for text classification. In Advances in Neural Information Processing Systems 28: Annual Conference on Neural Information Processing Systems 2015, December 7-12, 2015, Montreal, Quebec, Canada, pp. 649–657, 2015. URL http://papers.nips.cc/p aper/5782-character-level-convolution al-networks-for-text-classification.

Zhao, Z., Dua, D., and Singh, S. Generating natural ad- versarial examples. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, 2018. URL https://openreview.net /forum?id=H1BLjgZCb.

A. Appendix

A.1. Semantics of specifications

We define the semantics of a specification S = (such that ) as follows. Given a string , a string y is in the perturbations space S(x) if:

1. there exists matches assume that matches are sorted in ascending order of ) such that for every we have that is a valid match of

2. the matches are not overlapping: for every two distinct

3. the matches respect the constraints: for every

4. the string y is the result of applying an appropriate transformation to each match: if for every we have

A.2. Proof of Theorem 1

We give the following definition of a convex set:

Definition 1. Convex set: A set C is convex if, for all x and y in C, the line segment connecting x and y is included in C.

Proof. We first state and prove the following lemma.

Lemma 2. Given a set of points and a convex set C such that . These points define a set of vectors . If a vector can be represented as a sum weighed by

where respect to constraints:

then the point p is also in the convex set C.

Proof. We prove this lemma by induction on t,

• Base case: t = 1, if and , then p is on the segment . By the definition of the convex set (Definition 1), the segment is inside the convex, which implies p is inside the convex:

• Inductive step: Suppose the lemma holds for t = r. If a vector can be represented as a sum weighed by

This inequality enables the inductive hypothesis, and we know point is in the convex set C. From Eq 11, we know that the point p is on the segment of , since both two points are in the convex set C, then the point p is also inside the convex set C.

To prove Theorem 1, we need to show that every perturbed sample lies inside the convex hull of abstract(S, x).

We first describe the perturbed sample y. The perturbed sample y as a string is defined in the semantics of specifi-cation S (see the Appendix A.1). In the rest of this proof, we use a function mapping from a string with length m to a point in -dimensional space, e.g., E(y) represents the point of the perturbed sample y in the embedding space. We use to represent the string perturbed by a transformation is a valid match of

We further define as the vector

A perturbed sample y defined by matches and for every we have

The matches respect the constraints: for every , . Thus, the size of the matches k also respect the constraints:

In the embedding space,

Thus, we can represent

We then describe the convex hull of abstract(S, x). The convex hull of abstract(S, x) is constructed by a set of points , where points are computed by:

Alternatively, using the definition of

We then prove the Theorem 1. To prove E(y) lies in the convex hull of abstract(S, x), we need to apply Lemma 2. Notice that a convex hull by definition is also a convex set. Because from Eq 14, we have

We can use Eq 15 into the above equation, and have

To apply Lemma 2, we set

Using Inequality 13 on

we get

The constraints in Inequality 16 and Inequality 17 enable Lemma 2, and by applying Lemma 2, we know that point E(y) is inside the convex hull of abstract(S, x).

A.3. Details of Experiment Setup

For AG dataset, we trained a smaller character-level model than the one used in Huang et al. (2019). We followed the setup of the previous work: use lower-case letters only and truncate the inputs to have at most 300 characters. The model consists of an embedding layer of dimension 64, a 1-D convolution layer with 64 kernels of size 10, a ReLU layer, a 1-D average pooling layer of size 10, and two fullyconnected layers with ReLUs of size 64, and a linear layer. We randomly initialized the character embedding and updated it during training.

For SST2 dataset, we trained the same word-level model as the one used in Huang et al. (2019). The model consists of an embedding layer of dimension 300, a 1-D convolution layer with 100 kernels of size 5, a ReLU layer, a 1-D average pooling layer of size 5, and a linear layer. We used the pretrained Glove embedding (Pennington et al., 2014) with dimension 300 and fixed it during training.

For SST2 dataset, we trained the same character-level model as the one used in Huang et al. (2019). The model consists of an embedding layer of dimension 150, a 1-D convolution layer with 100 kernels of size 5, a ReLU layer, a 1-D average pooling layer of size 5, and a linear layer. We randomly initialized the character embedding and updated it during training.

For all models, we used Adam (Kingma & Ba, 2015) with a learning rate of 0.001 for optimization and applied early stopping policy with patience 5.

We provide the details of the string transformations we used:

• : We allow each character substituting to one of its adjacent characters on the QWERTY keyboard.

• : We choose {and, the, a, to, of} as our stop words set.

• : We use the synonyms provided by PPDB (Pavlick et al., 2015). We allow each word substituting to its closest synonym when their part-of-speech tags are also matched.

A.3.2. BASELINE

Random augmentation performs adversarial training using a weak adversary that simply picks a random perturbed sample from the perturbation space. For a specification , we produce z by uniformly sampling one string from a string transformation and passing it to the next transformation , where we then sample a new string , and so on until we have exhausted all transformations. The objective function is the following:

HotFlip augmentation performs adversarial training using the HotFlip (Ebrahimi et al., 2018) attack to find z and solve the inner maximization problem. The objective function is the same as Eq 18.

A3T adopts a curriculum-based training method (Huang et al., 2019; Gowal et al., 2019) that uses a hyperparameter to weigh between normal loss and maximization objective in Eq. (2). We linearly increase the hyperparameter training.

Also, we set k in augmentto 2, which means we select 2 perturbed samples to abstract.

RQ2: Effects of size of the perturbation space In Figure 4, we fix the word-level model A3T (search) trained on . Then, we test this model’s exhaustive accuracy on and (Figure 4(b)), where we vary the parameters and between 1 and 4, increasing the size of the perturbation space. The exhaustive accuracy of A3T(HotFlip) and A3T(search) decreases by 17.4% and 11.4%, respectively, when increasing creases by 2.3% and 1.9%, respectively, when increasing from 1 to 4. All other techniques result in larger decreases in exhaustive accuracy (in and

In Figure 5, we fix the word-level model A3T (search) trained on . Then, we test this model’s exhaustive accuracy on (Figure 5(a)), (Figure 5(b)), and (Figure 5(c)), where we vary the parameters and between 1 and 3, increasing the size of the perturbation space. The exhaustive accuracy of A3T(HotFlip) and A3T(search) decreases by 1.1% and 0.9%, respectively, when increasing from 1 to 3, decreases by 12.9% and 6.9%, respectively, when increasing , and decreases by 1.4% and 0.9%, respectively, when increasing from 1 to 3. All other techniques result in larger decreases in exhaustive accuracy (in in , and in

Figure 4. The exhaustive accuracy of , varying the parameters (left) and (right) between 1 and 4.

Figure 5. The exhaustive accuracy of , varying the parameters (middle), and between 1 and 3.