Stance Detection (StD) is a well-established task in natural language processing and is often described as having two inputs: (1) the topic of a discussion and (2) a comment made by an author. Given these two inputs, the aim is to find out whether the author is in favor of or against the topic. For instance, in SemEval-2016 Task 6 (Mohammad et al., 2016), the second input is a short tweet and the goal is to detect whether the author has made a positive or negative comment towards a given controversial topic.
The task has a long tradition in the domain of political and ideological online debates (Mohammad et al., 2016; Walker et al., 2012a; Somasundaran and Wiebe, 2010; Thomas et al., 2006). In recent years, it has gained attention through the growing debates around fake news, where StD is an important pre-processing step (Pomerleau and Rao, 2017; Derczynski et al., 2017; Ferreira and Vlachos, 2016), as well as through other downstream tasks like argument search (Stab et al., 2018) and claim validation (Popat et al., 2017). As such, high performance in StD is a crucial step in successfully leveraging machine learning (ML) for argumentative information retrieval and fake news detection.
However, while humans are quite capable of assessing stances correctly, ML models often fall short on this task (see Table 1). As there are numerous domains to which StD can be applied, definitions of this task vary considerably. For instance, the first input can be a short topic, a claim, or sometimes is not given at all, while the second input can be another claim, a piece of evidence, or even a full argument. Further, the second input can range in length from a sentence or a short paragraph to whole documents. The number of classes can also vary between 2-class problems (e.g. for/against) and more fine-grained 4-class problems (e.g. comment/support/query/deny). Moreover, the number of samples varies drastically between datasets (for our setup: from 2,394 to 75,385). While these differences are problematic for cross-domain performance, they can also be seen as an advantage, as they result in an abundance of datasets from different domains that can be integrated into transfer or multi-task learning approaches. Yet, given the decent human performance on this task, it is hard to grasp why ML models fall short on StD, while they are almost on par with humans at related tasks like Sentiment Analysis and Natural Language Inference (NLI).
Table 1: Inter-annotator agreement (IAA) vs. state-of-the-art results. ARC/FNC-1 in F1 macro, PERSPECTRUM in F1 micro. *IAA in Hanselowski et al. (2018).
Within this work, we provide foundations for answering this question. We empirically assess whether the abundance of differently framed StD datasets from multiple domains can be leveraged by looking at them in a holistic way, i.e. training and evaluating them collectively in a multi-task fashion. However, as we only have one task but multiple datasets, we henceforth define it as multi-dataset learning (MDL). And indeed, our model profits significantly from datasets of the same task via MDL with +4 percentage points (pp) on average, as well as from related tasks via transfer learning (TL) with +3.4pp on average.
However, while we gain significant performance improvements for StD by using TL and MDL, the expected robustness of these approaches is missing. We show this using a modified version of the Resilience score by Thorne et al. (2019) which reveals that TL and MDL models are even less robust than single-dataset learning (SDL) models. We investigate this phenomenon through low resource experiments and observe that less training data leads to an improved robustness for the MDL models, narrowing down the gap to the SDL models. We thus assume that lower robustness stems from dataset biases introduced by the vast amount of available training data for the MDL models, leading to over-fitting. Consequently, adversarial attacks that target such biases have a more severe impact on models that had more biased training data and overfitted on these biases.
The contributions of this paper are as follows: (1) To the best of our knowledge, we are the first to combine learning from related tasks (via TL) and MDL, designed to capture all facets of StD tasks, and achieve new state-of-the-art results on five of ten datasets. (2) In an in-depth analysis with adversarial attacks, we show that TL and MDL for StD generally improve the performance of ML models, but also drastically reduce their robustness compared to SDL models. (3) To foster the analysis of this task, we publish the full benchmark system including model training and evaluation, as well as the means to add and evaluate adversarial attack sets and low resource experiments. All datasets, the fine-tuned models, and the machine translation models can be automatically downloaded and preprocessed for consistent future usage.
Stance Detection is a well-established task in natural language processing. Initial work focused on parliamentary debates (Thomas et al., 2006) and debating portals (Somasundaran and Wiebe, 2010), whereas more recent work has shifted to the domain of Social Media, where several shared tasks have been introduced (Gorrell et al., 2019; Derczynski et al., 2017; Mohammad et al., 2016). With the shift in domains, the definition of the task also shifted: more classes were added (e.g. query (Gorrell et al., 2019) or unrelated (Pomerleau and Rao, 2017)), the number of inputs changed (e.g. multiple topics for each sample (Sobhani et al., 2017)), as did the definition of the inputs themselves (e.g. from parliamentary speeches and debate portal posts to tweets (Gorrell et al., 2019), news articles (Pomerleau and Rao, 2017), or argument components (Stab et al., 2018; Bar-Haim et al., 2017)). In past years, StD has become a cornerstone for many downstream tasks like fake news detection (Pomerleau and Rao, 2017), claim validation (Popat et al., 2017), and argument search (Stab et al., 2018). Yet, recent work mainly focuses on individual datasets and domains. We, in contrast, concentrate on a higher level of abstraction by aggregating datasets of different domains and definitions to analyze them in a holistic way. To do so, we leverage the idea of TL and multi-task learning (in the form of MDL), as they have not only shown increases in performance and robustness (Ruder, 2017; Weiss et al., 2016), but also significant support in low resource scenarios (Schulz et al., 2018). Recent frameworks for multi-task learning include the one by Liu et al. (2019), which set a new state of the art on the GLUE Benchmark (Wang et al., 2018a). In contrast to their work, we use the framework for MDL, i.e. combining only datasets of the same task to analyze whether StD datasets can benefit from each other by transferring knowledge about their domains. Furthermore, we probe the robustness of the learned models to analyze whether performance increases gained through TL and MDL are in accordance with increased robustness for StD.
Table 2: All datasets, grouped by domain and with examples. Topics in parentheses signal implicit information.
Adversarial attacks are test sets designed to uncover possible weak points of ML models. While much recent work on adversarial attacks aims to break NLI systems and is specifically adapted to this problem (Glockner et al., 2018; Minervini and Riedel, 2018), these stress tests have been applied to a wide range of tasks from Question Answering (Wang and Bansal, 2018) to Neural Machine Translation (Belinkov and Bisk, 2017) and Fact Checking (Thorne et al., 2019). Unfortunately, preserving the semantics of a sentence while automatically generating these adversarial attacks is difficult, which is why some works have defined small stress tests manually (Isabelle et al., 2017; Mahler et al., 2017). As this is costly in both time and money, other work has defined heuristics with controllable outcome to modify existing datasets while preserving the semantics of the data (Naik et al., 2018). In contrast to previous work, we use and analyze some of these attacks for the task of StD to probe the robustness of our SDL and MDL models.
We describe the datasets and models we use for the benchmark, the experimental setting, and the results of our experiments. For all experiments, we use and adapt the framework provided by Liu et al. (2019).
3.1 Datasets
We choose ten StD datasets from five different domains to represent a rich environment of different facets of StD. Datasets within one domain may still vary by their number of classes and sample sizes. All datasets are shown with an example and their domain in Table 2. In addition, Table 3 displays the split sizes and the class distributions of each dataset. All code to preprocess and split the datasets is available online. In the following, all datasets are introduced.
arc We take the version of the Argument Reasoning Corpus (Habernal et al., 2018) that was modified for StD by Hanselowski et al. (2018). A sample consists of a claim crafted by a crowdworker and a user post from a debating forum.
argmin The UKP Sentential Argument Mining Corpus (Stab et al., 2018) originally contains topic-sentence pairs labelled with argument for, argument against, and no argument. We remove all non-arguments and simplify the original split: we train on the data of five topics, develop on the data of one topic, and test on the data of two topics.
fnc1 The Fake News Challenge dataset (Pomerleau and Rao, 2017) contains headline-article pairs from news websites. We take the original data without modifying it.
iac1 The Internet Argument Corpus V1 (Walker et al., 2012b) contains topic-post pairs from political debates on internet forums. We generate a new split without intersection of topics between train, development, and test set.
ibmcs The IBM Debater® - Claim Stance Dataset (Bar-Haim et al., 2017) contains topic-claim pairs. The topics are gathered from a debating database, the claims were manually collected from Wikipedia articles. We take the pre-defined train and test split and split an additional 10% off the train set for development.
perspectrum The PERSPECTRUM dataset (Chen et al., 2019) contains pairs of claims and related perspectives, which were gathered from debating websites. We only take the data they defined for the StD task in their work and keep the exact split.
scd The Stance Classification Dataset (Hasan and Ng, 2013) contains posts about four topics from an online debate forum with all posts being self-labelled by the post’s author. The topics are not part of the actual dataset and have to be inferred from explicit or implicit mentions within a post. We generate a new data split by using the data of two topics for training, the data of one topic for development, and the data of the leftover topic for testing.
semeval2016t6 The SemEval-2016 Task 6 dataset (Mohammad et al., 2016) contains topic-tweet pairs, where topics are controversial subjects like politicians, Feminism, or Atheism. We adopt the same split as used in the challenge, but add some of the training data to the development split, as it originally only contained 100 samples.
semeval2019t7 The SemEval-2019 Task 7 dataset (Gorrell et al., 2019) contains rumours from reddit posts and tweets towards a variety of incidents like the Ferguson Unrest or the Germanwings crash. Similar to the scd dataset, the topics are not part of the actual dataset.
snopes The Snopes corpus (Hanselowski et al., 2019) contains data from a fact-checking website documenting (amongst others) rumours, evidence texts gathered by fact-checkers, and the documents from which the evidence originates. Besides labels for automatic fact-checking of the rumours, the corpus also contains stance annotations towards the rumours for some evidence sentences. We extract these pairs and generate a new data split.
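Several of the splits described above hold out whole topics (argmin, iac1, scd). The following is a minimal sketch of such a topic-disjoint split. It assumes that each sample is a dict with a `topic` key; the function name `split_by_topic` and the data layout are hypothetical and are not taken from the released preprocessing code.

```python
from collections import defaultdict


def split_by_topic(samples, train_topics, dev_topics, test_topics):
    """Assign each sample to a split according to its topic.

    Raises ValueError if the topic groups overlap, so that no topic
    can leak between train, development, and test data.
    Samples whose topic is in none of the groups are dropped.
    """
    groups = {
        "train": set(train_topics),
        "dev": set(dev_topics),
        "test": set(test_topics),
    }
    names = list(groups)
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            if groups[first] & groups[second]:
                raise ValueError(f"topics shared by {first} and {second}")

    splits = defaultdict(list)
    for sample in samples:
        for name, topics in groups.items():
            if sample["topic"] in topics:
                splits[name].append(sample)
                break  # groups are disjoint, so one match is enough
    return {name: splits[name] for name in names}
```

Checking for overlapping topic groups up front keeps the cross-topic property explicit instead of relying on the caller.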
3.2 Models
We experiment on all datasets in an SDL setup, i.e. training and testing on each dataset individually, and in an MDL setup, i.e. training on all ten StD datasets jointly. For this, we use the framework by Liu et al. (2019), as it provides the means to do both SDL and MDL. The SDL model is based on the BERT architecture (Devlin et al., 2018) and simply adds a dense layer on top for the classification. The MDL model is also based on the BERT architecture, but each dataset has its own dataset-specific dense layer on top. While the layers of the BERT architecture are shared, the dataset-specific layers are updated for each dataset individually at training time. All datasets are batched and fed through the architecture in a random order. As initial weights for SDL and MDL, we use either the pre-trained BERT (large, uncased) weights by Devlin et al. (2018) or the MT-DNN (large, uncased) weights by Liu et al. (2019). The latter starts from the BERT weights and is fine-tuned on all datasets of the GLUE Benchmark (Wang et al., 2018a). By using the MT-DNN, we transfer knowledge from all datasets of the GLUE Benchmark to our models, i.e. apply TL in the form of pre-training. Henceforth, we use SDL and MDL to denote the model architecture, and BERT and MT-DNN to denote the pre-trained weights of the model architecture. This leaves us with four combinations of models: BERT_SDL, BERT_MDL, MT-DNN_SDL, and MT-DNN_MDL (see Figure 1).
Figure 1: Models and their relation. Arrows symbolize training, their labels state the used training data.
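The random-order batching of the MDL setup can be illustrated with a short sketch. It assumes that every batch contains samples of a single dataset, so that the batch can be routed through the shared layers and then through that dataset's own dense layer; the function `mixed_batches` and its arguments are hypothetical and do not reproduce the framework by Liu et al. (2019).

```python
import random


def mixed_batches(datasets, batch_size, seed):
    """Return a list of (dataset_name, batch) pairs in random order.

    `datasets` maps a dataset name to its list of training samples.
    Each batch holds samples from one dataset only.
    """
    rng = random.Random(seed)
    batches = []
    for name, samples in datasets.items():
        shuffled = list(samples)
        rng.shuffle(shuffled)  # shuffle within the dataset
        for start in range(0, len(shuffled), batch_size):
            batches.append((name, shuffled[start:start + batch_size]))
    rng.shuffle(batches)  # interleave the batches of all datasets
    return batches
```

In a training loop, the dataset name of each pair would select which dataset-specific output layer receives the gradient update.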
3.3 Experimental Setting
For all experiments in this section, we set the batch size to 16, the number of epochs to 5, and we truncate each input to 100 sub-words due to hardware limitations. Preliminary tests with the fnc1 dataset, which contains documents as one of the inputs, showed a minor drop in F1 macro of less than 2pp when reducing the sequence length from 300 to 100. To compensate for variations in the results, we train with five different fixed seeds and report the averaged results. We run all experiments on a Tesla P-100 with 16 GByte of memory. One epoch with all ten datasets takes around 1.5h. We use the splits for training, development, and testing as shown in Table 3. The table also lists the classes and class distribution for each dataset. We use F1 macro as a general metric, since the class balance for most datasets is skewed. The dataset training sizes vary from approx. 42,500 to as low as 935 samples.
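To make the choice of metric concrete, the sketch below computes a macro-averaged F1 score in plain Python. The function name `f1_macro` and the convention of averaging over all labels seen in either the gold or the predicted list are assumptions of this sketch, not details stated in the paper.

```python
def f1_macro(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    pairs = list(zip(gold, predicted))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            scores.append(2 * precision * recall / (precision + recall))
        else:
            scores.append(0.0)  # class was never predicted correctly
    return sum(scores) / len(scores)
```

For a skewed dataset, a classifier that always predicts the majority class can reach a high accuracy, while its macro score stays low because every class contributes equally to the mean.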
Table 3: Splits, classes, and class distributions for all used datasets.
3.4 Results
We report the results of all models and datasets in Table 4. The last column shows the averaged F1 macro for a row. We make three observations: (1) TL from related tasks improves the overall performance, (2) MDL with datasets from the same task shows an even larger positive impact, and (3) TL, followed by MDL, can further improve on the individual gains shown by (1) and (2).
We show (1) by comparing the models BERT_SDL and MT-DNN_SDL, where a gain of 3.4pp due to TL from the GLUE datasets can be observed. While some datasets show a drop in performance, the average performance increases. We show (2) by comparing BERT_SDL to BERT_MDL (+4pp) and MT-DNN_SDL to MT-DNN_MDL (+1.8pp). The former comparison indicates that learning from similar datasets (i.e. MDL) has a higher impact than TL for StD. The latter comparison leads to observation (3): combining TL from related tasks (+3.4pp) and MDL on the same task (+4pp) can result in considerable performance gains (+5.1pp). However, as the individual gains from TL and MDL do not add up, it also indicates an information overlap between the datasets of the GLUE benchmark and the StD datasets. Lastly, while BERT_SDL already outperforms five out of six state-of-the-art results, our BERT_MDL and MT-DNN_MDL are able to add significant performance increases on top.
As the robustness of an ML model is crucial if applied to other domains or in downstream applications, we analyze this feature in more detail. First, we define adversarial attacks to probe for weaknesses in the models. Second, we investigate the reason for detected weaknesses and a surprising anomaly in robustness between SDL and MDL models.
4.1 Adversarial Attacks: Definition
We investigate how robust the trained models are and whether TL from related tasks and MDL influence this property. Inspired by stress tests for NLI, we select three adversarial attacks to probe the robustness of the models and modify all samples of all test sets with the following configurations:
Paraphrase We paraphrase all samples of the test sets. For this, we lean on the work of Mallinson et al. (2017) and train two machine translation models with OpenNMT (Klein et al., 2017): one that translates English originals to German and another one that backtranslates.
Spelling Spelling errors are quite common, especially in data from social media or debating forums. We add two errors into each input of a sample (Naik et al., 2018): (1) we swap two letters of a random word and (2) for a different word, we substitute a letter for another letter close to it on the keyboard. We only consider words with at least four letters, as shorter ones are mostly stopwords.
Negation We use the negation stress test proposed by Naik et al. (2018). They add the tautology “and false is not true” after each sentence, as they suspect that models might be confused by strong negation words like “not”. We assume the same is also valid for StD. In contrast to them, we add the tautology at the beginning of each sentence, since we truncate all inputs to a maximum length of 100 sub-words and a trailing tautology could otherwise be cut off.
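The spelling and negation perturbations can be sketched in a few lines. This is an illustration under stated assumptions, not the released attack code: the neighbour map `KEYBOARD_NEIGHBOURS` is a hypothetical excerpt of a QWERTY layout, the letter swap is restricted to adjacent letters, and `negation_attack` prepends the tautology once to the whole input rather than to every sentence.

```python
import random

# Hypothetical excerpt of a QWERTY neighbour map (assumption of this sketch).
KEYBOARD_NEIGHBOURS = {
    "a": "qwsz", "e": "wsdr", "i": "ujko", "o": "iklp", "p": "ol",
    "r": "edft", "s": "awedxz", "t": "rfgy",
}


def swap_letters(word, rng):
    """Swap two adjacent letters at a random position."""
    i = rng.randrange(len(word) - 1)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def keyboard_typo(word, rng):
    """Replace one letter with a neighbouring key, if the map knows it."""
    positions = [i for i, ch in enumerate(word) if ch in KEYBOARD_NEIGHBOURS]
    if not positions:
        return word
    i = rng.choice(positions)
    return word[:i] + rng.choice(KEYBOARD_NEIGHBOURS[word[i]]) + word[i + 1:]


def spelling_attack(text, seed=0, min_length=4):
    """Apply one letter swap and one keyboard typo to two different words."""
    rng = random.Random(seed)
    tokens = text.split()
    candidates = [i for i, tok in enumerate(tokens)
                  if tok.isalpha() and len(tok) >= min_length]
    if len(candidates) < 2:
        return text  # not enough long words to perturb
    first, second = rng.sample(candidates, 2)
    tokens[first] = swap_letters(tokens[first], rng)
    tokens[second] = keyboard_typo(tokens[second], rng)
    return " ".join(tokens)


def negation_attack(text, tautology="false is not true and"):
    """Prepend the tautology so that truncation cannot cut it off."""
    return f"{tautology} {text}"
```

Note that swapping two identical adjacent letters leaves a word unchanged, so a production version would need to re-sample in that case.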
To measure the effectiveness of each adversarial attack a, we calculate the potency score introduced by Thorne et al. (2019) as the average reduction from a perfect score and across the systems s ∈ S:
$$\text{Potency}(a) = c_a \frac{1}{|S|} \sum_{s \in S} \left(1 - f(s, a)\right)$$
with $c_a$ being the ratio of correctly transformed samples (test to adversarial) and a function f that returns the performance score for a system s on an adversarial attack set a.
Table 4: Results of experiments on all datasets in F1 macro, with original paper metrics in parentheses: F1 micro, Accuracy (Acc), Fake News Challenge score (FNC1), and F1 macro without class none. TalosComb (Hanselowski et al., 2018); ESIM w/ GRU + Dropout (Jiang, 2019); Ranking-MLP (Zhang et al., 2018); Unigrams SVM (Bar-Haim et al., 2017); Popat et al., 2019); 2018); GPT-based (Yang et al., 2019).
The correct rate $c_a$ is calculated by taking 25 randomly selected samples from all test sets and comparing them to their adversarial counterparts. For the paraphrase attack, the first author checked whether the paraphrased and original sentences are semantically equal. We find that this is the case in 63% of the samples. This low result is mostly due to the three outlier datasets fnc1 (36%), snopes (36%), and arc (44%). Leaving out these three, 82% of the sentences are semantically correct paraphrases. As the changes through the spelling attack are minor and subjective to evaluate, we use the Flesch-Kincaid grade level (Kincaid et al., 1975) to compare the readability of the original and adversarial sentences and label a sample as incorrectly perturbed if the adversarial sentence requires a higher U.S. grade level. For the negation attack samples, we assume a correctness of 100% ($c_a = 1$), as the perturbation adds a tautology and the semantics and grammar are preserved.
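The readability check for the spelling attack can be sketched as follows. The grade formula is the standard Flesch-Kincaid grade level; the syllable counter is a rough vowel-group heuristic and the helper names are assumptions of this sketch, so the numbers will differ from those of a dictionary-based readability tool.

```python
import re


def count_syllables(word):
    """Rough heuristic: every run of vowels counts as one syllable."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def flesch_kincaid_grade(text):
    """Flesch-Kincaid grade level of an English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(word) for word in words)
    return (0.39 * len(words) / len(sentences)
            + 11.8 * syllables / len(words)
            - 15.59)


def is_harder_to_read(original, adversarial):
    """True if the adversarial text requires a higher U.S. grade level."""
    return flesch_kincaid_grade(adversarial) > flesch_kincaid_grade(original)
```

Under the labelling rule described above, a sample for which `is_harder_to_read` returns `True` would count as incorrectly perturbed.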
Table 5: Potency of all adversarial attacks.
Table 6: Influence of adversarial attacks, averaged over all datasets, on the BERT_SDL and MT-DNN_MDL models (in F1 macro and relative to the score on the test set).
4.2 Adversarial Attacks: Results and Discussion
We limit the compared systems to BERT_SDL and MT-DNN_MDL, as the latter uses both TL from related tasks and MDL, whereas the former uses neither. The potencies for all attack sets are shown in Table 5 and ranked by the raw potency, which assumes all adversarial samples to be correct (i.e. $c_a = 1$). The results on the adversarial attack sets for both the SDL and MDL model are shown in Table 6.
The paraphrasing attack has the lowest raw potency of all adversarial sets and the average scores only drop by about 2.8-4.7%. Interestingly, on the datasets that turned out to be difficult to paraphrase (fnc1, arc, snopes), the score of MT-DNN_MDL only drops by about 5.7%, 6.4%, and 6.5% (see Appendix, Table 9), which is not much below average. This supports the finding of Niven and Kao (2019) that the BERT architecture, despite contextualized word embeddings, primarily focuses on certain cue words, while the semantics of the whole sentence is not the main criterion.
With raw potencies of 41.1% and 43.3%, the negation and spelling attacks have the highest negative influence on both SDL and MDL (4.3% to 13.9% performance loss). We assume this to be another indicator that the models rely on certain key words and fail if the statistical occurrence of these words in the seen samples is changed. This is easy to see for the negation attack, as it adds a strong negation word. For the spelling attack, we look at the following original example from the perspectrum dataset:
And the same example as spelling attack:
Since all words of the original sample are in the vocabulary, Google’s sub-word implementation WordPiece (Wu et al., 2016) does not split the tokens into sub-words. However, this is different for the perturbed sentence, as, for instance, the tokens “esaier” and “oarents” are not in the vocabulary. Hence, we get [esa, ##ier] and [o, ##are, ##nts]. These pieces do not carry the same meaning as before the perturbation and the model has not learned to handle them.
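The splitting behaviour described above follows from the greedy longest-match-first strategy of WordPiece tokenization. The sketch below implements that strategy for a single token. The vocabulary `TOY_VOCAB` is a hypothetical toy set chosen so that the output matches the pieces quoted above; the real BERT vocabulary is far larger.

```python
def wordpiece(token, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of a single token into sub-words."""
    pieces = []
    start = 0
    while start < len(token):
        end = len(token)
        piece = None
        while start < end:
            candidate = token[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation marker
            if candidate in vocab:
                piece = candidate
                break
            end -= 1  # try a shorter prefix of the remainder
        if piece is None:
            return [unk]  # no vocabulary entry fits at this position
        pieces.append(piece)
        start = end
    return pieces


# Toy vocabulary (assumption): only the entries needed for the example.
TOY_VOCAB = {"easier", "parents", "esa", "##ier", "o", "##are", "##nts"}

print(wordpiece("easier", TOY_VOCAB))   # ['easier']
print(wordpiece("esaier", TOY_VOCAB))   # ['esa', '##ier']
print(wordpiece("oarents", TOY_VOCAB))  # ['o', '##are', '##nts']
```

A single transposed or substituted character is enough to turn one known token into several pieces whose embeddings are unrelated to the original word.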
However, the most surprising observation is the much higher relative drop in scores between the test and adversarial attack sets for MT-DNN_MDL as compared to BERT_SDL. MDL should produce more robust models and support them in handling at least some of these attacks, as some of the datasets originate from Social Media and debating forums, where typos and other errors are quite common. On top of that, the model sees many more samples and should be more robust to paraphrased sentences. Hence, to further evaluate the robustness of the two systems, we leverage the resilience measure introduced by Thorne et al. (2019):
$$\text{Resilience}(s) = \frac{\sum_{a \in A} c_a \times f(s, a)}{\sum_{a \in A} c_a}$$
It defines the robustness of a model s against all adversarial attacks a ∈ A, scaled by the correctness of the attack sets. Surprisingly, the resilience scores of the MDL (59.9%) and SDL (58.5%) models are almost on par. The score, however, only considers the absolute performance on the adversarial sets, but not the drop in performance when compared to the test set results. If, for instance, model A performs better than model B on the same test set, but has a higher drop in performance on the same adversarial set, model A should show a lower robustness and thus receive a lower resilience score. As the resilience score does not consider this, we adapt the equation by taking the performance on the test set t into account.
We calculate the adapted Resilience score for all adversarial attacks separately, as well as overall, and observe that the SDL model outperforms the MDL model in each case (see Table 7). For some datasets, the absolute F1 macro of the MDL model even drops below that of the SDL model (see Appendix, Table 9). Our experiments show that performance-wise, we can benefit from MDL, but there is a high risk of drastic loss in robustness, which can cancel out the performance gains or, even worse, render the model inferior in real-world scenarios.
Table 7: Adapted Resilience of BERT_SDL and MT-DNN_MDL.
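The scores discussed in this section can be sketched in code. `potency` and `resilience` follow the definitions by Thorne et al. (2019) as described in the text. The adapted equation is not reproduced in this text, so `relative_resilience` is only one possible form that is consistent with the description (it penalizes the gap between test and adversarial performance); it is an assumption of this sketch, and all numbers in the usage note are made up for illustration.

```python
def potency(correct_rate, scores):
    """Correct rate times the mean reduction from a perfect score of 1.0.

    `scores` holds f(s, a) for every system s on one attack set a.
    """
    return correct_rate * sum(1.0 - score for score in scores) / len(scores)


def resilience(correct_rates, attack_scores):
    """Correctness-weighted mean of one system's scores over all attacks.

    Both arguments map an attack name to c_a and f(s, a), respectively.
    """
    weighted = sum(correct_rates[a] * attack_scores[a] for a in attack_scores)
    return weighted / sum(correct_rates[a] for a in attack_scores)


def relative_resilience(correct_rates, attack_scores, test_score):
    """ASSUMED variant: uses 1 - |f(s, t) - f(s, a)| instead of f(s, a)."""
    weighted = sum(
        correct_rates[a] * (1.0 - abs(test_score - attack_scores[a]))
        for a in attack_scores
    )
    return weighted / sum(correct_rates[a] for a in attack_scores)
```

With a single attack of correct rate 1.0, a model that scores 0.80 on the test set and 0.60 on the attack set receives the same `resilience` (0.60) as a model that scores 0.65 and 0.60, whereas the assumed relative variant ranks the second model higher (0.95 vs. 0.80), which mirrors the model A and model B argument above.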
4.3 Analysis of Robustness via Low Resource Experiments
To investigate the reasons why the MDL model shows a lower robustness than the SDL models on average, we conduct low resource experiments by training the MDL model and the SDL models on 10, 30, and 70% of the available training data. Dev and test sets are kept at 100% of the available data at all times and results are averaged over five seeds.
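A minimal sketch of how such reduced training sets could be drawn is given below. The text does not state whether the subsets are sampled uniformly or per class, so uniform random sampling with a fixed seed is an assumption here, and the function name `subsample` is hypothetical.

```python
import random


def subsample(train_samples, ratio, seed):
    """Return a random subset holding `ratio` of the training samples."""
    if not 0.0 < ratio <= 1.0:
        raise ValueError("ratio must be in (0, 1]")
    rng = random.Random(seed)
    size = max(1, round(len(train_samples) * ratio))
    return rng.sample(list(train_samples), size)
```

Only the training split would be passed to this function; the development and test sets stay untouched, as described above.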
As is to be expected, the performance gap between BERT_SDL and MT-DNN_MDL on the test set grows with less training data (see Table 8). Here, MDL shows its strength in low resource setups (Schulz et al., 2018). Even more so, while the MDL model showed discouraging performance w.r.t. adversarial attacks when trained on 100% of the data, we observe that with less training data, MT-DNN_MDL reduces the difference in overall adapted Resilience to BERT_SDL from 3.8pp at 100% training data to 1.5pp at 10% training data (see Figure 2b). As shown in Figure 2a, this is due to MT-DNN_MDL approaching the adapted Resilience of BERT_SDL against the negation and paraphrase attacks.
Table 8: Train data ratio performance on the test set.
Our analysis reveals that the amount of training data has a direct negative impact on model robustness. As most (if not all) datasets inevitably inherit the biases of their annotators (Geva et al., 2019), we assume this negative impact on robustness is due to overfitting on biases in the training data. Hence, less training data leads to less overfitting on these biases, which in turn leads to a higher robustness towards certain attacks that target these biases. For instance, the word “not” in the negation attack can be a bias that adheres to negative class labels (Niven and Kao, 2019). Likewise, an overall shift in the distribution of some words due to the paraphrase attack can interfere with a learned bias. We argue that spelling mistakes are unlikely to be learned as a bias for stance detection classes and the actual reason for the performance drop of the attack is due to the split of ungrammatical tokens into several sub-words (see section 4.2).
We introduced a StD benchmark system that combines TL and MDL and makes it possible to add and evaluate adversarial attack sets and low resource experiments. We include ten StD datasets from different domains in the benchmark and find the combination of TL and MDL to have a significant positive impact on performance. On five of the ten datasets, we achieve new state-of-the-art results. However, our analysis with three adversarial attacks reveals that, contrary to what is expected of TL and MDL, they result in a severe loss of robustness on our StD datasets, with scores often dropping well below SDL performance. We investigate the reasons for this observation by conducting low resource experiments and conclude that one major issue is the overfitting on biases in the vast amounts of training data used in our MDL approach.
Reducing the amount of training data for both SDL and MDL models narrows down the robustness anomaly between these two setups, but also lowers the test set performance. Hence, we recommend developing methods that integrate de-biasing strategies into multi-task learning approaches, for instance by letting the models learn which samples contain biases and should be penalized or ignored (Clark et al., 2019), to enhance robustness while still being able to leverage more (or all) of the available training data and thus maintain performance. We foster this work by publishing our dataset splits, models, and experimental code.
Figure 2: Adapted Resilience over different train data ratios. (a) Difference in adapted Resilience between BERT_SDL and MT-DNN_MDL for all train data ratios.
In the future, we plan to combine methods that cope with biased data (Clark et al., 2019; He et al., 2019) with MDL and to experiment with sampling methods which aim to reduce the training data to the samples that are necessary to learn the task (Prabhu et al., 2019; Ruder and Plank, 2017). In regard to adversarial attacks, we also aim to concentrate on task-specific adversarial attacks and use insights of adversarial attacks to build defences for the models (Pruthi et al., 2019; Wang et al., 2018b).
This work has been funded by the German Federal Ministry of Education and Research (BMBF) under the promotional reference 03VP02540 (ArgumenText).
Roy Bar-Haim, Indrajit Bhattacharya, Francesco Dinuzzo, Amrita Saha, and Noam Slonim. 2017. Stance classification of context-dependent claims. In EACL’17, pages 251–261.
Yonatan Belinkov and Yonatan Bisk. 2017. Synthetic and natural noise both break neural machine translation. arXiv preprint arXiv:1711.02173.
Sihao Chen, Daniel Khashabi, Wenpeng Yin, Chris Callison-Burch, and Dan Roth. 2019. Seeing things from a different angle: Discovering diverse perspectives about claims. In NAACL’19, pages 542–557.
Christopher Clark, Mark Yatskar, and Luke Zettlemoyer. 2019. Don’t take the easy way out: Ensemble based methods for avoiding known dataset biases. In EMNLP-IJCNLP’19, pages 4067–4080.
Leon Derczynski, Kalina Bontcheva, Maria Liakata, Rob Procter, Geraldine Wong Sak Hoi, and Arkaitz Zubiaga. 2017. Semeval-2017 task 8: Rumoureval: Determining rumour veracity and support for rumours. arXiv preprint arXiv:1704.05972.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
William Ferreira and Andreas Vlachos. 2016. Emergent: a novel data-set for stance classification. In NAACL’16, pages 1163–1168.
Mor Geva, Yoav Goldberg, and Jonathan Berant. 2019. Are we modeling the task or the annotator? an investigation of annotator bias in natural language understanding datasets. In EMNLP-IJCNLP’19, pages 1161–1166.
Max Glockner, Vered Shwartz, and Yoav Goldberg. 2018. Breaking nli systems with sentences that require simple lexical inferences. In ACL’18 (Volume 2: Short Papers), pages 650–655.
Genevieve Gorrell, Ahmet Aker, Kalina Bontcheva, Leon Derczynski, Elena Kochkina, Maria Liakata, and Arkaitz Zubiaga. 2019. Semeval-2019 task 7: Rumoureval, determining rumour veracity and support for rumours. In SemEval-2019, pages 845–854.
Ivan Habernal, Henning Wachsmuth, Iryna Gurevych, and Benno Stein. 2018. The argument reasoning comprehension task: Identification and reconstruction of implicit warrants. In NAACL’18, pages 1930–1940.
Andreas Hanselowski, Avinesh PVS, Benjamin Schiller, Felix Caspelherr, Debanjan Chaudhuri, Christian M. Meyer, and Iryna Gurevych. 2018. A retrospective analysis of the fake news challenge stance-detection task. In COLING’18, pages 1859–1874.
Andreas Hanselowski, Christian Stab, Claudia Schulz, Zile Li, and Iryna Gurevych. 2019. A richly annotated corpus for different tasks in automated fact-checking. In CoNLL’19, pages 493–503.
Kazi Saidul Hasan and Vincent Ng. 2013. Stance classification of ideological debates: Data, models, features, and constraints. In IJCNLP’13, pages 1348–1356.
He He, Sheng Zha, and Haohan Wang. 2019. Unlearn dataset bias in natural language inference by fitting the residual. In DeepLo’19, pages 132–142.
Pierre Isabelle, Colin Cherry, and George Foster. 2017. A challenge set approach to evaluating machine translation. arXiv preprint arXiv:1704.07431.
Yan Jiang. 2019. Using machine learning for stance detection. Master’s thesis, The University of Texas at Austin, 1.
J. Peter Kincaid, Robert P. Fishburne Jr., Richard L. Rogers, and Brad S. Chissom. 1975. Derivation of new readability formulas (automated readability index, fog count and flesch reading ease formula) for navy enlisted personnel. Institute for Simulation and Training, University of Central Florida.
Guillaume Klein, Yoon Kim, Yuntian Deng, Jean Senellart, and Alexander M. Rush. 2017. OpenNMT: Open-source toolkit for neural machine translation. In ACL’17, pages 67–72.
Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019. Multi-task deep neural networks for natural language understanding. In ACL’19, pages 4487–4496.
Taylor Mahler, Willy Cheung, Micha Elsner, David King, Marie-Catherine de Marneffe, Cory Shain, Symon Stevens-Guille, and Michael White. 2017. Breaking NLP: Using Morphosyntax, Semantics, Pragmatics and World Knowledge to Fool Sentiment Analysis Systems. EMNLP’17, pages 33–39.
Jonathan Mallinson, Rico Sennrich, and Mirella Lapata. 2017. Paraphrasing revisited with neural machine translation. In EACL’17, pages 881–893.
Pasquale Minervini and Sebastian Riedel. 2018. Adversarially regularising neural nli models to integrate logical background knowledge. In CoNLL’18, pages 65–74.
Saif Mohammad, Svetlana Kiritchenko, Parinaz Sobhani, Xiaodan Zhu, and Colin Cherry. 2016. Semeval-2016 task 6: Detecting stance in tweets. In SemEval-2016, pages 31–41.
Aakanksha Naik, Abhilasha Ravichander, Norman Sadeh, Carolyn Rose, and Graham Neubig. 2018. Stress test evaluation for natural language inference. In COLING’18, pages 2340–2353.
Timothy Niven and Hung-Yu Kao. 2019. Probing neural network comprehension of natural language arguments. In ACL’19, pages 4658–4664.
Dean Pomerleau and Delip Rao. 2017. The Fake News Challenge: Exploring how artificial intelligence technologies could be leveraged to combat fake news. http://www.fakenewschallenge.org/. [Online; accessed 06-January-2020].
Kashyap Popat, Subhabrata Mukherjee, Jannik Strötgen, and Gerhard Weikum. 2017. Where the truth lies: Explaining the credibility of emerging claims on the web and social media. In WWW’17, pages 1003–1012.
Kashyap Popat, Subhabrata Mukherjee, Andrew Yates, and Gerhard Weikum. 2019. STANCY: Stance classification based on consistency cues. In EMNLP-IJCNLP’19, pages 6412–6417.
Ameya Prabhu, Charles Dognin, and Maneesh Singh. 2019. Sampling bias in deep active classification: An empirical study. In EMNLP-IJCNLP’19, pages 4056–4066.
Danish Pruthi, Bhuwan Dhingra, and Zachary C. Lipton. 2019. Combating adversarial misspellings with robust word recognition. In ACL’19, pages 5582–5591.
Sebastian Ruder. 2017. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098.
Sebastian Ruder and Barbara Plank. 2017. Learning to select data for transfer learning with Bayesian optimization. In EMNLP’17, pages 372–382.
Claudia Schulz, Steffen Eger, Johannes Daxenberger, Tobias Kahse, and Iryna Gurevych. 2018. Multi-task learning for argumentation mining in low-resource settings. In NAACL’18, pages 35–41.
Parinaz Sobhani, Diana Inkpen, and Xiaodan Zhu. 2017. A dataset for multi-target stance detection. In EACL’17, pages 551–557.
Swapna Somasundaran and Janyce Wiebe. 2010. Recognizing stances in ideological on-line debates. In NAACL-HLT’10, pages 116–124.
Christian Stab, Tristan Miller, Benjamin Schiller, Pranav Rai, and Iryna Gurevych. 2018. Cross-topic argument mining from heterogeneous sources. In EMNLP’18, pages 3664–3674.
Matt Thomas, Bo Pang, and Lillian Lee. 2006. Get out the vote: Determining support or opposition from congressional floor-debate transcripts. In EMNLP’06, pages 327–335.
James Thorne, Andreas Vlachos, Christos Christodoulopoulos, and Arpit Mittal. 2019. Evaluating adversarial attacks against multiple fact verification systems. In EMNLP-IJCNLP’19, pages 2944–2953.
Marilyn A Walker, Pranav Anand, Rob Abbott, Jean E Fox Tree, Craig Martell, and Joseph King. 2012a. That is your evidence?: Classifying stance in on-line political debate. Decision Support Systems, 53(4):719–729.
Marilyn A Walker, Jean E Fox Tree, Pranav Anand, Rob Abbott, and Joseph King. 2012b. A corpus for research on deliberation and debate. In LREC’12, pages 812–817.
Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. 2018a. GLUE: A multi-task benchmark and analysis platform for natural language understanding. In EMNLP’18 Workshop BlackboxNLP, pages 353–355.
Derek Wang, Chaoran Li, Sheng Wen, Yang Xiang, Wanlei Zhou, and Surya Nepal. 2018b. Defensive collaborative multi-task training-defending against adversarial attack towards deep neural networks. arXiv preprint arXiv:1803.05123.
Yicheng Wang and Mohit Bansal. 2018. Robust machine comprehension models via adversarial training. In NAACL’18, pages 575–581.
Penghui Wei, Wenji Mao, and Daniel Zeng. 2018. A target-guided neural memory model for stance detection in Twitter. In IJCNN’18, pages 1–8.
Karl Weiss, Taghi M Khoshgoftaar, and DingDing Wang. 2016. A survey of transfer learning. Journal of Big Data, 3(1):9.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Ruoyao Yang, Wanying Xie, Chunhua Liu, and Dong Yu. 2019. BLCU NLP at SemEval-2019 task 7: An inference chain-based GPT model for rumour evaluation. In SemEval-2019, pages 1090–1096.
Qiang Zhang, Emine Yilmaz, and Shangsong Liang. 2018. Ranking-based method for news stance detection. In WWW’18, pages 41–42.
A.1 Adversarial Attacks on Stance Detection Models
Table 9 shows the absolute performance scores of MT-DNN (all datasets with subscript MDL) and BERT (all datasets with subscript SDL). All absolute scores are in Fmacro. The numbers in parentheses in the Avg. column give the relative drop compared with the respective score on the test set. Bold numbers in a column mark the better score of the MDL and SDL models on an adversarial attack set.
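As a minimal sketch of how the parenthesized values could be derived, assume the relative drop is the difference between the test-set score and the adversarial-set score, divided by the test-set score and expressed in percent. This definition, the function name `relative_drop`, and the example scores are illustrative assumptions and are not taken from Table 9.

```python
def relative_drop(test_score: float, adversarial_score: float) -> float:
    """Relative drop in percent from the test-set score to the adversarial-set score.

    Assumed definition: (test - adversarial) / test * 100.
    A positive value means the model performs worse on the adversarial set.
    """
    if test_score <= 0:
        raise ValueError("test_score must be positive")
    return (test_score - adversarial_score) / test_score * 100.0


if __name__ == "__main__":
    # Hypothetical Fmacro scores, chosen only to illustrate the arithmetic.
    test_f_macro = 0.80
    adversarial_f_macro = 0.60
    drop = relative_drop(test_f_macro, adversarial_f_macro)
    print(f"relative drop: {drop:.1f}%")
```

Under this reading, a model whose score falls from 0.80 to 0.60 loses a quarter of its test-set performance, which makes drops comparable across datasets with very different absolute scores.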
Table 9: Comparison of MT-DNN (all datasets with subscript MDL) and BERT (all datasets with subscript SDL). All absolute scores are in Fmacro.