Computational Argumentation is a rapidly emerging discipline within the Natural Language Processing community (Reed 2016), dealing with various sub-tasks such as argument detection (Lippi and Torroni 2016; Ein-Dor et al. 2019), stance detection (Bar-Haim et al. 2017) and argument clustering (Reimers et al. 2019).
Recently, IBM introduced Project Debater, the first AI system able to debate humans on complex topics. The system participated in a live debate against a world champion debater, and was able to mine arguments, use them for composing a speech supporting its side of the debate, and also rebut its human competitor.1 The underlying technology is intended to enhance decision-making.
More recently, IBM also introduced Speech by Crowd, a service which supports the collection of free-text arguments from large audiences on debatable topics to generate meaningful narratives (Toledo et al. 2019). An important sub-task of this service is automatic assessment of argument quality, which is the focus of the present work. Detecting argument quality is a prominent task due to its importance in automated decision making (Bench-Capon, Atkinson, and McBurney 2009), argument search (Wachsmuth et al. 2017b), and writing support (Stab and Gurevych 2014).
Earlier work on assessing argument quality relied on a comparative pair-wise approach, aiming to identify the higher-quality argument within each pair of arguments (Habernal and Gurevych 2016; Simpson and Gurevych 2018). Recently Toledo et al. (2019) proposed a point-wise argument quality prediction scheme, which scales linearly with data size. Correspondingly, here we focus on this paradigm which is clearly less demanding, especially in scenarios where many arguments should be considered.
A major contribution of this work is introducing a novel dataset of arguments, carefully annotated for point-wise quality, IBM-ArgQ-Rank-30kArgs, referred to henceforth as IBM-Rank-30k. The dataset includes around 30k arguments, 5 times larger than the largest annotated point-wise data released to date (Toledo et al. 2019). Similarly to Toledo et al. (2019), arguments were collected actively – as opposed to being extracted from debate portals (Habernal and Gurevych 2016) – with strict length limitations, accompanied by extensive quality control measures.
Although a continuous argument quality score seems natural, asking annotators to provide a continuous score per argument will probably introduce a subjective scale that varies from one annotator to another, hindering downstream analysis. Instead, we take a simplified approach, asking each annotator to answer a binary question per argument, indicating if its quality is satisfactory in a particular context. The question remains of how to extract a continuous quality score out of the binary annotations provided by many annotators. Toledo et al. (2019) took a straightforward simple-average approach; here, we provide an extensive comparison between potential scoring functions, analyzing the differences between these functions and their impact on training and evaluating a learning algorithm.
While an exact definition of argument quality is potentially elusive, it seems clear that it is a function of various linguistic phenomena. As exemplified in Table 1, low quality can be manifested by dimensions such as bad grammar and low clarity (Row 3), or lack of impact and relevance (Row 4). In contrast, high quality arguments are typically clear, relevant, and with high impact (Rows 1 and 2). A model that aims to automatically infer argument quality should take such subtleties into account. A recent development in this context is that of deep contextual language models, such as ELMo (Peters et al. 2018) and BERT (Devlin et al. 2018). Due to its bidirectional nature, BERT achieves remarkable results when fine-tuned to different tasks without the need for specific modifications per task. As part of this work, we introduce various neural methods that exploit the value of BERT for our task. In particular, we suggest a model that outperforms several baselines on our data. Our experimental results further indicate that this method is either comparable to or outperforms state-of-the-art methods on previously released data.
Table 1: Examples of high (rows 1-2) and low (rows 3-4) quality arguments from the IBM-Rank-30k dataset. The label is a weighted aggregation of annotations.
The main contributions of our work are: (1) Introducing a carefully annotated argument quality dataset which is the largest of its kind; (2) Conducting extensive analysis of different approaches to induce a quality label from given binary annotations; (3) Proposing a BERT-based method to predict argument quality, and reporting the results of extensive experiments that convey the potential of this method.
Assessing argument quality is a long-standing challenge. For centuries there has been a multi-disciplinary effort to define and research aspects of quality in argumentation (Aristotle, Kennedy, and Kennedy 1991; Walton, Reed, and Macagno 2008; Perelman and Olbrechts-Tyteca 1969). A core issue in the field is the presumed subjectivity of the task at hand. There have been several practical and theoretical approaches to overcoming the supposed lack of objectivity.
Swanson, Ecker, and Walker (2015) approach argument quality as a point-wise ranking task, with the goal of selecting argument segments that clearly express an argument facet in a given dialogue. Arguments are labeled by a real value in the range of [0, 1], where a score of 1 indicates that an argument can be easily interpreted. They then develop an automatic regression method using these labels. Their corpus, which we refer to henceforth as SwanRank, contains 5.3k labeled arguments.
An alternative approach to assess arguments is to focus on their relative convincingness, by comparing pairs of arguments with similar stance. This approach is introduced in Habernal and Gurevych (2016) and Simpson and Gurevych (2018), and further assessed in Potash, Bhattacharya, and Rumshisky (2017), Gleize et al. (2019), and Potash, Ferguson, and Hazen (2019). As part of their work, Habernal and Gurevych (2016) introduce two datasets: UKPConvArgRank (henceforth, UKPRank) and UKPConvArgAll, which contain 1k and 16k arguments and argument-pairs, respectively. In their work, the point-wise scores are induced from the pair-wise labels, rather than annotated directly. Gleize et al. (2019) focus on ranking the convincingness of evidence. Their solution is based on a Siamese neural network, which outperforms the results achieved in Simpson and Gurevych (2018) on the UKP datasets, as well as several baselines on their own dataset, IBM-ConvEnv. Potash, Bhattacharya, and Rumshisky (2017) present a method that is based on representing an argument by the sum of its token embeddings, extended in Potash, Ferguson, and Hazen (2019) to include a Feed Forward Neural Network. These two works outperform Simpson and Gurevych (2018) on the UKP datasets for the pair-wise and point-wise tasks, respectively.
Durmus, Ladhak, and Cardie (2019) present a new dataset comprised of over 47k claims in 471 topics from the website kialo.com, aimed at evaluating the effect of pragmatic and discourse context when determining argument quality. They propose models to predict the impact value of each claim, as determined by the users of the website. Their dataset is somewhat different from ours as it focuses on argument impact, rather than overall quality, and doing so in the context of an argumentative structure, instead of independently. In addition, their impact values are based on spontaneous input from users of the website, whereas our dataset was carefully annotated with clear guidelines. Still, it further highlights the importance of this field.2
Toledo et al. (2019) consider both point-wise and pair-wise quality approaches, as well as the interaction between them. They introduce two datasets: IBMRank, which contains 5.3k point-wise labeled arguments (6.3k before cleansing) and IBMPairs, which contains 9.1k labeled argument-pairs (14k before cleansing). Arguments in their datasets, collected in the context of Speech by Crowd experiments, are suited to the use-case of civic engagement platforms, giving premium to the usability of an argument in oral communication. Our dataset differs in three respects: (1) our dataset is larger by a factor of 5 compared to previous datasets annotated for point-wise quality; (2) our data were collected mainly from crowd contributors that presumably better represent the general population compared to targeted audiences such as debate clubs; (3) we performed an extensive analysis of argument scoring methods and introduce superior scoring methods that consider annotators' credibility without removing them entirely from the labeled data, as is done in Toledo et al. (2019).
In the next two sections, we present the creation of the IBM-Rank-30k dataset. First, we describe the process of argument collection. We then move to describing how arguments were annotated for quality. Finally, we discuss and analyze how we derive point-wise continuous quality labels from binary annotations, by conducting a comprehensive comparison between different scoring functions. We release this dataset as part of this work.3
3.1 Argument Collection
For the purpose of collecting arguments for the IBM-Rank-30k dataset, we conducted a crowd annotation task, using the Figure Eight platform.4 A small portion of arguments (8.6%) was also collected from expert annotators who work closely with our team. We selected 71 common controversial topics, for which arguments were collected (e.g., We should abolish capital punishment).
We follow guidelines similar to those presented in Toledo et al. (2019). Annotators were presented with a single topic each time, and asked to contribute one supporting and one contesting argument for it, requiring arguments to be written using original language. To motivate high-quality contributions, contributors were informed that they would receive extra payment for high-quality arguments, as determined by the subsequent argument quality labeling task (Section 3.2). It was explained that an argument would be considered high-quality if a person preparing a speech on the topic would be likely to use this argument as is in her speech.
Similarly to Toledo et al. (2019), we place a limit on argument length - a minimum of 35 characters and a maximum of 210 characters. In total, we collected 30,497 arguments from 280 contributors, each contributing no more than 6 arguments per topic.
3.2 Argument Quality Annotations Collection
In this section we describe the argument quality annotation process, performed for all collected arguments. As above, we used the Figure Eight platform, with 10 annotators per argument. Following Toledo et al. (2019), annotators were presented with a binary question per argument, asking if they would recommend a friend to use that argument as is in a speech supporting/contesting the topic, regardless of personal opinion. In addition, annotators were asked to mark the stance of the argument towards the topic (pro or con).
To monitor and ensure the quality of the collected annotations, we employed the following measures introduced in Toledo et al. (2019):
Test Questions. Before the labeling of 1/5 of the arguments, a hidden test question about the stance of the argument towards the topic was presented, aimed to verify the annotator is reading the argument carefully. Annotators that failed more than 20% of the test questions were removed from the task, and their judgments were ignored. Typically of the contributors were removed from each sub-task due to this reason.
Annotator-reliability score. Defined in Toledo et al. (2019) as the Annotator-κ for a single score (and task-average-κ when averaged on all valid scores) and denoted here as Annotator-Rel. This score was used both to monitor tasks in real time, and as a basis for the weighted score function described in Section 4. It is obtained by averaging all pair-wise κ for a given annotator, with other annotators that share at least 50 common judgments. Annotators who do not share at least 50 common judgments with at least 5 other annotators do not receive a value for this score.
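The reliability score described above can be sketched in a few lines. This is a minimal illustration, assuming that the pair-wise agreement statistic is Cohen's kappa over binary labels; the function names, the input layout, and the handling of the degenerate case are hypothetical choices, not the original implementation.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length lists of binary (0/1) labels."""
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    pa = sum(labels_a) / n
    pb = sum(labels_b) / n
    expected = pa * pb + (1 - pa) * (1 - pb)
    if expected == 1.0:
        # Kappa is undefined when both annotators give one identical
        # constant label; returning 0.0 here is an assumption.
        return 0.0
    return (observed - expected) / (1 - expected)


def annotator_rel(judgments, min_common=50, min_peers=5):
    """judgments: {annotator: {item_id: 0 or 1}}.

    Returns {annotator: mean pair-wise kappa, or None when the annotator
    has fewer than `min_peers` peers sharing `min_common` judgments}.
    """
    scores = {}
    for a, a_labels in judgments.items():
        kappas = []
        for b, b_labels in judgments.items():
            if a == b:
                continue
            common = sorted(a_labels.keys() & b_labels.keys())
            if len(common) >= min_common:
                kappas.append(cohen_kappa([a_labels[i] for i in common],
                                          [b_labels[i] for i in common]))
        scores[a] = sum(kappas) / len(kappas) if len(kappas) >= min_peers else None
    return scores
```

The defaults mirror the thresholds stated in the text (50 common judgments, 5 peers); smaller values are convenient when trying the sketch on toy data.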
The average of valid annotators' reliability scores on the quality annotations is 0.12. This task reproduces the task from Toledo et al. (2019), and as established there, such an average is acceptable due to the subjectivity of the task.5 Among other things, Toledo et al. (2019) rely on the task-average-κ of the stance annotations in the same task, which was 0.69 in that work, and here it is 0.83.
To further ensure high quality annotations, rather than introducing the task to any crowd worker, we used a selected group of 600 crowd annotators, who had high Annotator-reliability in past tasks of our team.
As mentioned in section 3.2, for each argument several annotators answered a binary question regarding its quality. We chose this format to simplify the annotation process of this subjective question, aiming to avoid an additional subjective element - the scale. However, we still need to provide a single continuous score per argument, reflecting its quality, that can be compared with the scores of other arguments. To that end, we evaluate approaches to derive such a quality score, on a bounded scale, from a set of binary annotations.
4.1 Quality Scoring Functions
First, we describe two scoring functions. Each function provides the likelihood of the positive label, between 0 (bad quality) and 1 (good quality).
MACE probability (MACE-P) - Habernal and Gurevych (2016) suggested MACE (Hovy et al. 2013) as a scoring function for the quality of an argument based on crowd annotations. MACE is an unsupervised item-response generative model which predicts the probability for each label given the annotations. MACE also estimates a reliability score for each annotator which it then uses to weigh this annotator’s judgments. We use the probability MACE outputs for the positive label as the MACE-P scoring function.
Weighted-Average (WA) - We suggest using a weighted-average score to incorporate annotator-reliability, in the spirit of MACE. This is designed to decrease the influence of non-reliable annotators on the final quality score, thus providing an intuitive and gradual form of data cleansing. For each argument a, we define $P_a$ as the set of annotators who labeled it as positive, and $N_a$ as the set of annotators who labeled it as negative. The WA score of an argument a is defined as:
$$\text{WA}(a) = \frac{\sum_{i \in P_a} \text{Annotator-Rel}_i}{\sum_{i \in P_a \cup N_a} \text{Annotator-Rel}_i}$$
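As a minimal sketch, a reliability-weighted average of binary annotations can be computed as the total reliability of the annotators who labeled the argument as positive, divided by the total reliability of all annotators who labeled it. The function name, the input layout, and the choice to skip annotators without a valid reliability score are assumptions made for illustration.

```python
def wa_score(annotations, reliability):
    """Reliability-weighted share of positive labels for one argument.

    annotations: {annotator: 0 or 1}
    reliability: {annotator: reliability weight}
    Annotators with no reliability value are skipped (an assumption).
    Returns None when the total weight is not positive.
    """
    weighted = [(reliability[a], label)
                for a, label in annotations.items()
                if reliability.get(a) is not None]
    total = sum(w for w, _ in weighted)
    if total <= 0:
        return None
    positive = sum(w for w, label in weighted if label == 1)
    return positive / total
```

For example, two positive votes with weights 0.5 and 0.25 against one negative vote with weight 0.25 give a score of 0.75, whereas a simple average of the same votes would give 2/3.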
There is a clear distinction between the distribution of scores obtained by WA and MACE-P scoring functions (see Figure 1). WA outputs values close to 0 or 1 only if there is a strong annotation consensus. As generally there are more positive annotations in our data, the histogram of this scoring function is skewed towards 1, with an almost linear decrease. On the other hand, MACE assigns probabilities to both labels. As a result, the quality scores lean strongly to both extreme values, creating a U-shaped histogram.
Figure 1: Histogram of arguments according to MACE-P and WA quality scores. X axis - quality scores. Y axis - counts of arguments in IBM-Rank-30k.
4.2 Comparing Scoring Functions
Next, we compare the quality scoring functions described above via the following three experiments.
Disagreements in choosing the better argument. We first ask which of the two methods is ‘correct’ more often in the cases where, given a pair of arguments, they disagree on which argument should get a higher score. We created all possible argument pairs from IBM-Rank-30k, such that arguments in the same pair are taken from the same topic and have the same stance towards it (to avoid annotator bias).6
We focus on the set of pairs in which the two methods disagree on the preferred argument, i.e. the one that received a higher score (about 20% of the pairs). We sample 850 such disagreement pairs and send them for pair-wise annotation, i.e., asking the annotators to choose the better of the two arguments. Each pair was annotated by 7 annotators from our selected group of crowd annotators. We discarded pairs for which the agreement between annotators was less than 70% (this filtered out 27% of the pairs). In 55% of the remaining pairs the annotators chose the argument preferred by MACE-P, making this method somewhat more correlated with the pair-wise judgments. Both methods were also compared to simple-average, the scoring method applied in Toledo et al. (2019). In 61% of the pairs that differ between simple-average and MACE-P, annotators chose the argument preferred by the latter. In 59% of the pairs that differ between simple-average and WA, annotators chose the argument preferred by WA. As the simple-average method was found inferior, it was omitted from subsequent evaluation experiments.
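The pair construction and filtering in this experiment can be sketched as follows. The data layout (dictionaries with `id`, `topic`, and `stance` keys) and both function names are hypothetical; ties, where either function gives both arguments the same score, are treated here as non-disagreements, which is an assumption.

```python
from itertools import combinations


def disagreement_pairs(arguments, score_a, score_b):
    """Same-topic, same-stance pairs on which two scoring functions
    prefer different arguments.

    arguments: list of dicts with 'id', 'topic' and 'stance' keys.
    score_a, score_b: {argument id: score}.
    """
    groups = {}
    for arg in arguments:
        groups.setdefault((arg["topic"], arg["stance"]), []).append(arg["id"])
    pairs = []
    for ids in groups.values():
        for x, y in combinations(ids, 2):
            # Opposite signs mean the two functions rank the pair differently.
            if (score_a[x] - score_a[y]) * (score_b[x] - score_b[y]) < 0:
                pairs.append((x, y))
    return pairs


def majority_if_agreed(votes, min_agreement=0.7):
    """Return 'first' or 'second' if enough annotators agree, else None."""
    first = sum(1 for v in votes if v == "first")
    second = len(votes) - first
    winner, count = ("first", first) if first >= second else ("second", second)
    return winner if count / len(votes) >= min_agreement else None
```

With 7 annotators per pair, the 70% threshold keeps a pair only when at least 5 of them choose the same argument.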
Agreement with pair-wise annotation. In this experiment, we present each scoring function with pairs of arguments, and compare their preferred arguments with a gold standard obtained by pair-wise annotation, following the consistency evaluation performed in Toledo et al. (2019). Similar to the previous experiment, we generate all argument pairs (of the same topic and stance). Next, we bin all pairs into four sets by the size of the delta between the scores of their arguments (e.g., all pairs with score difference between 0.25 and 0.5). From each such delta bin we randomly sample 150 pairs and send them for pair-wise annotation as described in the previous experiment. We repeat this process for both scoring functions.
We calculate the precision of each scoring function in each delta bin, based on its agreement with the pair-wise annotation. Table 2 depicts the results for MACE-P, and Table 3 for WA. The results show that as the score difference increases (larger delta bin), precision also increases (for both scoring functions). Interestingly, this tendency is more prominent for WA, reaching a perfect match when the difference in point-wise quality scores is higher than 0.75.7 The tables also depict the percent of pairs filtered due to low (less than 70%) agreement between annotators. Interestingly, in the 3 bins with more than 0.25 difference in point-wise scores, fewer pairs are filtered when using WA, suggesting that when WA is confident about which argument is better, annotators also tend to agree more often, which is not the case for MACE-P.
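The per-bin precision can be sketched as below. The bin edges (four bins of width 0.25 over the score difference) are an assumption inferred from the examples in the text, and the input layout and function name are hypothetical.

```python
def precision_by_delta_bin(pairs, edges=(0.0, 0.25, 0.5, 0.75, 1.0)):
    """Share of pairs, per score-difference bin, where the point-wise
    scores prefer the same argument as the pair-wise gold label.

    pairs: iterable of (score_first, score_second, gold), where gold is
           'first' or 'second'.
    Returns one precision per bin, or None for an empty bin.
    """
    n_bins = len(edges) - 1
    hits = [0] * n_bins
    totals = [0] * n_bins
    for s1, s2, gold in pairs:
        delta = abs(s1 - s2)
        if delta == 0:
            continue  # equal scores express no preference
        predicted = "first" if s1 > s2 else "second"
        for i in range(n_bins):
            if edges[i] < delta <= edges[i + 1]:
                totals[i] += 1
                hits[i] += predicted == gold
                break
    return [h / t if t else None for h, t in zip(hits, totals)]
```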
Table 2: Comparing MACE-P scoring function preference to gold standard of pair-wise annotation.
Split annotations consistency. A desirable property for a scoring function is that it will be relatively consistent with respect to different sets of annotators; namely, if we split the binary annotations into two sets and construct the continuous annotation from each set independently, we will end up with approximately the same score.
Table 3: Comparing WA scoring function preference to gold standard of pair-wise annotation.
To examine that, we randomly split the 10 binary annotations we have for each argument into two sets (each containing 5 annotations), similar to Habernal et al. (2018). We then calculate two quality scores for each argument, based on these two smaller annotation sets, respectively. While Habernal et al. (2018) utilized agreement to measure consistency, we can measure correlation between the two continuous scores.
We note that since each score was calculated on a set which is half the size of the original annotation set, we have less information for assessing the reliability score of each annotator. This fact can harm the accuracy of the scoring functions, as they both utilize annotators' reliability scores. Nevertheless, for both scoring functions we considered, we find a good correlation between the scores calculated over the two smaller annotation sets. WA achieves 0.42 and 0.36 Pearson and Spearman correlations, respectively. MACE-P achieves 0.42 in both.
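The split-half procedure can be illustrated with a short sketch. For brevity it scores each half with a simple average, which is a stand-in and not one of the two scoring functions compared here; the correlation helpers are plain textbook implementations, and all names are hypothetical.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


def split_half_scores(labels, rng):
    """Randomly split one argument's binary labels into two halves and
    score each half (simple average used as a stand-in scorer)."""
    shuffled = list(labels)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    first, second = shuffled[:half], shuffled[half:]
    return sum(first) / len(first), sum(second) / len(second)
```

Collecting the two half-scores over all arguments and passing the two resulting lists to `pearson` and `spearman` gives consistency figures of the kind reported above.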
In summary, we point out a key difference between the two scoring functions: the tendency of WA to present a gradual continuous scale, as opposed to MACE-P, which aims at discovering the ‘true’, hence binary, labels. For this reason we tend to prefer WA as a scoring function for this task, which inherently requires deriving a non-binary score. However, as our experiments do not show a clear preferred function, we utilize both for the evaluation of neural methods in Section 6.1. For brevity, in the analyses in sections 5 and 7 we only use WA scores. Finally, in the dataset we release as part of this work, we include the quality scores of both scoring functions. We conclude that the induction of a quality score on top of existing annotations is not necessarily trivial, and should be carefully considered, as it impacts the score distribution and the performance of learning algorithms trained on the score.
We seek to further explore the consistency and accuracy of the IBM-Rank-30k dataset. To that end, we employ a quality dimensions model, i.e. a model that decomposes a holistic quality score to several dimensions. Such a model was suggested by Wachsmuth et al. (2017a) - a meta-model that was created via a broad literature survey on previous quality models. They decompose quality to 15 sub-dimensions that aim to determine fine-grained properties of argument quality.
We conducted an annotation task similar in spirit to Wachsmuth et al. (2017a), in the context of the IBM-Rank-30k dataset. Our goal is to quantitatively assess the reasons for arguments having low or high quality, through the prism of this theoretical model. First, we decided to exclude 5 dimensions from this task, as they lack high potential to embody relevant characteristics over our data.8 We split the IBM-Rank-30k dataset into 5 equally populated bins according to the WA quality scores (1-5, where 1 is the lowest quality bin), and randomly sampled 100 arguments with a uniform distribution over the bins. Each argument was labeled by 3 expert annotators that have extensive background in related tasks of our team. The annotators were not aware of the original quality bins. We asked the annotators to annotate each argument according to each of the 10 dimensions on a scale of 1-3, to stay consistent with the scale offered by Wachsmuth et al. (2017a), and calculated the average for each dimension.
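The binning and sampling step can be sketched as follows, assuming that "uniform distribution over the bins" means an equal number of arguments drawn from each bin (20 per bin for a sample of 100). Function names, the seed, and the tie-breaking by sort order are illustrative assumptions.

```python
import random


def equal_population_bins(scores, n_bins=5):
    """Map each argument id to a bin 1..n_bins (1 = lowest scores) so
    that bins hold (almost) the same number of arguments.

    scores: {argument id: quality score}
    """
    ordered = sorted(scores, key=scores.get)
    n = len(ordered)
    return {arg_id: (rank * n_bins) // n + 1
            for rank, arg_id in enumerate(ordered)}


def sample_uniformly(bins, total, n_bins=5, seed=0):
    """Draw total // n_bins argument ids at random from every bin."""
    rng = random.Random(seed)
    per_bin = total // n_bins
    sample = []
    for b in range(1, n_bins + 1):
        members = sorted(a for a, bin_id in bins.items() if bin_id == b)
        sample.extend(rng.sample(members, per_bin))
    return sample
```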
We observe that even though the task is complex, across all dimensions, the arguments from the highest bin achieve a higher average than those from the middle bins, and arguments from the middle bins achieve a higher average than those from the lower bins. The 2 dimensions that present the largest difference between bins 5 and 1 are Global Relevance and Effectiveness, with an average difference of 0.72 and 0.64, respectively. Global Relevance asks whether an argument provides information that helps to arrive at an ultimate conclusion regarding the discussed issue, while Effectiveness asks whether the argument is effective in helping to persuade in the author’s stance. These results suggest that the differences between low and high quality arguments in the IBM-Rank-30k dataset are best explained by how related the annotator found the argument to the topic, and how effectively the argument was presented. Such results corroborate several notions. Firstly, our quality labeling is broadly consistent with a known quality model. Moreover, the outcome serves as an initial proof of concept of decomposing quality in a large dataset to gain explainability. The detection of quality dimensions can be used in a range of applications, from more exact research on what determines quality, to feedback systems that can tell users more precisely what they need to improve to advance their argumentation skills.
We now move to apply the IBM-Rank-30k dataset to the task of learning to rank the quality of arguments. For this purpose, we evaluate the following methods, which include several neural methods, as well as some simpler baselines.
Arg-Length. Although we placed a strong limit on argument length, it is possible that there is still a bias towards longer arguments. To inspect this, we evaluate a ranking baseline based on an argument’s length in characters.
SVR BOW. We evaluate a Support Vector Regression ranker, implemented by the scikit-learn toolkit,9 with an RBF kernel and bag-of-words features, using the most frequent 1000 tokens in the training set.
Bi-LSTM GloVe. As a simple neural baseline we implement a Bi-LSTM model with self-attention, following the model used in Levy et al. (2018). The model was trained with a dropout of 0.15, an LSTM layer of size 128 and an attention layer of size 100. For input features we used the 300 dimensional GloVe embeddings (Pennington, Socher, and Manning 2014).
We use the following three methods based on BERT:
BERT-Vanilla (henceforth, BERT-V). In this network, for each argument, we concatenate the representations of the [CLS] token from the last 4 layers of BERT’s pre-trained model, resulting in a feature vector of size 4H, where H is the hidden size of the BERT model. The features are passed through a fully-connected hidden layer of size 100 with ReLU activation, after which we apply a sigmoid activation layer with a single output.
BERT-Finetune (henceforth, BERT-FT). This method fine-tunes BERT’s pre-trained model. The official code repository of BERT10 supports fine-tuning to classification tasks, which is done by applying a linear layer on the [CLS] token of the last layer of BERT’s model, which is then passed through a soft-max layer. The weights of the preceding layers are initialized with BERT’s pre-trained model, and the entire network is then trained on the new data. To adapt the fine-tuning process to a regression task, the following were performed: (1) Changing the label type to represent real values instead of integers; (2) Replacing the soft-max layer with a sigmoid function, to support a single output holding values in the range of [0,1]; (3) Modifying the loss function to calculate the Mean Squared Error of the logits compared to the labels.11
BERT-FTTOPIC. We also evaluate the addition of the topic to the input of BERT-FT. The topic is concatenated to the argument, separated by a [SEP] delimiter, and the model is fine-tuned as in BERT-FT.
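The regression adaptation and the topic-augmented input can be illustrated without any deep-learning library. The sketch below only shows the output transformation and the loss on plain Python numbers; it is not the fine-tuning code itself. The argument-then-topic order and the literal marker strings in `with_topic` are assumptions for illustration, since a real tokenizer inserts its special tokens itself.

```python
from math import exp


def sigmoid(z):
    """Squash a single real-valued output into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-z))


def mse_loss(logits, labels):
    """Mean squared error between sigmoid outputs and real-valued labels."""
    preds = [sigmoid(z) for z in logits]
    return sum((p - y) ** 2 for p, y in zip(preds, labels)) / len(labels)


def with_topic(argument, topic):
    """Illustrative two-segment input: argument, [SEP], topic."""
    return f"[CLS] {argument} [SEP] {topic} [SEP]"
```

Because the labels are quality scores in [0, 1], a sigmoid output paired with a squared-error loss keeps predictions on the same scale as the targets.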
6.1 Experiments on IBM-Rank-30k
For the purpose of evaluating our methods on the IBM-Rank-30k dataset, we split its 71 topics to 49 topics for training, 7 for tuning hyper-parameters and determining early stopping (dev set) and 15 for test.
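A topic-level split keeps all arguments of a topic in the same partition, so that test topics are unseen during training. The sketch below assigns topics at random with a fixed seed, which is an assumption; the text does not state how topics were assigned to each set.

```python
import random


def split_topics(topics, n_train=49, n_dev=7, n_test=15, seed=0):
    """Partition distinct topics into train, dev and test lists."""
    unique = sorted(set(topics))
    if len(unique) != n_train + n_dev + n_test:
        raise ValueError("split sizes must sum to the number of topics")
    random.Random(seed).shuffle(unique)
    train = unique[:n_train]
    dev = unique[n_train:n_train + n_dev]
    test = unique[n_train + n_dev:]
    return train, dev, test
```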
We present results for both WA and MACE-P scoring functions, aiming to shed some more light on their properties. All models were trained for 5 epochs over the training data, taking the best checkpoint according to the performance on the dev set, with a batch size of 32 and a learning rate of 2e-5. We calculate Pearson (r) and Spearman (ρ) correlations on the entire test set.
For significance testing, we use the Williams test (Williams 1959) which evaluates the significance of a difference in dependent correlations (Steiger 1980). The Williams test has been successfully used in Machine Translation in the evaluation of MT metrics (Graham and Baldwin 2014) and quality estimation (Graham 2015).
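For reference, the Williams test statistic can be computed directly from three correlations. In the sketch below, `r12` and `r13` are the correlations of two competing predictors with the gold scores, `r23` is the correlation between the two predictors, and `n` is the sample size; the statistic is compared against a t distribution with n - 3 degrees of freedom. The formula follows the form commonly used in the MT metric evaluation literature; the function name is hypothetical and the p-value lookup is omitted.

```python
from math import sqrt


def williams_t(r12, r13, r23, n):
    """Williams t statistic for the difference between two dependent
    correlations that share one variable (n - 3 degrees of freedom)."""
    k = 1 - r12 ** 2 - r13 ** 2 - r23 ** 2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * sqrt((n - 1) * (1 + r23))
    denominator = sqrt(2 * k * (n - 1) / (n - 3)
                       + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3)
    return numerator / denominator
```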
6.2 Results and Discussion
The results on the IBM-Rank-30k dataset are presented in Table 4. Using BERT-V improves on the Bi-LSTM GloVe method by .4-.6 points for Pearson correlation, and .2-.6 points for Spearman correlation. Fine-tuning BERT improves on BERT-V by .2-.4 points for both correlation measures. Adding the topic yields a statistically significant improvement of .1-.2 points (for both correlation measures and quality score methods). Interestingly, when using MACE-P scores, the model is able to achieve higher Spearman correlation for most methods. Also, argument length is not an indicator of quality, as evidenced by the poor performance of the Arg-Length baseline.
Table 4: Correlations on the IBM-Rank-30k test set.
An important property of a good model is that its performance increases when considering arguments that are on the extremes of the argument quality scale. To evaluate this, we define a cut-off percentile d as a view of the test data that considers only the top d and bottom d percent of the data. For each cut-off we calculate Pearson and Spearman correlations w.r.t the predictions of the BERT-FTTOPIC model. Figure 2 presents the correlations for cut-off percentiles ranging from 10% to 50% (50% is equivalent to taking the entire data), for the models trained on MACE-P and WA quality scores. A clear trend emerges in which the correlations increase as arguments are taken from a smaller percentile, i.e. from further extremes of the quality scale, reaching up to .71-.73 and .67 for Pearson and Spearman correlations, respectively, when considering only the bottom and top 10% of the test set for evaluation.
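The cut-off view can be sketched as below, assuming that the top and bottom d percent are selected by the gold quality score; the function names are hypothetical, and the Pearson helper is a plain textbook implementation included to keep the example self-contained.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def extremes_view(gold, pred, d):
    """Keep only the items in the bottom d% and top d% of gold scores.

    Returns the matching (gold, pred) sub-lists in original order.
    """
    order = sorted(range(len(gold)), key=gold.__getitem__)
    k = int(len(gold) * d / 100)
    if k == 0:
        return [], []
    keep = sorted(set(order[:k]) | set(order[-k:]))
    return [gold[i] for i in keep], [pred[i] for i in keep]
```

Calling `pearson(*extremes_view(gold, pred, d))` for d ranging from 10 to 50 produces one correlation per cut-off, as plotted in curves of this kind.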
6.3 Experiments on SwanRank and UKPRank
In this section we demonstrate the strength of the two best methods – BERT-FT and BERT-FTTOPIC – by applying them on two related datasets, SwanRank and UKPRank. We chose SwanRank as it also contains direct point-wise quality labels. UKPRank was chosen as an established dataset which also contains point-wise quality scores. However, it should be noted that as opposed to the IBM-Rank-30k and SwanRank datasets, the point-wise labels in UKPRank were induced from a pair-wise labeling task, and not obtained directly, as mentioned previously.
Figure 2: Pearson (left) and Spearman correlations of various cut-off percentiles for the BERT-FTTOPIC models, trained on data containing WA and MACE-P quality scores.
SwanRank dataset. In Swanson, Ecker, and Walker (2015) the SwanRank dataset was evaluated both in in-domain and cross-domain scenarios. In this work, we focus on the latter, more challenging scenario. Their model is an SVR with RBF kernel, with features coming from one of many feature sets they experiment with. They randomly split each topic to train (75%) and test sets, and run their model, with each possible feature set, on each train-test cross-domain pair. That is, the topics of the train and the test sets are distinct. We adapt to their approach as follows:
• For each topic, we create our own random split to train and test, as the exact split to train and test for each topic was not available to us.
• For each topic, we consider its best result achieved in Swanson, Ecker, and Walker (2015) in the cross-domain scenario. For example, for evaluating the gun control topic, the best result in Swanson, Ecker, and Walker (2015) is obtained by training on the gay marriage topic. By this we obtain 4 train-test pairs: gay marriage (GM, train)–gun control (GC, test), gun control–gay marriage, death penalty (DP)–evolution (EV), and evolution–death penalty. It should be noted that adapting to this setup puts our work at a disadvantage, because the best results for different pairs in Swanson, Ecker, and Walker (2015) are achieved by using different feature sets, as opposed to a single learning framework. Thus, the results we compare against represent an upper limit on their method rather than an actual obtained result.
• For each of the 4 pairs, we train our BERT-FT and BERT-FTTOPIC methods for 2 epochs on the train topic, and test the model on the test topic. We compute Root Relative Squared Error (RRSE), following Swanson, Ecker, and Walker (2015), and report results for each pair as well as the weighted-average.
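Root Relative Squared Error compares a model's squared error with that of a baseline that always predicts the mean of the actual values, so a score of 1 matches that baseline and lower is better. The sketch below uses the standard definition; weighting the per-pair results by test-set size in `weighted_average` is an assumption, as the weights are not specified in the text.

```python
from math import sqrt


def rrse(predicted, actual):
    """Root Relative Squared Error of predictions against actual values."""
    mean_actual = sum(actual) / len(actual)
    model_error = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    baseline_error = sum((a - mean_actual) ** 2 for a in actual)
    return sqrt(model_error / baseline_error)


def weighted_average(values, weights):
    """Average of values weighted by, e.g., the size of each test set."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)
```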
Results. Results on the SwanRank dataset are presented in Table 5. Using the BERT-FTTOPIC method yielded an average improvement of .8 points compared to the optimal result taken from Swanson, Ecker, and Walker (2015). Performance has improved in 3/4 test topics (gay marriage, gun control and death penalty), and decreased in the evolution topic. However, in Swanson, Ecker, and Walker (2015) the performance on the death penalty and evolution topics is low even in the easier in-domain task, presumably indicating they are much more difficult to predict.
Table 5: Weighted-average RRSE on the 4 topic pairs of the SwanRank dataset (a lower score is better). Swanson row: the result by averaging the best train-test pairs cross-domain results published in Swanson, Ecker, and Walker (2015).
UKPRank dataset. We conduct cross-validation where in each fold we train on 31 topics and test on the held-out topic. The model is evaluated after 5 training epochs. Following Simpson and Gurevych (2018), we report average Pearson (r) and Spearman (ρ) correlations, and compare the results of our methods to the Bi-LSTM and GPPL methods published there, as well as to the EviConvNet method, the best result from Gleize et al. (2019), and to the SWE+FFNN method, the best result from Potash, Ferguson, and Hazen (2019).
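The evaluation protocol above can be sketched as follows. This is a minimal, standard-library illustration rather than the code used in our experiments: the topic names are placeholders, and the helper names (`leave_one_topic_out`, `pearson`, `ranks`, `spearman`) are ours. Spearman correlation is computed as the Pearson correlation of average ranks.

```python
import math
from statistics import mean


def leave_one_topic_out(topics):
    """Yield (train_topics, held_out_topic) for each topic in turn."""
    for held_out in topics:
        yield [t for t in topics if t != held_out], held_out


def pearson(x, y):
    """Pearson correlation coefficient r between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the rank-transformed values."""
    return pearson(ranks(x), ranks(y))


# Placeholder topic identifiers: 32 topics give 32 folds of 31 training topics each.
topics = [f"topic_{i}" for i in range(32)]
folds = list(leave_one_topic_out(topics))
```

In each fold, a model would be trained on the 31 training topics and its predictions on the held-out topic scored with `pearson` and `spearman`; the reported figures are then the averages over folds.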
Results. The results on the UKPRank dataset are presented in Table 6. Both of our methods obtain Pearson correlations which are comparable to the SWE+FFNN, EviConvNet, and GPPL methods, and worse Spearman correlations. However, as the point-wise quality scores were obtained via a pair-wise proxy, previous methods assume the existence of a pair-wise labeled dataset for training, and EviConvNet, for example, trains on the pair-wise labels directly. Our method is less suitable for this setting, as it does not depend on any pairs being labeled for training, making this comparison less trivial.
Table 6: Average correlation on the UKPRank dataset.
When BERT is fine-tuned on a new task, the weights of the pre-trained model are updated, and as a result, the contextual representations of tokens change. In this section, we illustrate how these new embeddings are enriched with contextual quality properties. Using an anecdotal example, we show how quality is encoded in BERT’s token embeddings after the model is exposed to the IBM-Rank-30k dataset during training. We do this by comparing the contextual embeddings of a common token in the IBM-Rank-30k dataset, people, as retrieved from BERT’s pre-trained model and from the BERT-FT model. For the purpose of this analysis, we use the model fine-tuned on the IBM-Rank-30k dataset with WA scores. First, we split the data into 5 equally populated bins, according to the WA quality score.12 We then retrieve the contextual embeddings of people in a sample of 20 arguments which contain it, 10 from the lowest and 10 from the highest quality bins. Embeddings are taken from the last layer of BERT’s model before and after fine-tuning, resulting in two sets of 20 embeddings each. We cluster each set of embeddings using Hartigan’s K-Means (Hartigan 1975; Slonim et al. 2005), and calculate the adjusted mutual information (AMI) of the clusters with respect to the low and high quality bins. When clustering the embeddings retrieved from BERT’s pre-trained model, the AMI is 0.08, a low result which is expected given that BERT’s language model should not have any preference for argument quality. However, when clustering the embeddings extracted from BERT-FT, the AMI reaches 0.31, indicating that these representations absorb the qualitative nature of the arguments they come from. We use t-SNE (van der Maaten and Hinton 2008) to visualize these embeddings in a 2D space (Figure 3).
Before fine-tuning, the embeddings are clustered together, whereas after fine-tuning, the embeddings are partially separated by the quality of the arguments, further indicating that the quality contributes to the contextual representation of this token.
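To make the AMI comparison concrete, the following is a minimal sketch of adjusted mutual information between cluster assignments and quality bins, using only the standard library. It assumes the cluster labels are already available (the clustering step itself is not shown), uses the common hypergeometric model for the expected mutual information, and normalizes by the arithmetic mean of the two entropies; that normalization is an assumption, as other variants exist. The function names and toy labels are ours.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in nats) of a label assignment."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(u, v):
    """Mutual information (in nats) between two label assignments."""
    n = len(u)
    count_u, count_v, joint = Counter(u), Counter(v), Counter(zip(u, v))
    return sum((nij / n) * math.log(n * nij / (count_u[a] * count_v[b]))
               for (a, b), nij in joint.items())


def expected_mutual_information(u, v):
    """Expected MI of random assignments with the same cluster sizes."""
    n = len(u)
    emi = 0.0
    for a in Counter(u).values():
        for b in Counter(v).values():
            # Sum over all feasible overlaps between a cluster of size a and one of size b.
            for nij in range(max(1, a + b - n), min(a, b) + 1):
                prob = math.comb(b, nij) * math.comb(n - b, a - nij) / math.comb(n, a)
                emi += prob * (nij / n) * math.log(n * nij / (a * b))
    return emi


def adjusted_mutual_information(u, v):
    """AMI with arithmetic-mean entropy normalization (assumed variant)."""
    mi = mutual_information(u, v)
    emi = expected_mutual_information(u, v)
    mean_entropy = (entropy(u) + entropy(v)) / 2
    return (mi - emi) / (mean_entropy - emi)


# Toy labels: 10 low-quality (0) and 10 high-quality (1) arguments.
quality_bins = [0] * 10 + [1] * 10
clusters_matching_bins = [1] * 10 + [0] * 10   # cluster ids differ, partition is identical
ami_perfect = adjusted_mutual_information(quality_bins, clusters_matching_bins)
```

A clustering that reproduces the low/high split exactly yields an AMI of 1 regardless of how the clusters are named, while a clustering unrelated to quality yields a value near or below 0.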
Figure 3: A 2D t-SNE projection of the embeddings of the term people taken from a sample of high (x) and low (o) quality arguments from the dev and test sets of the IBM-Rank-30k dataset. Left - using BERT’s pre-trained model. Right - after fine-tuning.
In the rapidly expanding field of computational argumentation, argument quality is an increasingly prominent issue with substantial implications. To advance the development of argument quality ranking models, the creation of new datasets is a critical step. In this work, we present a novel argument dataset labeled for point-wise quality, IBM-Rank-30k, containing 30,497 arguments. To the best of our knowledge, this dataset is the largest to include point-wise quality labels, 5 times larger than previously released datasets. We follow Toledo et al. (2019) by collecting the arguments actively, while employing elaborate annotation control measures. A practical question, overlooked in previous datasets, is how to induce continuous labels from binary annotations. We address this issue by conducting an extensive comparison of two common approaches and analyzing their appropriateness for our dataset. We also apply this dataset to the task of argument quality ranking, presenting a BERT-based neural method which outperforms several baselines. We show this method is capable of achieving promising results on other datasets as well. We believe that the approach to argument collection, the analysis of different labeling scores, as well as the sheer size of the dataset, make it useful for further advancements in this field.
To provide insight into the characteristics of IBM-Rank-30k, we conducted an analysis of quality dimensions, showing that the dimensions of Global Relevance and Effectiveness are the most indicative of overall quality scores. As future work, we would like to explore this further, by investigating how quality dimensions impact overall quality, and whether a prediction model can capture these dimensions effectively.
We thank Eyal Shnarch, Leshem Choshen, and Elad Venezian for their insightful comments and suggestions.
Aristotle; Kennedy, G.; and Kennedy, G. 1991. On Rhetoric: A Theory of Civic Discourse. Oxford: Oxford University Press.
Bar-Haim, R.; Bhattacharya, I.; Dinuzzo, F.; Saha, A.; and Slonim, N. 2017. Stance classification of context-dependent claims. In EACL 2017, 251–261. Valencia, Spain: Association for Computational Linguistics.
Bench-Capon, T.; Atkinson, K.; and McBurney, P. 2009. Altruism and agents: An argumentation based approach to designing agent decision mechanisms. In AAMAS 2009, 1073–1080. Richland, SC: International Foundation for Autonomous Agents and Multiagent Systems.
Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2018. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR abs/1810.04805.
Durmus, E.; Ladhak, F.; and Cardie, C. 2019. The role of pragmatic and discourse context in determining argument impact. In EMNLP-IJCNLP 2019, 5672–5682. Hong Kong, China: Association for Computational Linguistics.
Ein-Dor, L.; Dankin, L.; Shnarch, E.; Halfon, A.; Sznajder, B.; Gera, A.; Alzate, C.; Gleize, M.; Choshen, L.; Hou, Y.; Bilu, Y.; Aharonov, R.; and Slonim, N. 2019. Corpus wide argument mining - a working solution. Forthcoming.
Gleize, M.; Shnarch, E.; Choshen, L.; Dankin, L.; Moshkowich, G.; Aharonov, R.; and Slonim, N. 2019. Are you convinced? choosing the more convincing evidence with a Siamese network. In ACL 2019, 967–976. Florence, Italy: Association for Computational Linguistics.
Graham, Y., and Baldwin, T. 2014. Testing for significance of increased correlation with human judgment. In EMNLP 2014, 172–176. Doha, Qatar: Association for Computational Linguistics.
Graham, Y. 2015. Improving evaluation of machine translation quality estimation. In ACL-IJCNLP 2015, 1804–1813. Beijing, China: Association for Computational Linguistics.
Habernal, I., and Gurevych, I. 2016. Which argument is more convincing? analyzing and predicting convincingness of web arguments using bidirectional lstm. In ACL 2016, 1589–1599. Berlin, Germany: Association for Computational Linguistics.
Habernal, I.; Wachsmuth, H.; Gurevych, I.; and Stein, B. 2018. The argument reasoning comprehension task: Identification and reconstruction of implicit warrants. In NAACL-HLT 2018, 1930–1940. New Orleans, Louisiana: Association for Computational Linguistics.
Hartigan, J. A. 1975. Clustering Algorithms. New York, NY, USA: John Wiley & Sons, Inc., 99th edition.
Hovy, D.; Berg-Kirkpatrick, T.; Vaswani, A.; and Hovy, E. 2013. Learning whom to trust with MACE. In NAACL-HLT 2013, 1120–1130. Atlanta, Georgia: Association for Computational Linguistics.
Levy, R.; Bogin, B.; Gretz, S.; Aharonov, R.; and Slonim, N. 2018. Towards an argumentative content search engine using weak supervision. In COLING 2018, 2066–2081. Santa Fe, New Mexico, USA: Association for Computational Linguistics.
Lippi, M., and Torroni, P. 2016. Argumentation mining: State of the art and emerging trends. ACM Trans. Internet Technol. 16(2):10:1–10:25.
Pennington, J.; Socher, R.; and Manning, C. 2014. Glove: Global vectors for word representation. In EMNLP 2014, 1532–1543. Doha, Qatar: Association for Computational Linguistics.
Perelman, C., and Olbrechts-Tyteca, L. 1969. The New Rhetoric: A Treatise on Argumentation. Notre Dame, Indiana: University of Notre Dame Press.
Peters, M.; Neumann, M.; Iyyer, M.; Gardner, M.; Clark, C.; Lee, K.; and Zettlemoyer, L. 2018. Deep contextualized word representations. In NAACL-HLT 2018, 2227–2237. New Orleans, Louisiana: Association for Computational Linguistics.
Potash, P.; Bhattacharya, R.; and Rumshisky, A. 2017. Length, interchangeability, and external knowledge: Observations from predicting argument convincingness. In IJCNLP 2017, 342–351. Taipei, Taiwan: Asian Federation of Natural Language Processing.
Potash, P.; Ferguson, A.; and Hazen, T. J. 2019. Ranking passages for argument convincingness. In ArgMining@ACL 2019, 146–155. Florence, Italy: Association for Computational Linguistics.
Reed, C. 2016. Proceedings of the third workshop on argument mining (argmining2016). In ArgMining@ACL 2016. Berlin, Germany: Association for Computational Linguistics.
Reimers, N.; Schiller, B.; Beck, T.; Daxenberger, J.; Stab, C.; and Gurevych, I. 2019. Classification and clustering of arguments with contextualized word embeddings. CoRR abs/1906.09821.
Simpson, E. D., and Gurevych, I. 2018. Finding convincing arguments using scalable bayesian preference learning. TACL 6:357–371.
Slonim, N.; Atwal, G. S.; Tkačik, G.; and Bialek, W. 2005. Information-based clustering. Proceedings of the National Academy of Sciences 102(51):18297–18302.
Stab, C., and Gurevych, I. 2014. Annotating argument components and relations in persuasive essays. In COLING 2014, 1501–1510. Dublin, Ireland: Dublin City University and Association for Computational Linguistics.
Steiger, J. H. 1980. Tests for comparing elements of a correlation matrix. Psychological Bulletin 87(2):245.
Swanson, R.; Ecker, B.; and Walker, M. 2015. Argument mining: Extracting arguments from online dialogue. In SIGDIAL 2015, 217–226. Prague, Czech Republic: Association for Computational Linguistics.
Toledo, A.; Gretz, S.; Cohen-Karlik, E.; Friedman, R.; Venezian, E.; Lahav, D.; Jacovi, M.; Aharonov, R.; and Slonim, N. 2019. Automatic argument quality assessment - new datasets and methods. In EMNLP-IJCNLP 2019, 5629–5639. Hong Kong, China: Association for Computational Linguistics.
van der Maaten, L., and Hinton, G. 2008. Visualizing data using t-SNE. Journal of Machine Learning Research 9:2579–2605.
Wachsmuth, H.; Naderi, N.; Hou, Y.; Bilu, Y.; Prabhakaran, V.; Thijm, T. A.; Hirst, G.; and Stein, B. 2017a. Computational argumentation quality assessment in natural language. In EACL 2017, 176–187. Valencia, Spain: Association for Computational Linguistics.
Wachsmuth, H.; Potthast, M.; Al-Khatib, K.; Ajjour, Y.; Puschmann, J.; Qu, J.; Dorsch, J.; Morari, V.; Bevendorff, J.; and Stein, B. 2017b. Building an argument search engine for the web. In Argmining@EMNLP 2017, 49–59. Copenhagen, Denmark: Association for Computational Linguistics.
Walton, D.; Reed, C.; and Macagno, F. 2008. Argumentation Schemes. Cambridge University Press.
Williams, E. J. 1959. Regression Analysis, volume 14. Wiley New York.