Argumentation is a verbal activity which aims at increasing or decreasing the acceptability of a controversial standpoint (van Eemeren, Grootendorst, and Snoeck Henkemans 1996, p. 5). It is a routine which is omnipresent in our daily verbal communication and thinking. Well-reasoned arguments are not only important for decision making and learning but also play a crucial role in drawing widely-accepted conclusions.
Computational argumentation is a recent research field in computational linguistics that focuses on the analysis of arguments in natural language texts. Novel methods have broad application potential in various areas like legal decision support (Mochales-Palau and Moens 2009), information retrieval (Carstens and Toni 2015), policy making (Sardianos et al. 2015), and debating technologies (Levy et al. 2014; Rinott et al. 2015). Recently, computational argumentation has been receiving increased attention in computer-assisted writing (Song et al. 2014; Stab and Gurevych 2014b) since it allows the creation of writing support systems that provide feedback about written arguments.
Argumentation structures are closely related to discourse structures such as defined by rhetorical structure theory (RST) (Mann and Thompson 1987), Penn discourse treebank (PDTB) (Prasad et al. 2008), or segmented discourse representation theory (SDRT) (Asher and Lascarides 2003). The internal structure of an argument consists of several argument components. It includes a claim and one or more premises (Govier 2010). The claim is a controversial statement and the central component of an argument, while premises are reasons for justifying (or refuting) the claim. Moreover, arguments have directed argumentative relations, describing the relationships one component has with another. Each such relation indicates that the source component is either a justification for or a refutation of the target component.
The identification of argumentation structures involves several subtasks like separating argumentative from non-argumentative text units (Moens et al. 2007; Florou et al. 2013), classifying argument components into claims and premises (Mochales-Palau and Moens 2011; Rooney, Wang, and Browne 2012; Stab and Gurevych 2014b), and identifying argumentative relations (Mochales-Palau and Moens 2009; Peldszus 2014; Stab and Gurevych 2014b). However, an approach which covers all subtasks is still missing. Furthermore, most approaches operate locally and do not optimize the global argumentation structure. Recently, Peldszus and Stede (2015) proposed an approach based on minimum spanning trees (MST) which jointly models argumentation structures. However, it links all argument components in a single tree structure. Consequently, it is not capable of separating several arguments and recognizing unlinked argument components (e.g. unsupported claims). In addition to the lack of end-to-end approaches for parsing argumentation structures, there are relatively few corpora annotated with argumentation structures at the discourse-level. Apart from our previous corpus (Stab and Gurevych 2014a), the few existing corpora lack non-argumentative text units (Peldszus 2014), contain text genres different from our target domain (Kirschner, Eckle-Kohler, and Gurevych 2015), or the reliability is unknown (Reed et al. 2008).
Our primary motivation for this work is to create argument analysis methods for argumentative writing support systems and to achieve a better understanding of argumentation structures. Therefore, our first research question is whether human annotators can reliably identify argumentation structures in persuasive essays and if it is possible to create annotated data of high quality. The second research question addresses the automatic recognition of argumentation structure. We investigate if, and how accurately, argumentation structures can be identified by computational techniques. The contributions of this article are the following:
• An annotation scheme for modeling argumentation structures derived from argumentation theory. Our annotation scheme models the argumentation structure of a document as a connected tree.
• A novel corpus of 402 persuasive essays annotated with discourse-level argumentation structures. We show that human annotators can apply our annotation scheme to persuasive essays with substantial agreement.
• An end-to-end argumentation structure parser which identifies argument components at the token level and globally optimizes component types and argumentative relations.
The remainder of this article is structured as follows: In Section 2, we review related work in computational argumentation and discuss the difference to traditional discourse analysis. In Section 3, we derive our annotation scheme from argumentation theory. Section 4 presents the results of an annotation study and the corpus creation. In Section 5, we introduce the argumentation structure parser. We show that our model considerably improves the performance of base classifiers and significantly outperforms challenging heuristic baselines. We conclude the article with a discussion in Section 6.
Existing work in computational argumentation addresses a variety of different tasks. These include, for example, approaches for identifying reasoning type (Feng and Hirst 2011), argumentation style (Oraby et al. 2015), the stance of the author (Somasundaran and Wiebe 2009; Hasan and Ng 2014), the acceptability of arguments (Cabrio and Villata 2012), and appropriate support types (Park and Cardie 2014). Most relevant to our work, however, are approaches on argument mining that focus on the identification of argumentation structures in natural language texts. We categorize related approaches into the following three subtasks:
• Component identification focuses on the separation of argumentative from non-argumentative text units and the identification of argument component boundaries.
• Component classification addresses the function of argument components. It aims at classifying argument components into different types such as claims and premises.
• Structure identification focuses on linking arguments or argument components. Its objective is to recognize different types of argumentative relations such as support or attack relations.
2.1 Component Identification
Moens et al. (2007) identified argumentative sentences in various types of text such as newspapers, parliamentary records and online discussions. They experimented with various different features and achieved an accuracy of .738 with word pairs, text statistics, verbs and keyword features. Florou et al. (2013) classified text segments as argumentative or non-argumentative using discourse markers and several features extracted from the tense and mood of verbs. They report an F1 score of .764. Levy et al. (2014) proposed a pipeline including three consecutive steps for identifying context-dependent claims in Wikipedia articles. Their first component detects topic-relevant sentences including a claim. The second component detects the boundaries of each claim. The third component ranks the identified claims for identifying the most relevant claims for the given topic. Goudas et al. (2014) presented a two-step approach for identifying argument components and their boundaries in social media texts. First, they classified each sentence as argumentative or non-argumentative and achieved an accuracy of .774. Second, they segmented each argumentative sentence using a conditional random field (CRF). Their best model achieved an accuracy of .424.
2.2 Component Classification
The objective of the component classification task is to identify the type of argument components. Kwon et al. (2007) proposed two consecutive steps for identifying different types of claims in online comments. First, they classified sentences as claims and obtained an F1 score of .55 with a boosting algorithm. Second, they classified each claim as either support, oppose or propose. Their best model achieved an F1 score of .67. Rooney, Wang, and Browne (2012) applied kernel methods for classifying text units as either claims, premises or non-argumentative. They obtained an accuracy of .65. Mochales-Palau and Moens (2011) classified sentences in legal decisions as claim or premise. They achieved an F1 score of .741 for claims and .681 for premises using a support vector machine (SVM) with domain-dependent key phrases, text statistics, verbs, and the tense of the sentence. In our previous work, we used a multiclass SVM for labeling text units of student essays as major claim, claim, premise, or non-argumentative (Stab and Gurevych 2014b). We obtained an accuracy of .773 using structural, lexical, syntactic, indicator and contextual features. Recently, Nguyen and Litman (2015) found that argument and domain words from unlabeled data increase accuracy to .79 using the same corpus, and Lippi and Torroni (2015) achieved promising results using partial tree kernels for identifying sentences containing a claim.
2.3 Structure Identification
Approaches on structure identification can be divided into macro-level approaches and micro-level approaches. Macro-level approaches such as those presented by Cabrio and Villata (2012), Ghosh et al. (2014), or Boltužić and Šnajder (2014) address relations between complete arguments and ignore the microstructure of arguments. More relevant to our work, however, are micro-level approaches, which focus on relations between argument components. Mochales-Palau and Moens (2009) introduced one of the first approaches for identifying the microstructure of arguments. Their approach is based on a manually created context-free grammar (CFG) and recognizes argument structures as trees. However, it is tailored to legal argumentation and does not recognize implicit argumentative relations, i.e. relations which are not indicated by discourse markers. In previous work, we defined the task as the binary classification of ordered argument component pairs (Stab and Gurevych 2014b). We classified each pair as support or not-linked using an SVM with structural, lexical, syntactic and indicator features. Our best model achieved an F1 score of .722. However, the approach recognizes argumentative relations locally and does not consider contextual information. Peldszus (2014) modeled the targets of argumentative relations along with additional information in a single tagset. His tagset includes, for instance, several labels denoting if an argument component at position n is argumentatively related to preceding argument components n − 1, n − 2, etc. or following argument components n + 1, n + 2, etc. Although his approach achieved a promising accuracy of .48, it is only applicable to short texts. Peldszus and Stede (2015) presented the first approach which globally optimizes argumentative relations. They jointly modeled several aspects of argumentation structures using an MST model and achieved an F1 score of .720. They found that the function (support or attack) and the role (opponent and proponent) of argument components are the most useful dimensions for improving the identification of argumentative relations. Their corpus, however, is artificially created and includes a comparatively high proportion of opposing argument components (cf. Section 2.4). Therefore, it is unclear whether the results can be reproduced with real data. Moreover, their approach links all argument components in a single tree structure. Thus, it is not capable of separating several arguments and recognizing unlinked components.
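The pairwise formulation mentioned above (binary classification of ordered component pairs) can be illustrated with a minimal sketch. The function name `ordered_pairs` and the component identifiers are hypothetical and are not taken from any of the cited systems; the sketch only enumerates candidate pairs with their gold labels and leaves out feature extraction and the classifier itself.

```python
from itertools import permutations


def ordered_pairs(components, support_links):
    """Enumerate all ordered (source, target) pairs with a binary label.

    components: list of component identifiers (hypothetical names).
    support_links: iterable of (source, target) pairs that are linked.
    """
    linked = set(support_links)
    pairs = []
    for source, target in permutations(components, 2):
        label = "support" if (source, target) in linked else "not-linked"
        pairs.append(((source, target), label))
    return pairs


# Three components yield 3 * 2 = 6 ordered pairs, two of which are linked.
example = ordered_pairs(["p1", "p2", "c"], [("p1", "c"), ("p2", "p1")])
```

Because every pair is labeled independently, nothing in this setup prevents a local classifier from predicting several targets for the same source, which is one reason why globally optimized models are of interest.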
2.4 Existing Corpora Annotated with Argumentation Structures
Existing corpora in computational argumentation cover numerous aspects of argumentation analysis. There are, for instance, corpora which address argumentation strength (Persing and Ng 2015), factual knowledge (Beigman Klebanov and Higgins 2012), various properties of arguments (Walker et al. 2012), argumentative relations between complete arguments at the macro-level (Cabrio and Villata 2014; Boltužić and Šnajder 2014), different types of argument components (Mochales-Palau and Ieven 2009; Kwon et al. 2007; Habernal and Gurevych 2016), and argumentation structures over several documents (Aharoni et al. 2014). However, corpora annotated with argumentation structures at the level of discourse are still rare.
One prominent resource is AraucariaDB (Reed et al. 2008). It includes heterogeneous text types such as newspaper editorials, parliamentary records, judicial summaries and online discussions. It also includes annotations for the reasoning type and implicit argument components, which were added by the annotators during the analysis. However, the reliability of the annotations is unknown.
Kirschner, Eckle-Kohler, and Gurevych (2015) annotated argumentation structures in introduction and discussion sections of 24 German scientific articles. Their annotation scheme includes four argumentative relations (support, attack, detail and sequence). However, the corpus does not include annotations for argument component types.
Peldszus and Stede (2015) created a small corpus of 112 German microtexts with controlled linguistic and rhetoric complexity. Each document includes a single argument and does not include more than five argument components. Their annotation scheme models supporting and attacking relations as well as additional information like proponent and opponent. They obtained an inter-annotator agreement (IAA) of with three expert annotators. Recently, they translated the corpus to English resulting in the first parallel corpus for computational argumentation. However, the corpus does not include non-argumentative text units. Therefore, the corpus is only of limited use for training end-to-end argumentation structure parsers. Due to the employed writing guidelines (Peldszus and Stede 2013, p. 197), it also exhibits an unusually high proportion of attack relations. In particular, 97 of the 112 arguments (86.6%) include at least one attack relation.
Table 1 Existing corpora annotated with argumentation structures at the discourse-level (#Doc = number of documents; #Comp = number of argument components; NoArg = presence of non-argumentative text units; *Recent releases do not include non-argumentative text units).
In previous work, we created a corpus of 90 persuasive essays, which we selected randomly from essayforum.com (Stab and Gurevych 2014a). We annotated the corpus in two consecutive steps: First, we identified argument components at the clause level and obtained an inter-annotator agreement of between three annotators. Second, we annotated argumentative support and attack relations between argument components and achieved an inter-annotator agreement of
. In contrast to the microtext corpus from Peldszus, the corpus includes non-argumentative text units and exhibits a more realistic proportion of argumentative attack relations since the essays were not written in a controlled experiment. Apart from this corpus, we are only aware of one additional study on argumentation structures in persuasive essays. Botley (2014) analyzed 10 essays using argument diagramming for studying differences in argumentation strategies. Unfortunately, the corpus is too small for computational purposes and the reliability of the annotations is unknown. Table 1 provides an overview of existing corpora annotated with argumentation structures at the discourse-level.
2.5 Discourse Analysis
The identification of argumentation structures is closely related to discourse analysis. Similar to the identification of argumentation structures, discourse analysis aims at identifying elementary discourse units and discourse relations between them. Existing approaches on discourse analysis mainly differ in the employed discourse theory. RST (Mann and Thompson 1987), for instance, models discourse structures as trees by iteratively linking adjacent discourse units (Feng and Hirst 2014; Hernault et al. 2010) while approaches based on PDTB (Prasad et al. 2008) identify more shallow structures by linking two adjacent sentences or clauses (Lin, Ng, and Kan 2014). Whereas RST and PDTB are limited to discourse relations between adjacent discourse units, SDRT (Asher and Lascarides 2003) also allows long distance relations (Afantenos and Asher 2014; Afantenos et al. 2015). However, similar to argumentation structure parsing, the main challenge of discourse analysis is to identify implicit discourse relations (Braud and Denis 2014, p. 1694).
Marcu and Echihabi (2002) proposed one of the first approaches for identifying implicit discourse relations. In order to collect large amounts of training data, they exploited several discourse markers like “because” or “but”. After removing the discourse markers, they found that word pair features are useful for identifying implicit discourse relations. Pitler, Louis, and Nenkova (2009) proposed an approach for identifying four implicit types of discourse relations in the PDTB and achieved F1 scores between .22 and .76. They found that using features tailored to each individual relation leads to the best results. Lin, Kan, and Ng (2009) showed that production rules collected from parse trees yield good results and Louis et al. (2010) found that features based on named entities do not perform as well as lexical features.
Approaches to discourse analysis usually aim at identifying various different types of discourse relations. However, only a subset of these relations is relevant for argumentation structure parsing. For example, Peldszus and Stede (2013) proposed support, attack and counter-attack relations for modeling argumentation structures, whereas our work focuses on support and attack relations. This difference is also illustrated by the work of Biran and Rambow (2011). They selected a subset of 12 relations from the RST discourse treebank (Carlson, Marcu, and Okurowski 2001) and argue that only a subset of RST relations is relevant for identifying justifications.
The study of argumentation is a comprehensive and interdisciplinary research field. It involves philosophy, communication science, logic, linguistics, psychology, and computer science. The first approaches to studying argumentation date back to the ancient Greek sophists and evolved in the 6th and 5th centuries B.C. (van Eemeren, Grootendorst, and Snoeck Henkemans 1996). In particular, the influential works of Aristotle on traditional logic, rhetoric, and dialectics set an important milestone and are a cornerstone of modern argumentation theory. Due to the diversity of the field, there are numerous proposals for modeling argumentation. Bentahar, Moulin, and Bélanger (2010) categorize argumentation models into three types: (i) monological models, (ii) dialogical models, and (iii) rhetorical models. Monological models address the internal microstructure of arguments. They focus on the function of argument components, the links between them, and the reasoning type. Most monological models stem from the field of informal logic and focus on arguments as product (Johnson 2000; O’Keefe 1977). On the other hand, dialogical models focus on the process of argumentation and ignore the microstructure of arguments. They model the external macrostructure and address relations between arguments in dialogical communications. Finally, rhetorical models consider neither the micro- nor the macrostructure but rather the way arguments are used as a means of persuasion. They consider the audience’s perception and aim at studying rhetorical schemes that are successful in practice. In this article, we focus on the monological perspective which is well-suited for developing computational methods (Peldszus and Stede 2013; Lippi and Torroni 2016).
3.1 Argument Diagramming
The laying out of argument structure is a widely used method in informal logic (Copi and Cohen 1990; Govier 2010). This technique, referred to as argument diagramming, aims at transferring natural language arguments into a structured representation for evaluating them in subsequent analysis steps (Henkemans 2000, p. 447). Although argumentation theorists consider argument diagramming a manual activity, the diagramming conventions also serve as a good foundation for developing novel argument mining models (Peldszus and Stede 2013). An argument diagram is a node-link diagram whereby each node represents an argument component, i.e. a statement represented in natural language, and each link represents a directed argumentative relation indicating that the source component is a justification (or refutation) of the target component. For example, Figure 1 shows some common argument structures. A basic argument includes a claim supported by a single premise. It can be considered the minimal form an argument can take. A convergent argument comprises two premises that support the claim individually; an argument is serial if it includes a reasoning chain and divergent if a single premise supports several claims (Beardsley 1950). Complementarily, Thomas (1973) defined linked arguments (Figure 1e). Like convergent arguments, a linked argument includes two premises. However, neither of the two premises independently supports the claim. The premises are only relevant to the claim in conjunction. More complex arguments can combine any of these elementary structures illustrated in Figure 1.
Figure 1 Microstructures of arguments: nodes are argument components and links represent argumentative relations. Nodes at the bottom are the claims of the arguments.
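To make the elementary structures concrete, the following minimal sketch names the structure formed by a small set of directed links. The function `classify_structure` and the identifiers are hypothetical illustrations, not part of the annotation scheme. Note that plain (source, target) links cannot distinguish linked from convergent arguments, since that distinction depends on whether the premises are only relevant in conjunction.

```python
from collections import defaultdict


def classify_structure(relations):
    """Name the elementary structure formed by (source, target) links."""
    if not relations:
        raise ValueError("an argument needs at least one relation")
    incoming = defaultdict(list)  # target -> list of sources
    outgoing = defaultdict(list)  # source -> list of targets
    for source, target in relations:
        incoming[target].append(source)
        outgoing[source].append(target)

    # A component with several targets makes the argument divergent.
    if any(len(targets) > 1 for targets in outgoing.values()):
        return "divergent"
    # A component that is both a target and a source forms a reasoning chain.
    if any(node in outgoing for node in incoming):
        return "serial"
    # Several premises pointing at the same component individually.
    if any(len(sources) > 1 for sources in incoming.values()):
        return "convergent"
    return "basic"
```

For instance, `classify_structure([("p2", "p1"), ("p1", "c")])` describes a chain in which one premise supports another premise, which in turn supports the claim.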
On closer inspection, however, there are several ambiguities when applying argument diagramming to real texts: First, the distinction between convergent and linked structures is often ambiguous in real argumentation structures (Henkemans 2000; Freeman 2011). Second, it is unclear if the argumentation structure is a graph or a tree. Third, the argumentative type of argument components is ambiguous in serial structures. We discuss each of these questions in the following sections.
3.1.1 Distinguishing between Linked and Convergent Arguments. The question of whether an argumentation model needs to distinguish between linked and convergent arguments is still debated in argumentation theory (van Eemeren, Grootendorst, and Snoeck Henkemans 1996; Freeman 2011; Yanal 1991; Conway 1991). From a perspective based on traditional logic, linked arguments indicate deductive reasoning and convergent arguments represent inductive reasoning (Henkemans 2000, p. 453). However, Freeman (2011, p. 91ff.) showed that the traditional definition of linked arguments is frequently ambiguous in everyday discourse. Yanal (1991) argues that the distinction is equivalent to separating several arguments and Conway (1991) argues that linked structures can simply be omitted for modeling single arguments. From a computational perspective, the identification of linked arguments is equivalent to finding groups of premises or classifying the reasoning type of an argument as either deductive or inductive. Accordingly, it is not necessary to distinguish linked and convergent arguments during the identification of argumentation structures since this task can be solved in subsequent analysis steps.
3.1.2 Argumentation Structures as Trees. Defining argumentation structures as trees implies excluding divergent arguments, allowing only one target for each premise, and neglecting cycles. From a theoretical perspective, divergent structures are equivalent to several arguments (one for each claim) (Freeman 2011, p. 16). As a result of this treatment, a great many theoretical textbooks neglect divergent structures (Henkemans 2000; Reed and Rowe 2004), and most computational approaches also consider arguments as trees (Mochales-Palau and Moens 2009; Cohen 1987; Peldszus 2014). However, there is little empirical evidence regarding the structure of arguments. We are only aware of one study which showed that 5.26% of the arguments in political speeches (which can be assumed to exhibit complex argumentation structures) are divergent.
Essay writing usually follows a “claim-oriented” procedure (Whitaker 2009; Shiach 2009; Perutz 2010; Kemper and Sebranek 2004). Starting with the formulation of the standpoint on the topic, authors collect claims in support (or opposition) of their view. Subsequently, they collect premises that support or attack their claims. The following example illustrates this procedure. A major claim on abortion, for instance, is “abortion should be illegal”; a supporting claim could be “abortion is ethically wrong” and the associated premises “unborn babies are considered human beings” and “killing human beings is wrong”. Due to this common writing procedure, divergent and circular structures are rather unlikely in persuasive essays. Therefore, we assume that modeling the argumentation structure of essays as a tree is a reasonable decision.
3.1.3 Argumentation Structures and Argument Component Types. Assigning argumentative types to the components of an argument is unambiguous if the argumentation structure is shallow. It is, for instance, obvious that an argument component c1 is a premise and argument component c2 is a claim, if c1 supports c2 in a basic argument (cf. Figure 1). However, if the tree structure is deeper, i.e. exhibits serial structures, assigning argumentative types becomes ambiguous. Essentially, there are three different approaches for assigning argumentative types to argument components. First, according to Beardsley (1950) a serial argument includes one argument component which is both a claim and a premise. Therefore, the inner argument component bears two different argumentative types (multi-label approach). Second, Govier (2010, p. 24) distinguishes between “main claim” and “subclaim”. Similarly, Damer (2009, p. 17) distinguishes between “premise” and “subpremise” for labeling argument components in serial structures. Both approaches define specific labels for each level in the argumentation structure (level approach). Third, Cohen (1987) considers only the root node of an argumentation tree as a claim and the following nodes in the structure as premises (“one-claim” approach). In order to define an argumentation model for persuasive essays, we propose a hybrid approach that combines the level approach and the “one-claim” approach.
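The difference between the multi-label and the “one-claim” approach can be sketched on a serial argument. The function `assign_types`, the `targets` mapping, and the identifiers are hypothetical and serve only as an illustration; the level approach is omitted because its labels differ between authors.

```python
def assign_types(targets, approach):
    """Assign argumentative types to the components of one argument tree.

    targets: dict mapping each component to the component it supports;
             the root maps to None (identifiers are hypothetical).
    approach: "one-claim" or "multi-label".
    """
    if approach not in ("one-claim", "multi-label"):
        raise ValueError(f"unknown approach: {approach}")
    # Components that are the target of at least one relation.
    has_incoming = {t for t in targets.values() if t is not None}
    types = {}
    for component, target in targets.items():
        if target is None:
            types[component] = {"claim"}  # root of the tree
        elif approach == "multi-label" and component in has_incoming:
            types[component] = {"claim", "premise"}  # inner node of a chain
        else:
            types[component] = {"premise"}
    return types


# Serial argument: p2 supports p1, and p1 supports the claim c.
serial = {"c": None, "p1": "c", "p2": "p1"}
```

Under the “one-claim” reading, `p1` is simply a premise and its double function is recoverable from the tree, whereas the multi-label reading stores both types on the node.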
3.2 Argumentation Structures in Persuasive Essays
We model the argumentation structure of persuasive essays as a connected tree structure. We use a level approach for modeling the first level of the tree and a “one-claim” approach for representing the structure of each individual argument. Accordingly, we model the first level of the tree with two different argument component types and the structure of individual arguments with argumentative relations.
The major claim is the root node of the argumentation structure and represents the author’s standpoint on the topic. It is an opinionated statement that is usually stated in the introduction and restated in the conclusion of the essay. The individual body paragraphs of an essay include the actual arguments. They either support or attack the author’s standpoint expressed in the major claim. Each argument consists of a claim and several premises. In order to differentiate between supporting and attacking arguments, each claim has a stance attribute that can take the values “for” or “against”.
We model the structure of each argument with a “one-claim” approach. The claim constitutes the central component of each argument. The premises are the reasons of the argument. The actual structure of an argument comprises directed argumentative support and attack relations, which link a premise either to a claim or to another premise (serial arguments). Each premise p has one outgoing relation, i.e. there is a relation that has p as source component, and none or several incoming relations, i.e. there can be a relation with p as target component. A claim can exhibit several incoming relations but no outgoing relation. The ambiguous function of inner premises in serial arguments is implicitly modeled by the structure of the argument. The inner premise exhibits one outgoing relation and at least one incoming relation. Finally, the stance of each premise is indicated by the type of its outgoing relation (support or attack).
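The structural constraints on a single argument can be checked mechanically. The following sketch is an illustration under assumptions: the `Argument` class, its field names, and the `validate` function are hypothetical and not part of the annotation scheme. Storing relations in a dictionary keyed by the source component means that each premise has at most one outgoing relation by construction.

```python
from dataclasses import dataclass


@dataclass
class Argument:
    claim: str       # identifier of the claim
    stance: str      # "for" or "against" the major claim
    premises: list   # identifiers of the premises
    relations: dict  # source -> (target, "support" | "attack")


def validate(argument):
    """Return a list of violations of the main constraints stated above."""
    errors = []
    if argument.stance not in ("for", "against"):
        errors.append(f"invalid stance: {argument.stance}")
    components = set(argument.premises) | {argument.claim}
    for source, (target, kind) in argument.relations.items():
        if source == argument.claim:
            errors.append("the claim must not have an outgoing relation")
        if target not in components:
            errors.append(f"{source} targets unknown component {target}")
        if kind not in ("support", "attack"):
            errors.append(f"{source} has invalid relation type {kind}")
    # Following the outgoing relation of each premise must end at the claim;
    # otherwise the premise is unlinked or part of a cycle.
    for premise in argument.premises:
        node, seen = premise, set()
        while node != argument.claim:
            if node in seen or node not in argument.relations:
                errors.append(f"{premise} does not reach the claim")
                break
            seen.add(node)
            node = argument.relations[node][0]
    return errors


serial = Argument("c", "for", ["p1", "p2", "p3"],
                  {"p2": ("p1", "support"), "p1": ("c", "support"),
                   "p3": ("c", "attack")})
cyclic = Argument("c", "for", ["p1", "p2"],
                  {"p1": ("p2", "support"), "p2": ("p1", "support")})
```

Here `validate(serial)` returns an empty list, while `validate(cyclic)` reports both premises because neither of them reaches the claim.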
The following example illustrates the argumentation structure of a persuasive essay. The introduction of an essay describes the controversial topic and usually includes the major claim:
Ever since researchers at the Roslin Institute in Edinburgh cloned an adult sheep, there has been an ongoing debate about whether cloning technology is morally and ethically right or not. Some people argue for and others against and there is still no agreement whether cloning technology should be permitted. However, as far as I’m concerned, [cloning is an important technology for humankind]MajorClaim1 since [it would be very useful for developing novel cures]Claim1.
The first two sentences introduce the topic and do not include argumentative content. The third sentence contains the major claim (boldfaced) and a claim which supports the major claim (underlined). The following body paragraphs of the essay include arguments which either support or attack the major claim. For example, the following body paragraph includes one argument that supports the positive standpoint of the author on cloning:
First, [cloning will be beneficial for many people who are in need of organ transplants]Claim2. [Cloned organs will match perfectly the blood group and tissue of patients]Premise1 since [they can be raised from cloned stem cells of the patient]Premise2. In addition, [it shortens the healing process]Premise3. Usually, [it is very rare to find an appropriate organ donor]Premise4 and [using cloning in order to raise required organs
The first sentence contains the claim of the argument, which is supported by five premises in the following three sentences (wavy underlined). The second sentence includes two premises, of which Premise1 supports Claim2 and Premise2 supports Premise1. Premise3 in the third sentence supports Claim2. The fourth sentence includes Premise4 and Premise5. Both support Premise3. The next paragraph illustrates a body paragraph with two arguments:
Second, [scientists use animals as models in order to learn about human diseases]Premise6 and therefore [cloning animals enables novel developments in science]Claim3. Furthermore, [infertile couples can have children that are genetically related]Premise7. [Even same-sex couples can have children]Premise8. Consequently, [cloning can help families to get children]Claim4.
The initial sentence includes the first argument, which consists of Premise6 and Claim3. The following three sentences include the second argument. Premise7 and Premise8 both support Claim4 in the last sentence. Both arguments cover different aspects (development in science and cloning humans) which both support the author’s standpoint on cloning. This example illustrates that knowing argumentative relations is important for separating several arguments in a paragraph. The example also shows that argument components frequently exhibit preceding text units that are not relevant to the argument but helpful for recognizing the argument component type. For example, preceding discourse connectors like “therefore”, “consequently”, or “thus” can signal a subsequent claim. Discourse markers like “because”, “since”, or “furthermore” could indicate a premise. We refer to these text units as preceding tokens. The third body paragraph illustrates a contra argument and argumentative attack relations:
Admittedly, [cloning could be misused for military purposes]Claim5. For example, [it could be used to manipulate human genes in order to create obedient soldiers with extraordinary abilities]Premise9. However, because [moral and ethical values are internationally shared]Premise10, [it is very unlikely that cloning will be misused for
The paragraph begins with Claim5, which attacks the stance of the author. It is supported by Premise9 in the second sentence. The third sentence includes two premises, both of which defend the stance of the author. Premise11 is an attack of Claim5 and Premise10 supports Premise11. The last paragraph (conclusion) restates the major claim and summarizes the main aspects of the essay:
To sum up, although [permitting cloning might bear some risks like misuse for military purposes], I strongly believe that [this technology is beneficial to humanity]. It is likely that [this technology bears some important cures which will significantly improve life conditions].
The conclusion of the essay starts with an attacking claim followed by the restatement of the major claim. The last sentence includes another claim that summarizes the most important points of the author’s argumentation. Figure 2 shows the entire argumentation structure of the example essay.
Figure 2 Argumentation structure of the example essay. Arrows indicate argumentative relations. Arrowheads denote argumentative support relations and circleheads attack relations. Dashed lines indicate relations that are encoded in the stance attributes of claims. “P” denotes premises.
The motivation for creating a new corpus is threefold: First, our previous corpus is relatively small. We believe that more data will improve the accuracy of our computational models. Second, we ensure the reproducibility of the annotation study and validate our previous results. Third, we improved our annotation guidelines. We added more precise rules for segmenting argument components and a detailed description of common essay structures. We expect that our novel annotation guidelines will guide annotators towards adequate agreement without collaborative training sessions. Our annotation guidelines comprise 31 pages and include the following three steps:
1. Topic and stance identification: We found in our previous annotation study that knowing the topic and stance of an essay improves inter-annotator agreement (Stab and Gurevych 2014a). For this reason, we ask the annotators to read the entire essay before starting with the annotation task.
2. Annotation of argument components: Annotators mark major claims, claims and premises. They annotate the boundaries of argument components and determine the stance attribute of claims.
3. Linking premises with argumentative relations: The annotators identify the structure of arguments by linking each premise to a claim or another premise with argumentative support or attack relations.
Three non-native speakers with excellent English proficiency participated in our annotation study. One of the three annotators already participated in our previous study (expert annotator). The two other annotators learned the task by independently reading the annotation guidelines. We used the brat rapid annotation tool (Stenetorp et al. 2012). It provides a graphical web interface for marking text units and linking them.
4.1 Data
We randomly selected 402 English essays from essayforum.com. This online forum is an active community which provides correction and feedback about different texts such as research papers, essays, or poetry. For example, students post their essays in order to receive feedback about their writing skills while preparing for standardized language tests. We manually reviewed each essay and selected only those with a sufficiently detailed description of the writing prompt. The corpus includes 7,116 sentences with 147,271 tokens.
4.2 Inter-Annotator Agreement
All three annotators independently annotated a random subset of 80 essays. The remaining 322 essays were annotated by the expert annotator. We evaluate the inter-annotator agreement of the argument component annotations using two different strategies: First, we evaluate if the annotators agree on the presence of argument components in sentences using observed agreement and Fleiss’ $\kappa$ (Fleiss 1971). We consider each sentence as a markable and evaluate the presence of each argument component type $t$ in a sentence individually. Accordingly, the number of markables for each argument component type $t$ corresponds to the number of sentences N = 1,441, the number of annotations per markable equals the number of annotators n = 3, and the number of categories is k = 2 (“t” or “not t”). Evaluating the agreement at the sentence level is an approximation of the actual agreement since the boundaries of argument components can differ from sentence boundaries and a sentence can include several argument components. Therefore, for the second evaluation strategy, we employ Krippendorff’s $\alpha_U$ (Krippendorff 2004), which considers the differences in the component boundaries at the token level. Thus, it allows for assessing the reliability of our annotation study more accurately. For determining the inter-annotator agreement, we use DKPro Agreement, whose implementations of inter-annotator agreement measures are well-tested with various examples from literature (Meyer et al. 2014).
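The sentence-level measure can be illustrated with a small worked example. The following is a minimal sketch of Fleiss' $\kappa$ for one component type, assuming the annotations are already available as per-sentence category counts; the function name and the toy data are hypothetical and this is not the DKPro Agreement implementation used in the study.

```python
from typing import List


def fleiss_kappa(counts: List[List[int]]) -> float:
    """Fleiss' kappa; counts[i][j] = annotators assigning category j to markable i."""
    n_items = len(counts)
    n_raters = sum(counts[0])  # assumed identical for every markable
    n_cats = len(counts[0])

    # Observed agreement: share of agreeing annotator pairs, averaged over markables.
    p_obs = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ) / n_items

    # Expected agreement from the overall category proportions.
    p_cat = [
        sum(row[j] for row in counts) / (n_items * n_raters) for j in range(n_cats)
    ]
    p_exp = sum(p * p for p in p_cat)

    # Undefined (division by zero) if all ratings fall into a single category.
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical toy data: 4 sentences, n = 3 annotators, k = 2 categories ("t", "not t").
toy_counts = [[3, 0], [0, 3], [2, 1], [0, 3]]
kappa = fleiss_kappa(toy_counts)
```

In the setting described above, `counts` would have N = 1,441 rows per component type, each row summing to n = 3.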
Table 2 Inter-annotator agreement of argument components.
Table 2 shows the inter-annotator agreement of each argument component type. The agreement is best for major claims. The observed agreement of 97.9% and the corresponding $\kappa$ score indicate that annotators reliably identify major claims in persuasive essays. In addition, the unitized alpha score for major claims shows that there are only few disagreements about their boundaries. The results also indicate good agreement for premises in terms of both $\kappa$ and $\alpha_U$. We obtain the lowest agreement for claims, which shows that the identification of claims is more complex than identifying major claims and premises. The joint unitized measure for all argument components improved by .043 compared to our previous study (Stab and Gurevych 2014b). Therefore, we conclude that human annotators can reliably annotate argument components in persuasive essays.
For determining the agreement of the stance attribute, we follow the same methodology as for the sentence level agreement described above, but we consider each sentence containing a claim as “for” or “against” according to its stance attribute and all sentences without a claim as “none”. Consequently, the agreement of claims constitutes the upper bound for the stance attribute. We obtain an observed agreement of 88.5% and a $\kappa$ score that is slightly below the agreement scores of claims (cf. Table 2). Therefore, human annotators can reliably differentiate between supporting and attacking claims.
We determined the markables for evaluating the agreement of argumentative relations by pairing all argument components in the same paragraph. For each paragraph with argument components $c_1, \dots, c_n$, we consider each pair $p = (c_i, c_j)$ with $1 \le i, j \le n$ and $i \neq j$ as markable. Thus, the set of all markables corresponds to all argument component pairs that can be annotated according to our guidelines. The number of argument component pairs is N = 4,922, the number of ratings per markable is n = 3, and the number of categories is k = 2.
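The pairing scheme can be made concrete with a short sketch. It assumes that a paragraph is given as a list of component identifiers in text order; the identifiers and the function name are hypothetical.

```python
from itertools import permutations
from typing import List, Tuple


def markable_pairs(components: List[str]) -> List[Tuple[str, str]]:
    """All ordered pairs (source, target) of distinct components in one paragraph."""
    return list(permutations(components, 2))


# Hypothetical paragraph with three argument components.
paragraph = ["premise_a", "claim_b", "premise_c"]
pairs = markable_pairs(paragraph)
```

A paragraph with n components therefore contributes n(n - 1) markables, and both directions of every pair are rated separately.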
Table 3 shows the inter-annotator agreement of argumentative relations. For both argumentative support and attack relations, we obtain $\kappa$-scores above .7, which allows tentative conclusions (Krippendorff 2004). On average, the annotators marked only 0.9% of the 4,922 pairs as argumentative attack relations and 18.4% as argumentative support relations. Although the agreement is usually much lower if a category is rare (Artstein and Poesio 2008, p. 573), the annotators agree more on argumentative attack relations. This indicates that the identification of argumentative attack relations is a simpler task than identifying argumentative support relations. The agreement scores for argumentative relations are approximately .10 lower compared to our previous study. This difference can be attributed to the fact that we did not explicitly annotate relations between claims and major claims, which are easy to annotate due to the known types of argument components (cf. Section 3.2).
4.3 Analysis of Human Disagreement
For analyzing the disagreements between the annotators, we determined confusion probability matrices (CPM) (Cinková, Holub, and Kríž 2012). Compared to traditional confusion matrices, a CPM also allows analyzing confusion if more than two annotators are involved in an annotation study. A CPM includes conditional probabilities that an annotator assigns a category in the column given that another annotator selected the category in the row. Table 4 shows the CPM of argument component annotations. It shows that the highest confusion is between claims and premises.
Table 4 Confusion probability matrix of argument component annotations (“NoArg” indicates sentences without argumentative content).
We observed that one annotator frequently did not split sentences including a claim. For instance, the annotator labeled the entire sentence as a claim although it includes an additional premise. This type of error also explains the lower unitized alpha score compared to the sentence level agreements in Table 2. Furthermore, we found that concessions before claims were frequently not annotated as an attacking premise. For example, annotators often did not split sentences similar to the following example:
Although [in some cases technology makes people’s life more complicated], [the convenience of technology outweighs its drawbacks].
The distinction between major claims and claims exhibits less confusion. This may be due to the fact that major claims are relatively easy to locate in essays since they occur usually in introductions or conclusions whereas claims can occur anywhere in the essay.
Table 5 Confusion probability matrix of argumentative relation annotations (“Not-Linked” indicates argument component pairs which are not argumentatively related).
Table 5 shows the CPM of argumentative relations. There is little confusion between argumentative support and attack relations. The CPM also shows that the highest confusion is between argumentative relations (support and attack) and unlinked pairs. This can be attributed to the identification of the correct targets of premises. In particular, we observed that agreement on the targets decreases if a paragraph includes several claims or serial argument structures.
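The confusion probability matrices used in this analysis can be sketched as follows. This is one plausible reading of the definition given above (counting every ordered pair of distinct annotators per markable and normalizing each row), written under that assumption; the function name and toy labels are hypothetical.

```python
from collections import defaultdict
from itertools import permutations
from typing import Dict, List


def confusion_probability_matrix(
    annotations: List[List[str]],
) -> Dict[str, Dict[str, float]]:
    """annotations[i] holds the labels that all annotators gave to markable i."""
    counts: Dict[str, Dict[str, int]] = defaultdict(lambda: defaultdict(int))
    for labels in annotations:
        # Every ordered pair of two different annotators contributes one count:
        # the first annotator's label selects the row, the second one the column.
        for row_label, col_label in permutations(labels, 2):
            counts[row_label][col_label] += 1

    cpm: Dict[str, Dict[str, float]] = {}
    for row_label, cols in counts.items():
        total = sum(cols.values())
        # Row-normalize to obtain P(column label | row label).
        cpm[row_label] = {col: n / total for col, n in cols.items()}
    return cpm


# Hypothetical toy data: three markables, three annotators each.
toy_annotations = [
    ["Claim", "Claim", "Premise"],
    ["Premise", "Premise", "Premise"],
    ["Claim", "Claim", "Claim"],
]
cpm = confusion_probability_matrix(toy_annotations)
```

Each row of the resulting matrix sums to one, so rows with a large off-diagonal mass point to the categories that annotators confuse most often.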
4.4 Creation of the Final Corpus
We created a partial gold standard of the essays annotated by all annotators. We use this partial gold standard of 80 essays as our test data (20%) and the remaining 322 essays annotated by the expert annotator as our training data (80%). The creation of our gold standard test data consists of the following two steps: First, we merge the annotations of all argument components. Thus, each annotator annotates argumentative relations based on the same argument components. Second, we merge the argumentative relations to compile our final gold standard test data. Since the argument component types are strongly related (the selection of the premises, for instance, depends on the selected claim(s) in a paragraph), we did not merge the annotations using majority voting as in our previous study. Instead, we discussed the disagreements in several meetings with all annotators in order to resolve them.
4.5 Corpus Statistics
Table 6 shows an overview of the size of the corpus. It contains 6,089 argument components: 751 major claims, 1,506 claims, and 3,832 premises. Such a large proportion of premises compared to claims is common in argumentative texts since writers tend to provide several reasons for ensuring a robust standpoint (Mochales-Palau and Moens 2011).
The non-argumentative text amounts to 47,474 tokens (32.2%) and 1,631 sentences (22.9%). The number of sentences with several argument components is 583, of which 302 include several components with different types (e.g. a claim followed by a premise). Therefore, the identification of argument components requires the separation of argumentative from non-argumentative text units and the recognition of component boundaries at the token level. The number of paragraphs with unlinked argument components (e.g. unsupported claims without incoming relations) is 421 (23%). Thus, methods that link all argument components in a paragraph are only of limited use for identifying the argumentation structures in our corpus.
In total, the corpus includes 1,130 arguments, i.e. claims supported by at least one premise. Only 140 of them have an attack relation. Thus, the proportion of arguments with attack relations is considerably lower than in the microtext corpus from Peldszus and Stede (2015). Most of the arguments are convergent, i.e. the depth of the argument is one. The number of arguments with serial structure is 236 (20.9%).
Our approach for parsing argumentation structures consists of five consecutive subtasks depicted in Figure 3. The identification model separates argumentative from non-argumentative text units and recognizes the boundaries of argument components. The next three models constitute a joint model for recognizing the argumentation structure. We train two base classifiers. The argument component classification model labels each argument component as major claim, claim or premise while the argumentative relation identification model recognizes if two argument components are argumentatively linked or not. The tree generation model globally optimizes the results of the two base classifiers for finding a tree (or several ones) in each paragraph. Finally, the stance recognition model differentiates between support and attack relations.
Figure 3 Architecture of the argumentation structure parser
For preprocessing, we use several models from the DKPro Framework (Eckart de Castilho and Gurevych 2014). We identify tokens and sentence boundaries using the LanguageTool segmenter and identify paragraphs by checking for line breaks. We lemmatize each token using the mate tools lemmatizer (Bohnet et al. 2013) and apply the Stanford part-of-speech (POS) tagger (Toutanova et al. 2003), constituent and dependency parsers (Klein and Manning 2003), and sentiment analyzer (Socher et al. 2013). We use a discourse parser from Lin, Ng, and Kan (2014) for recognizing PDTB-style discourse relations. We employ the DKPro TC text classification framework (Daxenberger et al. 2014) for feature extraction and experimentation.
In the following sections, we describe each model in detail. For finding the best-performing models, we conduct model selection on our training data using 5-fold cross-validation. Then, we conduct model assessment on our test data. We determine the evaluation scores of each cross-validation experiment by accumulating the confusion matrices of each fold into one confusion matrix, which has been shown to be a less biased method for evaluating cross-validation experiments (Forman and Scholz 2010). We employ macro-averaging as described by Sokolova and Lapalme (2009) and report macro precision (P), macro recall (R) and macro F1 scores (F1). We use the McNemar test (McNemar 1947) with p = .05 for significance testing. Compared to other tests, it does not make as many assumptions about the distribution in the data (Japkowicz and Shah 2014). Furthermore, this test compares the outcomes of two classifiers to the gold standard and does not require several trials. Thus, it allows for assessing the differences of the models in both of our evaluation scenarios (model selection and model assessment).
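The evaluation procedure described above can be sketched as follows. The sketch pools the per-fold confusion matrices and then computes macro-averaged scores; it assumes that macro F1 is the harmonic mean of macro precision and macro recall, which is one common convention and may differ in detail from the scoring used in the experiments. All names and numbers are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[int]]  # matrix[gold][predicted] = number of instances


def pool_folds(folds: List[Matrix]) -> Matrix:
    """Accumulate the confusion matrices of all folds into a single matrix."""
    k = len(folds[0])
    return [[sum(fold[g][p] for fold in folds) for p in range(k)] for g in range(k)]


def macro_scores(matrix: Matrix) -> Tuple[float, float, float]:
    """Macro precision, macro recall, and their harmonic mean (assumed macro F1)."""
    k = len(matrix)
    precisions, recalls = [], []
    for c in range(k):
        true_pos = matrix[c][c]
        predicted_c = sum(matrix[g][c] for g in range(k))
        gold_c = sum(matrix[c])
        precisions.append(true_pos / predicted_c if predicted_c else 0.0)
        recalls.append(true_pos / gold_c if gold_c else 0.0)
    macro_p = sum(precisions) / k
    macro_r = sum(recalls) / k
    macro_f1 = 2 * macro_p * macro_r / (macro_p + macro_r) if macro_p + macro_r else 0.0
    return macro_p, macro_r, macro_f1


# Hypothetical confusion matrices of two folds for a binary task.
fold_1 = [[8, 2], [1, 9]]
fold_2 = [[7, 3], [2, 8]]
pooled = pool_folds([fold_1, fold_2])
macro_p, macro_r, macro_f1 = macro_scores(pooled)
```

Pooling before scoring avoids averaging per-fold scores that may be undefined or unstable when a class is rare in a single fold.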
The remainder of this section is structured as follows: In the following section, we introduce the baselines and the upper bound for each task. In Section 5.2, we present the identification model that detects argument components and their boundaries. In Section 5.3, we propose a new joint model for identifying argumentation structures. In Section 5.4, we introduce our stance recognition model. In Section 5.5, we report the results of the model assessment on our test data and on the microtext corpus from Peldszus and Stede (2015). We present the results of the error analysis in Section 5.6. We evaluate the identification model independently and use the gold standard argument components for evaluating the remaining models.
5.1 Baselines and Upper Bound
For evaluating our models, we use two different types of baselines: First, we employ majority baselines which label each instance with the majority class. Table A1 in the appendix shows the class distribution in our training data and test data for each task.
Second, we use heuristic baselines, which are motivated by the common structure of persuasive essays (Whitaker 2009; Perutz 2010). The heuristic baseline of the identification task exploits sentence boundaries. It selects all sentences as argument components except the first two and the last sentence of an essay. The heuristic baseline of the classification task labels the first argument component in each body paragraph as claim and all remaining components in body paragraphs as premise. The last argument component in the introduction and the first argument component in the conclusion are classified as major claim and all remaining argument components in the introduction and conclusion are labeled as claim. The heuristic baseline for the relation identification classifies an argument component pair as linked if the target is the first component of a body paragraph. We expect that this baseline will yield good results because 62% of all body paragraphs in our corpus start with a claim. The heuristic baseline of the stance recognition classifies each argument component in the second-to-last paragraph as attack. The motivation for this baseline stems from essay writing guidelines which recommend including opposing arguments in the second-to-last paragraph.
We determine the human upper bound for each task by averaging the evaluation scores of all three annotator pairs on our test data.
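As an illustration, the heuristic baseline of the classification task can be written in a few lines. The sketch assumes that the first paragraph of an essay is the introduction, the last one the conclusion, and all others body paragraphs; the data layout and names are hypothetical.

```python
from typing import Dict, List


def heuristic_component_types(essay: List[List[str]]) -> Dict[str, str]:
    """essay: paragraphs in text order, each a list of component ids in text order."""
    labels: Dict[str, str] = {}
    last_paragraph = len(essay) - 1
    for p_idx, paragraph in enumerate(essay):
        for c_idx, component in enumerate(paragraph):
            if p_idx == 0:
                # Introduction: the last component is the major claim.
                is_major = c_idx == len(paragraph) - 1
                labels[component] = "MajorClaim" if is_major else "Claim"
            elif p_idx == last_paragraph:
                # Conclusion: the first component is the major claim.
                labels[component] = "MajorClaim" if c_idx == 0 else "Claim"
            else:
                # Body paragraph: the first component is a claim, the rest premises.
                labels[component] = "Claim" if c_idx == 0 else "Premise"
    return labels


# Hypothetical essay with an introduction, one body paragraph, and a conclusion.
toy_essay = [["c1", "c2"], ["c3", "c4", "c5"], ["c6", "c7"]]
baseline_labels = heuristic_component_types(toy_essay)
```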
5.2 Identifying Argument Components
We consider the identification of argument components as a sequence labeling task at the token level. We encode the argument components using an IOB-tagset (Ramshaw and Marcus 1995) and consider an entire essay as a single sequence. Accordingly, we label the first token of each argument component as “Arg-B”, the tokens covered by an argument component as “Arg-I”, and non-argumentative tokens as “O”. As a learner, we use a CRF (Lafferty, McCallum, and Pereira 2001) with averaged perceptron training method (Collins 2002). Since a CRF considers contextual information, the model is particularly suited for sequence labeling tasks (Goudas et al. 2014, p. 292). For each token, we extract the following features (Table 7):
Structural features capture the position of the token. We expect that these features are effective for filtering non-argumentative text units since the introductions and conclusions of essays include little argumentatively relevant content. The punctuation features indicate if the token is a punctuation mark and if the token is adjacent to a punctuation mark.
Syntactic features consist of the token’s POS as well as features extracted from the lowest common ancestor (LCA) of the current token $t_i$ and its adjacent tokens in the constituent parse tree. First, we define $LCA_{preceding}(t_i) = \frac{|lcaPath(t_i, t_{i-1})|}{depth}$, where $t_{i-1}$ is the token preceding $t_i$, $|lcaPath(u, v)|$ is the length of the path from $u$ to the LCA of $u$ and $v$, and $depth$ the depth of the constituent parse tree. Second, we define $LCA_{following}(t_i) = \frac{|lcaPath(t_i, t_{i+1})|}{depth}$, which considers the current token $t_i$ and its following token $t_{i+1}$. Additionally, we add the constituent types of both lowest common ancestors to our feature set.
Table 7 Features used for argument component identification (*indicates genre-dependent features)
Lexico-syntactic features have been shown to be effective for segmenting elementary discourse units (Hernault et al. 2010). We adopt the features introduced by Soricut and Marcu (2003). We use lexical head projection rules (Collins 2003) implemented in the Stanford tool suite to lexicalize the constituent parse tree. For each token t, we extract its uppermost node n in the parse tree with the lexical head t and define a lexico-syntactic feature as the combination of t and the constituent type of n. We also consider the child node of n in the path to t and its right sibling, and combine their lexical heads and constituent types as described by Soricut and Marcu (2003).
The probability feature is the conditional probability of the current token $t_i$ being the beginning of an argument component (“Arg-B”) given its preceding tokens. We maximize the probability for preceding tokens of a length up to $n = 3$:

$$\max_{n \in \{1, 2, 3\}} P(t_i = \text{Arg-B} \mid t_{i-n}, \dots, t_{i-1})$$

To estimate these probabilities, we divide the number of times the preceding tokens $t_{i-j}, \dots, t_{i-1}$ with $1 \le j \le n$ precede a token $t_i$ labeled as “Arg-B” by the total number of occurrences of the preceding tokens in our training data.
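The IOB encoding and the probability feature of this section can be sketched together. The code is a simplified illustration under stated assumptions: components are given as token offsets with exclusive end, tokens are compared as plain strings, and unseen contexts yield a probability of zero. All function names and the toy sentences are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

Sequence = Tuple[List[str], List[str]]  # (tokens, IOB labels)


def encode_iob(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Label the first token of each component Arg-B, the rest Arg-I, others O."""
    labels = ["O"] * len(tokens)
    for start, end in spans:  # end offset is exclusive
        labels[start] = "Arg-B"
        for i in range(start + 1, end):
            labels[i] = "Arg-I"
    return labels


def count_contexts(data: List[Sequence], max_n: int = 3) -> Tuple[Counter, Counter]:
    """Count each preceding n-gram and how often it is followed by Arg-B."""
    total, before_b = Counter(), Counter()
    for tokens, labels in data:
        for i in range(1, len(tokens) + 1):
            for n in range(1, min(max_n, i) + 1):
                gram = tuple(tokens[i - n:i])
                total[gram] += 1
                if i < len(tokens) and labels[i] == "Arg-B":
                    before_b[gram] += 1
    return total, before_b


def probability_feature(tokens, i, total, before_b, max_n: int = 3) -> float:
    """Maximum estimated P(Arg-B | preceding n tokens) for n = 1..max_n."""
    best = 0.0
    for n in range(1, min(max_n, i) + 1):
        gram = tuple(tokens[i - n:i])
        if total[gram]:
            best = max(best, before_b[gram] / total[gram])
    return best


# Hypothetical training data: two tokenized sentences.
sent_1 = ["Therefore", ",", "cloning", "helps", "families", "."]
sent_2 = ["However", ",", "this", "is", "rare", "."]
labels_1 = encode_iob(sent_1, [(2, 5)])
labels_2 = encode_iob(sent_2, [])
total, before_b = count_contexts([(sent_1, labels_1), (sent_2, labels_2)])
feat_1 = probability_feature(sent_1, 2, total, before_b)
feat_2 = probability_feature(sent_2, 2, total, before_b)
```

In the toy data the comma alone precedes a component beginning in one of two cases, whereas the bigram “Therefore ,” always does, so taking the maximum over context lengths lets the longer and more specific context dominate.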
5.2.1 Results of Argument Component Identification. The results of model selection show that using all features performs best. Table B1 in the appendix shows the detailed results of the feature analysis. Table 8 shows the results of the model assessment on the test data. The heuristic baseline achieves a macro F1 score of .642 and outperforms the majority baseline by .383. It achieves an F1 score of .677 for non-argumentative tokens (“O”) and .867 for argumentative tokens (“Arg-I”). Thus, the heuristic baseline effectively separates argumentative from non-argumentative text units. However, it achieves a low F1 score of .364 for identifying the beginning of argument components (“Arg-B”). Since it does not split sentences, it recognizes 145 fewer argument components compared to the number of gold standard components in the test data.
The CRF model with all features significantly outperforms the heuristic baseline (Table 8). It achieves a macro F1 score of .867. Compared to the heuristic baseline, it performs considerably better in identifying the beginning of argument components. It also performs better for separating argumentative from non-argumentative text units. In addition, the number of identified argument components differs only slightly from the number of gold standard components in our test data. It identifies 1,272 argument components, whereas the number of gold standard components in our test data amounts to 1,266. The human upper bound yields a macro F1 score of .886 for identifying argument components. The macro F1 score of our model is only .019 less. Therefore, our model achieves 97.9% of human performance.
Table 8 Model assessment of argument component identification (baseline heuristic)
5.2.2 Error Analysis. For identifying the most frequent errors of our model, we manually investigated the predicted argument components. The most frequent errors are false positives of “Arg-I”. The model classifies 1,548 out of 9,403 non-argumentative tokens (“O”) as argumentative (“Arg-I”). The reason for these errors is threefold: First, the model frequently labels non-argumentative sentences in the conclusion of an essay as argumentative. These sentences are, for instance, non-argumentative recommendations for future actions or summarizations of the essay topic. Second, the model does not correctly recognize non-argumentative sentences in body paragraphs. It wrongly identifies argument components in 13 out of the 15 non-argumentative body paragraph sentences in our test data. The reason for these errors may be attributed to the high class imbalance in our training data. Third, the model tends to annotate lengthy non-argumentative preceding tokens as argumentative. For instance, it labels subordinate clauses preceding the actual argument component as argumentative in sentences similar to “In addition to the reasons mentioned above, [actual ‘Arg-B’] ...” (underlined text units represent the annotations of our model).
The second most frequent cause of errors are misclassified beginnings of argument components. The model classifies 137 of the 1,266 beginning tokens as “Arg-I”. The model, for instance, fails to identify the correct beginning in sentences like “Hence, from this case we are capable of stating that [actual ‘Arg-B’] ... ” or “Apart from the reason I mentioned above, another equally important aspect is that [actual ‘Arg-B’] ...”. These examples also explain the false negatives of non-argumentative tokens which are wrongly classified as “Arg-B”.
5.3 Recognizing Argumentation Structures
The identification of argumentation structures involves the classification of argument component types and the identification of argumentative relations. Both argumentative types and argumentative relations share mutual information (Stab and Gurevych 2014b, p. 54). For instance, if an argument component is classified as claim, it is less likely to exhibit outgoing relations and more likely to have incoming relations. On the other hand, an argument component with an outgoing relation and few incoming relations is more likely to be a premise. Therefore, we propose a joint model which combines both types of information for finding the optimal structure. We train two local base classifiers. One classifier recognizes the type of argument components, and another identifies argumentative relations between argument components. For both models, we use an SVM (Cortes and Vapnik 1995) with a polynomial kernel implemented in the Weka machine learning framework (Hall et al. 2009). The motivation for selecting this learner stems from the results of our previous work, in which we found that SVMs outperform several other learners in both tasks (Stab and Gurevych 2014b, p. 51). We globally optimize the outcomes of both classifiers in order to find the optimal argumentation structure using integer linear programming.
5.3.1 Classifying Argument Components. We consider the classification of argument component types as multiclass classification and label each argument component as “major claim”, “claim” or “premise”. We experiment with the following feature groups:
Lexical features consist of binary lemmatized unigrams and the 2k most frequent dependency word pairs. We extract the unigrams from the component and its preceding tokens to ensure that discourse markers are included in the features.
Structural features capture the position of the component in the document and token statistics (Table 9). Since major claims occur frequently in introductions or conclusions, we expect that these features are valuable for differentiating component types.
Indicator features are based on four categories of lexical indicators that we manually extracted from 30 additional essays. Forward indicators such as “therefore”, “thus”, or “consequently” signal that the component following the indicator is a result of preceding argument components. Backward indicators indicate that the component following the indicator supports a preceding component. Examples of this category are “in addition”, “because”, or “additionally”. Thesis indicators such as “in my opinion” or “I believe that” indicate major claims. Rebuttal indicators signal attacking premises or contra arguments. Examples are “although”, “admittedly”, or “but”. The complete lists of all four categories are provided in Table C1 in the appendix. We define for each category a binary feature that indicates if an indicator of a category is present in the component or its preceding tokens. An additional binary feature indicates if first-person indicators are present in the argument component or its preceding tokens (Table 9). We assume that first-person indicators are informative for identifying major claims.
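The four binary indicator features can be illustrated with a short sketch. It uses only the example indicators quoted above rather than the complete lists of Table C1, and matches them as whole words or phrases in lowercased text; the matching strategy and all names are assumptions made for illustration.

```python
import re
from typing import Dict

# Example indicators quoted in the text; the complete lists are in Table C1.
INDICATORS = {
    "forward": ["therefore", "thus", "consequently"],
    "backward": ["in addition", "because", "additionally"],
    "thesis": ["in my opinion", "i believe that"],
    "rebuttal": ["although", "admittedly", "but"],
}


def indicator_features(text: str) -> Dict[str, bool]:
    """One binary feature per category for a component plus its preceding tokens."""
    lowered = text.lower()
    features = {}
    for category, phrases in INDICATORS.items():
        features[category] = any(
            re.search(r"\b" + re.escape(phrase) + r"\b", lowered)
            for phrase in phrases
        )
    return features


claim_feats = indicator_features("Consequently, cloning can help families to get children")
thesis_feats = indicator_features("In my opinion, this technology is beneficial")
```

Matching on word boundaries prevents short indicators such as “but” from firing inside longer words.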
Contextual features capture the context of an argument component. We define eight binary features set to true if a forward, backward, rebuttal or thesis indicator precedes or follows the current component in its covering paragraph. Additionally, we count the number of noun and verb phrases of the argument component that are also present in the introduction or conclusion of the essay. These features are motivated by the observation that claims frequently restate entities or phrases of the essay topic. Furthermore, we add four binary features indicating if the current component shares a noun or verb phrase with the introduction or conclusion.
Syntactic features consist of the POS distribution of the argument component, the number of subclauses in the covering sentence, the depth of the constituent parse tree of the covering sentence, the tense of the main verb of the component, and a binary feature that indicates whether a modal verb is present in the component.
The probability features are the conditional probabilities of the current component being assigned the type $t$ (major claim, claim, or premise) given the sequence of tokens $p$ directly preceding the component. To estimate $P(t \mid p)$, we divide the number of times the preceding tokens $p$ appear before a component tagged as $t$ by the total number of occurrences of $p$ in our training data.
Discourse features are based on the output of the PDTB-style discourse parser from Lin, Ng, and Kan (2014). Each binary feature is a triple combining the following information: (1) the type of the relation that overlaps with the current argument component, (2) whether the current argument component overlaps with the first or second elementary discourse unit of a relation, and (3) if the discourse relation is implicit or explicit. For instance, the feature “Contrast_imp_Arg1” indicates that the current component overlaps with the first discourse unit of an implicit contrast relation. The use of these features is motivated by the findings of Cabrio, Tonelli, and Villata (2013). By analyzing several example arguments, they hypothesized that general discourse relations could be informative for identifying argument components.
Table 9 Features of the argument component classification model (*indicates genre-dependent features)
Embedding features are based on word embeddings trained on a part of the Google news data set (Mikolov et al. 2013). We sum the vectors of each word of an argument component and its preceding tokens and add it to our feature set. In contrast to common bag-of-words representations, embedding features have a continuous feature space that helped to achieve better results in several NLP tasks (Socher et al. 2013).
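The embedding feature amounts to an element-wise sum of word vectors. The following sketch uses a tiny hypothetical lookup table instead of the pretrained vectors and assumes that out-of-vocabulary tokens are skipped, which the text does not specify.

```python
from typing import Dict, List


def sum_embeddings(
    tokens: List[str], vectors: Dict[str, List[float]], dim: int
) -> List[float]:
    """Element-wise sum of the vectors of all tokens found in the lookup table."""
    total = [0.0] * dim
    for token in tokens:
        vector = vectors.get(token)
        if vector is None:
            continue  # assumption: tokens without a vector are ignored
        total = [a + b for a, b in zip(total, vector)]
    return total


# Hypothetical 3-dimensional vectors for illustration only.
toy_vectors = {
    "cloning": [1.0, 0.0, 2.0],
    "helps": [0.5, 1.0, 0.0],
}
component_vector = sum_embeddings(["cloning", "helps", "families"], toy_vectors, dim=3)
```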
By experimenting with individual features and several feature combinations, we found that a combination of all features yields the best results. The results of the model selection can be found in Table B2 in the appendix.
5.3.2 Identifying Argumentative Relations. The relation identification model classifies ordered pairs of argument components as “linked” or “not-linked”. In this analysis step, we consider both argumentative support and attack relations as “linked”. For each paragraph with argument components $c_1, \dots, c_n$, we consider $p = (c_i, c_j)$ with $i \neq j$ and $1 \le i, j \le n$ as an argument component pair. An argument component pair is “linked” if our corpus contains an argumentative relation with $c_i$ as source component and $c_j$ as target component. The class distribution is skewed towards “not-linked” pairs (Table A1). We experiment with the following features:
Lexical features are binary lemmatized unigrams of the source and target component and their preceding tokens. We limit the number of unigrams for both source and target component to the 500 most frequent words in our training data.
Syntactic features include binary POS features of the source and target component and the 500 most frequent production rules extracted from the parse tree of the source and target component as described in our previous work (Stab and Gurevych 2014b).
Structural features consist of the number of tokens in the source and target component, statistics on the components of the covering paragraph of the current pair, and position features (Table 10).
Indicator features are based on the forward, backward, thesis and rebuttal indicators introduced in Section 5.3.1. We extract binary features from the source and target component and the context of the current pair (Table 10). We assume that these features are helpful for modeling the direction of argumentative relations and the context of the current component pair.
Discourse features are extracted from the source and target component of each component pair as described in Section 5.3.1. Although PDTB-style discourse relations are limited to adjacent relations, we expect that the types of general discourse relations can be helpful for identifying argumentative relations. We also experimented with features capturing PDTB relations between the target and source component. However, those were not effective for capturing argumentative relations.
PMI features are based on the assumption that particular words indicate incoming or outgoing relations. For instance, tokens like “therefore”, “thus”, or “hence” can signal incoming relations, whereas tokens such as “because”, “since”, or “furthermore” may indicate outgoing relations. To capture this information, we use pointwise mutual information (PMI), which has been successfully used for measuring word associations (Turney 2002; Church and Hanks 1990). However, instead of determining the PMI of two words, we estimate the PMI between a lemmatized token t and the direction of a relation $d \in \{incoming, outgoing\}$ as $PMI(t, d) = \log \frac{p(t, d)}{p(t)\, p(d)}$. Here, p(t, d) is the probability that token t occurs in an argument component with either incoming or outgoing relations. The ratio between p(t, d) and p(t) p(d) indicates the dependence between a token and the direction of a relation. We estimate PMI(t, d) for each token in our training data. We extract the ratio of tokens positively and negatively associated with incoming or outgoing relations for both source and target component. Additionally, we extract four binary features which indicate if any token of the components has a positive or negative association with either incoming or outgoing relations.
Shared noun features (shNo) indicate if the source and target component share a noun. We also add the number of shared nouns to our feature set. These features are motivated by the fact that premises and claims in classical syllogisms share the same subjects (Govier 2010, p. 199).
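The PMI features described above can be illustrated with a minimal Python sketch. It is not the implementation used in our experiments: the function names `estimate_pmi` and `association_ratios` are hypothetical, the probabilities are estimated by simple relative frequencies over (component, direction) events, and unseen token-direction pairs are treated as neutral. All three choices are assumptions made for illustration.

```python
import math
from collections import Counter


def estimate_pmi(events):
    """Estimate PMI(t, d) from a list of (lemmas, direction) events.

    Each event pairs the lemmas of one argument component with one relation
    direction ("incoming" or "outgoing"). This event definition is an
    assumption of the sketch.
    """
    n = len(events)
    token_count = Counter()   # events whose component contains token t
    dir_count = Counter()     # events with direction d
    joint_count = Counter()   # events with direction d that contain token t
    for lemmas, direction in events:
        dir_count[direction] += 1
        for t in set(lemmas):
            token_count[t] += 1
            joint_count[(t, direction)] += 1
    pmi = {}
    for (t, d), c in joint_count.items():
        p_td = c / n
        p_t = token_count[t] / n
        p_d = dir_count[d] / n
        pmi[(t, d)] = math.log(p_td / (p_t * p_d))
    return pmi


def association_ratios(lemmas, pmi, direction):
    """Share of tokens with positive and with negative PMI for one direction."""
    if not lemmas:
        return 0.0, 0.0
    # Unseen (token, direction) pairs count as neutral (assumption).
    scores = [pmi.get((t, direction), 0.0) for t in lemmas]
    positive = sum(s > 0 for s in scores) / len(scores)
    negative = sum(s < 0 for s in scores) / len(scores)
    return positive, negative
```

Calling `association_ratios` once per direction for the source and once for the target component gives ratio features of the kind described above; the binary variants follow by testing whether a ratio is greater than zero.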
Table 10 Features used for argumentative relation identification (*indicates genre-dependent features)
For selecting the best performing model, we conducted feature ablation tests and experimented with individual features. The results show that none of the feature groups is informative when used individually. We achieved the best performance by removing lexical features from our feature set (detailed results of the model selection can be found in Table B3 in the appendix).
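The construction of the classification instances in this section can be sketched as follows. The snippet enumerates all ordered pairs of distinct components in a paragraph and labels them from a set of gold relations. The function names and the 1-based indexing are illustrative assumptions, not part of our implementation.

```python
from itertools import permutations


def component_pairs(n):
    """All ordered pairs (source, target) of distinct components 1..n."""
    return list(permutations(range(1, n + 1), 2))


def label_pairs(n, gold_relations):
    """Label each ordered pair as "linked" or "not-linked".

    gold_relations is a collection of (source, target) index pairs taken
    from the annotated argumentation structure of one paragraph.
    """
    gold = set(gold_relations)
    return {
        pair: "linked" if pair in gold else "not-linked"
        for pair in component_pairs(n)
    }
```

A paragraph with n components yields n(n - 1) ordered pairs, whereas a tree over these components contains at most n - 1 relations. This makes the skew towards “not-linked” pairs plausible.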
5.3.3 Jointly Modeling Argumentative Relations and Argument Component Types. Both base classifiers identify argument component types and argumentative relations locally. Consequently, the results may not be globally consistent. For instance, the relation identification model does not link 37.1% of all premises in our model selection experiments. Therefore, we propose a joint model that globally optimizes the outcomes of the two base classifiers. We formalize this task as an integer linear programming (ILP) problem. Given a paragraph including n argument components, we define the following objective function

$$\operatorname*{argmax}_{x} \sum_{i=1}^{n} \sum_{j=1}^{n} w_{ij} x_{ij}$$

with variables $x_{ij} \in \{0, 1\}$ indicating an argumentative relation from argument component i to argument component j. Each coefficient $w_{ij} \in \mathbb{R}$ is a weight of a relation. It is determined by incorporating the outcomes of the two base classifiers. For ensuring that the resulting structure is a tree, we define the following constraints:

$$\forall i: \sum_{j=1}^{n} x_{ij} \leq 1$$
$$\sum_{i=1}^{n} \sum_{j=1}^{n} x_{ij} \leq n - 1$$
$$\forall i: x_{ii} = 0$$

The first constraint prevents an argument component i from having more than one outgoing relation. The second constraint ensures that a paragraph includes at least one root node, i.e. a node without outgoing relation. The third constraint prevents an argumentative relation from having the same source and target component.
For preventing cycles, we adopt the approach described by Kübler et al. (2008, p. 92). We add the auxiliary variables $b_{ij} \in \{0, 1\}$ to our objective function, where $b_{ij} = 1$ if there is a directed path from argument component i to argument component j. The following constraints tie the auxiliary variables $b_{ij}$ to the variables $x_{ij}$:

$$\forall i \forall j: x_{ij} - b_{ij} \leq 0$$
$$\forall i \forall j \forall k: b_{ik} - b_{ij} - b_{jk} \geq -1$$
$$\forall i: b_{ii} = 0$$

The first constraint ensures that there is a path from i to j represented in variable $b_{ij}$ if there is a direct relation between the argument components i and j. The second constraint covers all paths of length greater than 1 in a transitive way. It states that if there is a path from argument component i to argument component j ($b_{ij} = 1$) and another path from argument component j to argument component k ($b_{jk} = 1$), then there is also a path from argument component i to argument component k. Thus, it iteratively covers paths of length l + 1 by having covered paths of length l. The third constraint prevents cycles by preventing all directed paths starting and ending with the same argument component.
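To make the effect of these constraints concrete, the following Python sketch searches for the highest-scoring structure by exhaustive enumeration. It is a stand-in for an ILP solver, not the solver used in our experiments, and it is only feasible for paragraphs with few components. The function names and the 0-based indexing are assumptions of the sketch. Instead of auxiliary path variables, it rejects cyclic candidates by following the outgoing relations directly.

```python
from itertools import product


def has_cycle(target):
    """target[i] is the component that i points to, or None for a root."""
    for start in range(len(target)):
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return True
            seen.add(node)
            node = target[node]
    return False


def best_structure(weights):
    """Return (score, relations) maximizing the summed relation weights.

    Every candidate assigns each component at most one outgoing relation
    and no relation to itself. Candidates with more than n - 1 relations
    or with a cycle are discarded.
    """
    n = len(weights)
    best_score, best_relations = float("-inf"), []
    choices = [[None] + [j for j in range(n) if j != i] for i in range(n)]
    for target in product(*choices):
        relations = [(i, j) for i, j in enumerate(target) if j is not None]
        if len(relations) > n - 1 or has_cycle(target):
            continue
        score = sum(weights[i][j] for i, j in relations)
        if score > best_score:
            best_score, best_relations = score, relations
    return best_score, best_relations
```

For example, with `weights = [[0.0, -1.0, -1.0], [1.0, 0.0, 0.2], [0.9, 0.5, 0.0]]` the search selects the relations from components 1 and 2 to component 0, which leaves component 0 as the only root.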
Having defined the ILP model, we consolidate the results of the two base classifiers. We approach this task by determining the weight matrix $W \in \mathbb{R}^{n \times n}$ that includes the coefficients $w_{ij} \in W$ of our objective function. The weight matrix W can be considered an adjacency matrix. The greater the weight of a particular relation is, the higher the likelihood that the relation appears in the optimal structure found by the ILP-solver.
First, we incorporate the results of the relation identification model. Its result can be considered as an adjacency matrix $R \in \{0, 1\}^{n \times n}$. For each pair of argument components (i, j) with $1 \leq i, j \leq n$, each $r_{ij} \in R$ is 1 if the relation identification model predicts an argumentative relation from argument component i (source) to argument component j (target), or 0 if the model does not predict an argumentative relation.
Second, we derive a claim score (cs) for each argument component i from the predicted relations in R:

$$cs_i = \frac{relin_i - relout_i + n - 1}{rel + n - 1}$$

Here, $relin_i = \sum_{k=1}^{n} r_{ki}$ is the number of predicted incoming relations of argument component i, $relout_i = \sum_{l=1}^{n} r_{il}$ is the number of predicted outgoing relations of argument component i, and $rel = \sum_{k=1}^{n} \sum_{l=1}^{n} r_{kl}$ is the total number of relations predicted in the current paragraph. The claim score $cs_i$ is greater for argument components with many incoming relations and few outgoing relations. It becomes smaller for argument components with fewer incoming relations and more outgoing relations. By normalizing the score with the total number of predicted relations and argument components, it also accounts for contextual information in the current paragraph and prevents overly optimistic scores. For example, if all predicted relations point to argument component i which has no outgoing relations, $cs_i$ is exactly 1. On the other hand, if there is an argument component j with no incoming and one outgoing relation in a paragraph with 4 argument components and 3 predicted relations in R, $cs_j$ is $\frac{1}{3}$. Since it is more likely that a relation links an argument component which has a lower claim score to an argument component with a higher claim score, we determine the weight for each argumentative relation as:

$$cr_{ij} = cs_j - cs_i$$

By adding the claim score $cs_j$ of the target component j, we assign a higher weight to relations pointing to argument components which are likely to be a claim. By subtracting the claim score $cs_i$ of the source component i, we assign smaller weights to relations outgoing from argument components with a larger claim score.
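The claim score and the derived relation weight can be checked with a short Python sketch. It assumes the score has the form (incoming - outgoing + n - 1) / (rel + n - 1), which reproduces both examples given above, and that the relation weight is the claim score of the target minus that of the source. The function names are hypothetical, and exact fractions are used only to make the two examples easy to verify.

```python
from fractions import Fraction


def claim_scores(R):
    """Claim score per component from a 0/1 matrix R of predicted relations.

    R[i][j] == 1 means a predicted relation from component i to component j.
    Assumes at least two components, so that the denominator is non-zero.
    """
    n = len(R)
    if n < 2:
        raise ValueError("claim scores need at least two components")
    rel = sum(R[i][j] for i in range(n) for j in range(n) if i != j)
    scores = []
    for i in range(n):
        rel_in = sum(R[k][i] for k in range(n) if k != i)
        rel_out = sum(R[i][l] for l in range(n) if l != i)
        scores.append(Fraction(rel_in - rel_out + n - 1, rel + n - 1))
    return scores


def claim_relation_weights(R):
    """Weight of relation i -> j: claim score of j minus claim score of i."""
    cs = claim_scores(R)
    n = len(R)
    return [[cs[j] - cs[i] for j in range(n)] for i in range(n)]
```

With four components of which three each point to the first one, the first component receives a claim score of 1 and every other component a score of 1/3, so each predicted relation obtains the weight 2/3 while its reverse obtains -2/3.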
Third, we incorporate the argument component types predicted by the classification model. We assign a higher score to the weight $w_{ij}$ if the target component j is predicted as claim since it is more likely that argumentative relations point to claims. Accordingly, we set $c_{ij} = 1$ if argument component j is labeled as claim and $c_{ij} = 0$ if argument component j is labeled as premise.
Finally, we combine all three scores to estimate the weights of the objective function:

$$w_{ij} = \phi_r r_{ij} + \phi_{cr} cr_{ij} + \phi_c c_{ij}$$

Each $\phi$ represents a hyperparameter of the ILP model. In our model selection experiments, we tuned these hyperparameters and selected the configuration that yields the best performance (ILP-balanced). More detailed results of the model selection are provided in Table B4 in the appendix.
After applying the ILP model, we adapt the argumentative relations and argument types according to the results of the ILP-solver. We revise each relation according to the determined scores, set the type of all components without outgoing relation to “claim”, and set the type of all remaining components to “premise”.
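The combination of the three scores and the final relabeling step can be sketched in Python as follows. The sketch assumes that the combination is a weighted sum with one hyperparameter per score. The function names are hypothetical, and the hyperparameter values in the usage note are placeholders rather than the tuned values.

```python
def combine_weights(R, CR, C, phi_r, phi_cr, phi_c):
    """Weighted sum of the three score matrices (assumed form of W).

    R:  predicted relations (0/1), CR: claim score differences,
    C:  1 where the target component is predicted as claim, else 0.
    """
    n = len(R)
    return [
        [phi_r * R[i][j] + phi_cr * CR[i][j] + phi_c * C[i][j] for j in range(n)]
        for i in range(n)
    ]


def component_types(n, relations):
    """Components without an outgoing relation become claims, all others premises."""
    sources = {i for i, _ in relations}
    return ["premise" if i in sources else "claim" for i in range(n)]
```

For instance, `combine_weights(R, CR, C, 1.0, 1.0, 1.0)` weights all three scores equally, and `component_types(3, [(1, 0), (2, 1)])` returns `["claim", "premise", "premise"]` for a serial structure in which component 2 supports component 1 and component 1 supports component 0.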
5.4 Classifying Support and Attack Relations
The stance recognition model differentiates between argumentative support and attack relations. We model this task as binary classification and classify each claim and premise as “support” or “attack”. The stance of each premise is encoded in the type of its outgoing relation, whereas the stance of each claim is encoded in its stance attribute. We use an SVM and the following features (Table 11):
Lexical features are binary lemmatized unigram features of the argument component and its preceding tokens.
Sentiment features are based on the subjectivity lexicon from Wilson, Wiebe, and Hoffmann (2005) and the five sentiment scores produced by the Stanford sentiment analyzer (Socher et al. 2013).
Syntactic features consist of the POS distribution of the component and production rules (Stab and Gurevych 2014b).
Structural features capture the position of the component in the paragraph and token statistics (Table 11).
Discourse features are discourse triples as described in Section 5.3.1. We expect that these features will be helpful for identifying attacking components since the PDTB includes contrast and concession relations.
Embedding features are the embedding features described in Section 5.3.1.
Table 11 Features used for stance recognition
5.5 Evaluation
The upper part of Table 12 shows the F1 scores of the classification, relation identification, and stance recognition tasks using our test data. The heuristic baselines outperform the majority baselines in all three tasks by a considerable margin. They achieve an average macro F1 score of .674, which confirms our assumption that argumentation structures in persuasive essays can be identified with simple heuristic rules (Section 5.1).
Our base classifiers for component classification and relation identification both improve the macro F1 scores of the heuristic baselines. The component classification model achieves a macro F1 score of .794. Compared to the heuristic baseline, the model yields slightly worse results for claims and premises but improves the identification
of major claims by .132. However, the difference between the component classification model and the heuristic baseline is not statistically significant. On the other hand, the relation identification model significantly improves the result of the heuristic baseline, achieving a macro F1 score of .717. Additionally, the stance recognition model significantly outperforms the heuristic baseline by .118 macro F1 score. It yields an F1 score of .947 for supporting components and .413 for attacking components.
Table 12 F1 scores of model assessment. The upper part shows the results on the test data of the persuasive essay corpus, while the lower part shows the results on the microtext corpus from Peldszus and Stede (2015) (improvement over base classifier).
The ILP joint model significantly outperforms the heuristic baselines for component classification and relation identification. Additionally, it significantly outperforms the base classifier for component classification. However, it does not yield a significant improvement over the base classifier for relation identification, although it improves the macro F1 score of that base classifier by .034. The results show that the identification of claims and linked component pairs benefits most from the joint model. Compared to the base classifiers, the ILP joint model improves the F1 score of claims by .071 and the F1 score of linked component pairs by .077.
The human upper bound yields macro F1 scores of .868 for component classification, .854 for relation identification, and .844 for stance recognition. The ILP joint model achieves almost human performance for classifying argument components. Its F1 score is only .042 lower than the human upper bound. Regarding relation identification and stance recognition, the F1 scores of our model are .103 and .164 lower than human performance. Thus, our model achieves 95.2% of human performance for component identification, 87.9% for relation identification, and 80.5% for stance recognition.
In order to verify the effectiveness of our approach, we also evaluated the ILP joint model on the English microtext corpus (cf. Section 2.4). For ensuring the comparability to previous results, we used the same data splitting and the repeated cross-validation setup described by Peldszus and Stede (2015). Since the microtext corpus does not include major claims, we removed the major claim label from our component classification model for this evaluation task. Furthermore, it was necessary to adapt several features of the base classifiers since the microtext corpus does not include non-argumentative text units. Therefore, we did not consider preceding tokens for lexical, indicator and embedding features and removed the probability feature of the component classification model. Additionally, we removed all genre-dependent features of both base classifiers.
The first three rows of the lower part in Table 12 show the results reported by Peldszus and Stede (2015) on the English microtext corpus. The simple model indicates their local base classifiers, Best EG is their best model for component classification, and MP+p is their best model for relation identification. On average our base classifiers outperform the base classifiers from Peldszus and Stede (2015) by .025. Only their relation identification model yields a better macro F1 score compared to our base classifier. Their Best EG model outperforms our model with respect to component classification and relation identification but yields a lower score for stance recognition. Their MP+p model outperforms the relation identification of our model, but yields lower results for component classification and stance recognition compared to our ILP joint model. This difference can be attributed to the additional information about the function and role attribute incorporated in their joint models (cf. Section 2.3). They showed that both have a beneficial effect on the component classification and relation identification in their corpus (Peldszus and Stede 2015, Figure 3). However, the role attribute is a unique feature of their corpus and the arguments in their corpus exhibit an unusually high proportion of attack relations (cf. Section 2.4). In particular, 86.6% of their arguments include attack relations, whereas the proportion of arguments with attack relations in our corpus amounts to only 12.4%. This proportion may even be lower in other text genres because essay writing guidelines encourage students to include opposing arguments in their writing. Therefore, we assume that incorporating function and role attributes will not be beneficial using our corpus.
The evaluation results show that our ILP joint model simultaneously improves the performance of component classification and relation identification on both corpora.
5.6 Error Analysis
In order to analyze frequent errors of the ILP joint model, we investigated the predicted argumentation structures in our test data. The confusion matrix of the component classification task (Table 13) shows that the highest confusion is between claims and premises. The model classifies 74 actual premises as claims and 82 claims as premises. By manually investigating these errors, we found that the model tends to label inner premises in serial structures as claims and wrongly identifies claims in sentences containing two premises. Regarding the relation identification, we observed that the model tends to identify argumentation structures which are shallower than the structures in our gold standard. The model correctly identifies only 34.7% of the 98 serial arguments in our test data. This can be attributed to the “claim-centered” weight calculation in our objective function. In particular, the predicted relations in matrix R are the only information about serial arguments, whereas the other two scores (c and cr) assign higher weights to relations pointing to claims.
Table 13 Confusion matrix of the ILP joint model of component classification on our test data
In order to determine if the ILP joint model correctly models the relationship between component types and argumentative relations, we artificially improved the predictions of both base classifiers as suggested by Peldszus and Stede (2015). The dashed lines in Figure 4 show the performance of the artificially improved base classifiers. Continuous lines show the resulting performance of the ILP joint model. Figures 4a and 4b show the effect of improving the component classification and relation identification. They show that correct predictions of one base classifier are not maintained after applying the ILP model if the other base classifier exhibits less accurate predictions. In particular, less accurate argumentative relations have a more detrimental effect on the component types (Figure 4a) than less accurate component types do on the outcomes of the relation identification (Figure 4b). Thus, it is more reasonable to focus on improving relation identification than component classification in future work.
Figure 4 Influence of improving the base classifiers (x-axis shows the proportion of improved predictions and y-axis the macro F1 score).
Figure 4c depicts the effect of improving both base classifiers, which illustrates that the ILP joint model improves the component types more effectively than argumentative relations. Figure 4c also shows that the ILP joint model improves both tasks if the base classifiers are improved. Therefore, we conclude that the ILP joint model successfully captures the natural relationship between argument component types and argumentative relations.
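The artificial improvement of a base classifier can be simulated along the following lines. This is a sketch under assumptions: it interprets the improvement as replacing a given proportion of the wrong predictions with gold labels, chosen at random, which may differ in detail from the procedure of Peldszus and Stede (2015). The function name and the `seed` parameter are hypothetical.

```python
import random


def improve_predictions(predicted, gold, proportion, seed=0):
    """Replace a proportion of the wrong predictions with their gold labels."""
    rng = random.Random(seed)
    wrong = [k for k, (p, g) in enumerate(zip(predicted, gold)) if p != g]
    fixed = set(rng.sample(wrong, round(proportion * len(wrong))))
    return [g if k in fixed else p for k, (p, g) in enumerate(zip(predicted, gold))]
```

Applying the joint model to the outputs of `improve_predictions` for increasing values of `proportion` yields curves of the kind shown in Figure 4, with a proportion of 0 corresponding to the unmodified base classifier and a proportion of 1 to gold input.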
Our argumentation structure parser includes several consecutive steps. Consequently, potential errors of the upstream models can negatively influence the results of the downstream models. For example, errors of the identification model can result in flawed argumentation structures if argumentatively relevant text units are not recognized or non-argumentative text units are identified as relevant. However, our identification model yields good accuracy and a score of .958 for identifying argument components. Therefore, it is unlikely that identification errors will significantly influence the outcome of the downstream models when applied to persuasive essays. However, as demonstrated by Levy et al. (2014) and Goudas et al. (2014), the identification of argument components is more complex in other text genres than it is in persuasive essays. Another potential issue of the pipeline architecture is that wrongly classified major claims will decrease the accuracy of the model because they are not integrated in the joint modeling
approach. For this reason, it is worthwhile to experiment in future work with structured machine learning methods which incorporate several tasks in a single model (Moens 2013).
In this work, we have demonstrated that our annotation scheme can be reliably applied to persuasive essays. However, persuasive essays exhibit a common structure and so it may be more challenging to apply the annotation scheme to text genres with less explicit argumentation structures such as social media data, product reviews or dialogical debates. Nevertheless, we believe that our annotation scheme can be successfully applied to other text genres with minor adaptations. Although other text genres may not include major claims, previous work has already demonstrated that claims and premises can be reliably annotated in legal cases (Mochales-Palau and Moens 2011), written dialogs (Biran and Rambow 2011) and even over multiple Wikipedia articles (Aharoni et al. 2014). Additionally, it is unknown if our tree assumption generalizes to other text genres. Although most previous work considered argumentation structures as trees, other text genres may include divergent arguments and even cyclic argumentation structures.
Although our approach shows promising results, it is still unknown if the identified argumentation structures can be used to provide adequate feedback about argumentation. However, the identified argumentation structures enable various kinds of feedback about argumentation. For instance, they facilitate the automatic recommendation of more meaningful and comprehensible argumentation structures. In particular, the extracted structure can be used to prevent multiple reasoning directions in a single argument (e.g. both forward and backward reasoning), which may result in a more comprehensible structure of arguments. It could also be used to highlight unsupported claims and then prompt the author for reasons supporting or attacking them (e.g. premises related to the claim). Additionally, the identified argumentation structure facilitates the recommendation of additional discourse markers in order to make the arguments more coherent, or can be used to encourage authors to discuss opposing views. Finally, the visualization of the identified argumentation structure could stimulate self-reflection and plausibility checking. However, finding adequate feedback types and investigating their effect on the argumentation skills of students requires the integration of the models in writing environments and extensive long-term user studies in future work.
In this paper, we presented an end-to-end approach for parsing argumentation structures in persuasive essays. Previous approaches suffer from several limitations: Existing approaches either focus only on particular subtasks of argumentation structure parsing or rely on manually created rules. Consequently, previous approaches are only of limited use for parsing argumentation structures in real application scenarios. To the best of our knowledge, the presented work is the first approach which covers all required subtasks for identifying the global argumentation structure of documents. We showed that jointly modeling argumentation structures simultaneously improves the results of component classification and relation identification. Additionally, we introduced a novel annotation scheme and a new corpus of persuasive essays annotated with argumentation structures which represents the largest resource of its kind. Both the corpus and the annotation guidelines are freely available in order to ensure reproducibility and for fostering future research in computational argumentation.
Table A1 shows the class distributions of the training and test data of the persuasive essay corpus for each analysis step.
The following tables show the model selection results for all five tasks using 5-fold cross-validation on our training data. Table B1 shows the results of using individual feature groups for the argument component identification task. Lexico-syntactic features perform best for identifying argument components, and they perform particularly well for recognizing the beginning of argument components (“Arg-B”). The second best features are structural features. They yield the best F1 score for separating argumentative from non-argumentative text units (“O”). Syntactic features are useful for identifying the beginning of argument components. The probability feature yields the lowest score. Nevertheless, we observe a significant decrease of .028 F1 score of “Arg-B” when evaluating the system without the probability feature. We obtain the best results by using all features. Since persuasive essays exhibit a particular paragraph structure which may not be present in other text genres (e.g. user-generated web discourse), we also evaluate the model without genre-dependent features (cf. Table 7). This yields a macro F1 score of .847 which is only .002 less compared to the model with all features.
Table B1 Argument component identification
Table B2 Argument component classification
Table B2 shows the model selection results of the classification model. Structural features are the only features which significantly outperform the heuristic baseline when used individually. They are the most effective features for identifying major claims. The second-best features for identifying claims are discourse features. With this knowledge, we can confirm the assumption that general discourse relations are useful for component classification (cf. Section 5.3.1). However, embedding features do not perform as well as lexical features. They yield lower F1 scores for major claims and claims. Contextual features are effective for identifying major claims since they implicitly capture if an argument component is present in the introduction or conclusion (cf. Section 5.3.1). Indicator features are most effective for identifying major claims but contribute only slightly to the identification of claims. Syntactic features are predictive of major claims and premises but are not effective for recognizing claims. The probability features are not informative for identifying claims, probably because forward indicators may also signal inner premises in serial structures. Omitting probability and embedding features yields the best accuracy. However, we select the best performing system by means of the macro F1 score which is more appropriate for imbalanced data sets. Accordingly, we select the model which uses all features (Table B2).
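The preference for macro F1 over accuracy on imbalanced data can be illustrated with a small Python sketch. It computes macro F1 as the unweighted mean of the per-class F1 scores, which is one common definition and an assumption here. The function names and the toy labels in the usage note are illustrative and not taken from our data.

```python
def f1_for_label(gold, predicted, label):
    """F1 score of one class; 0.0 if the class is never predicted correctly."""
    tp = sum(g == label and p == label for g, p in zip(gold, predicted))
    fp = sum(g != label and p == label for g, p in zip(gold, predicted))
    fn = sum(g == label and p != label for g, p in zip(gold, predicted))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(gold, predicted):
    """Unweighted mean of the per-class F1 scores."""
    labels = sorted(set(gold) | set(predicted))
    return sum(f1_for_label(gold, predicted, l) for l in labels) / len(labels)


def accuracy(gold, predicted):
    """Share of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)
```

On a toy sample with nine premises and one claim, a classifier that always predicts “premise” reaches an accuracy of .9, whereas its macro F1 stays below .5 because the F1 score of the claim class is 0.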
The model selection results for relation identification are shown in Table B3. We report the results of feature ablation tests since none of the feature groups yields remarkable results when used individually. We also found that removing any of the feature groups does not yield a significant difference compared to the model with all features. Structural features are the most effective features for identifying relations. The second- and third-most effective feature groups are indicator and PMI features. Both syntactic and discourse features yield a slight improvement when combining them with other features. Removing the shared noun features does not yield a difference in accuracy or macro F1 score although we observe a decrease of .002 macro F1 score when removing them from our best performing model. We achieve the best results by removing lexical features from the feature set.
Table B3 Argumentative relation identification (significant difference compared to SVM all features)
Table B4 shows the model selection results of the ILP joint model. Base+heuristic shows the result of applying the baseline to all paragraphs in which the base classifiers identify neither claims nor argumentative relations. The heuristic baseline is triggered in 31 paragraphs which results in 3.3% more trees identified compared to the base classifiers. However, the difference between Base+heuristic and the base classifiers is not statistically significant. For this reason, we can attribute any further improvements to the joint modeling approach. Moreover, Table B4 shows selected results of the hyperparameter tuning of the ILP joint model. Using only predicted relations in the ILP-naïve model does not yield an improvement over the base classifiers. ILP-relation uses only information from the relation identification base classifier. It significantly outperforms both base classifiers but converts a large number of premises to claims. The ILP-claim model uses only the outcomes of the argument component base classifier and improves neither component classification nor relation identification. All three models identify a relatively high proportion of claims compared to the number of claims in our training data. The reason for this is that many weights in W are 0. Combining the results of both base classifiers yields a considerably more balanced proportion of component type conversions. All three models (ILP-equal, ILP-same, and ILP-balanced) significantly outperform the base classifier for component classification. We identify the best performing system by means of the average macro F1 score for both tasks. Accordingly, we select ILP-balanced as our best performing ILP joint model.
Table B4 Joint modeling approach (improvement over base classifier; number of premises converted to claims; Trees = percentage of correctly identified trees)
Table B5 shows the model selection results for the stance recognition model. Using sentiment, structural and embedding features individually does not yield an improvement over the majority baseline. However, lexical, syntactic and discourse features yield a significant improvement over the heuristic baseline when used individually. Although lexical features perform best individually, there is no significant difference when removing them from the feature set. Since omitting any of the feature groups yields a lower macro F1 score, we select the model with all features as the best performing model.
Table B5 Stance recognition (compared to SVM all features)
Table C1 shows all of the lexical indicators we extracted from 30 persuasive essays. The lists include 22 forward indicators, 33 backward indicators, 48 thesis indicators and 10 rebuttal indicators.
agreement for computational linguistics. Computational Linguistics, 34(4):555–596.
Conversation. Cambridge University Press, Cambridge.
Beigman Klebanov, Beata and Derrick Higgins. 2012. Measuring the use of factual information in test-taker essays. In Proceedings of the Seventh Workshop on Building Educational Applications Using NLP, pages 63–72, Montreal, Quebec, Canada.
Bentahar, Jamal, Bernard Moulin, and Micheline Bélanger. 2010. A taxonomy of argumentation models used for knowledge representation. Artificial Intelligence Review, 33(3):211–259.
Carlson, Lynn, Daniel Marcu, and Mary Ellen Okurowski. 2001. Building a discourse-tagged corpus in the framework of rhetorical structure theory. In Proceedings of the Second SIGdial Workshop on Discourse and Dialogue - Volume 16, SIGDIAL ’01, pages 1–10, Aalborg, Denmark.
13:145–158.
TC: A Java-based framework for supervised learning experiments on textual data. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics. System Demonstrations, pages 61–66, Baltimore, MD, USA.
Eckart de Castilho, Richard and Iryna Gurevych. 2014. A broad-coverage collection of portable NLP components for building shareable analysis pipelines. In Nancy Ide and Jens Grivolla, editors, Proceedings of the Workshop on Open Infrastructures and Analysis Frameworks for HLT (OIAF4HLT) at COLING 2014, pages 1–11, Dublin, Ireland.
Computational Linguistics, in press.
Hasan, Kazi Saidul and Vincent Ng. 2014. Why are you taking this stance? Identifying and classifying reasons in ideological debates. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 751–762, Doha, Qatar.
Persuasive Essays. Great Source Education Group.
Kirschner, Christian, Judith Eckle-Kohler, and Iryna Gurevych. 2015. Linking the thoughts: Analysis of argumentation structures in scientific publications. In Proceedings of the 2nd Workshop on Argumentation Mining, pages 1–11, Denver, CO, USA.
Lafferty, John D., Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML ’01, pages 282–289, San Francisco, CA, USA.
le Cessie, S. and J.C. van Houwelingen. 1992. Ridge estimators in logistic regression. Applied Statistics, 41(1):191–201.
Daniel Hershcovich, Ehud Aharoni, and Noam Slonim. 2014. Context dependent claim detection. In Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014), pages 1489–1500, Dublin, Ireland.
claim detection for argument mining. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015), pages 185–191, Buenos Aires, Argentina.
Mochales-Palau, Raquel and Aagje Ieven. 2009. Creating an argumentation corpus: Do theories apply to real arguments? A case study on the legal argumentation of the ECHR. In Proceedings of the 12th International Conference on Artificial Intelligence and Law (ICAIL ’09), pages 21–30, Barcelona, Spain.
Mochales-Palau, Raquel and Marie-Francine Moens. 2009. Argumentation mining: The detection, classification and structure of arguments in text. In Proceedings of the 12th International Conference on Artificial Intelligence and Law, ICAIL ’09, pages 98–107, Barcelona, Spain.
Mochales-Palau, Raquel and Marie-Francine Moens. 2011. Argumentation mining. Artificial Intelligence and Law, 19(1):1–22.
Somasundaran, Swapna and Janyce Wiebe. 2009. Recognizing stances in online debates. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, ACL ’09, pages 226–234, Suntec, Singapore.
van Eemeren, Frans H., Rob Grootendorst, and Francisca Snoeck Henkemans. 1996. Fundamentals of Argumentation Theory: A Handbook of Historical Backgrounds and Contemporary Developments. Routledge, Taylor & Francis Group.