<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1317"> <Title>Classification of Discourse Coherence Relations: An Exploratory Study using Multiple Knowledge Sources</Title> <Section position="5" start_page="117" end_page="118" type="metho"> <SectionTitle> 3 GraphBank </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="117" end_page="117" type="sub_section"> <SectionTitle> 3.1 Coherence Relations </SectionTitle> <Paragraph position="0"> For annotating the discourse relations in text, Wolf and Gibson (2005) assume a clause-unit-based definition of a discourse segment. They define four broad classes of coherence relations:
(1) 1. Resemblance: similarity (par), contrast (contr), example (examp), generalization (gen), elaboration (elab);
2. Cause-effect: explanation (ce), violated expectation (expv), condition (cond);
3. Temporal (temp): essentially narration;
4. Attribution (attr): reporting and evidential contexts.</Paragraph> <Paragraph position="1"> The textual evidence contributing to identifying the various resemblance relations is heterogeneous at best, where, for example, similarity and contrast are associated with specific syntactic constructions and devices. For each relation type, there are well-known lexical and phrasal cues:
(2) a. similarity: and;
b. contrast: by contrast, but;
c. example: for example;
d. elaboration: also, furthermore, in addition, note that;
e. generalization: in general.</Paragraph> <Paragraph position="2"> However, just as often, the relation is encoded through lexical coherence, via semantic association, sub/supertyping, and accommodation strategies (Asher and Lascarides, 2003). The cause-effect relations include conventional causation and explanation relations (captured as the label ce), such as (3) below:
(3) cause: SEG1: crash-landed in New Hope, Ga.,
effect: SEG2: and injuring 23 others.</Paragraph> <Paragraph position="3"> This class also includes conditionals and violated expectations, such as (4).</Paragraph> <Paragraph position="4"> (4) cause: SEG1: an Eastern Airlines Lockheed L-1011 en route from Miami to the Bahamas lost all three of its engines,
effect: SEG2: and land safely back in Miami.
The last two coherence relations annotated in GraphBank are temporal (temp) and attribution (attr) relations. The first corresponds generally to the occasion (Hobbs, 1985) or narration (Asher and Lascarides, 2003) relation, while the latter is a general annotation over attribution of source.2</Paragraph> </Section>
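For reference in the experiments reported later (Section 6.1 uses both the full label set and the collapsed coarse classes), the taxonomy above can be captured as a small lookup table. The sketch below is illustrative only: the abbreviated label strings are taken from the list in (1), and the exact strings used in the GraphBank files may differ.

```python
# Mapping from fine-grained GraphBank relation labels to the four coarse
# classes of Wolf and Gibson (2005).  The abbreviated label strings follow
# the list in (1); the strings in the actual corpus files may differ.
COARSE_CLASS = {
    "par": "resemblance",    # similarity
    "contr": "resemblance",  # contrast
    "examp": "resemblance",  # example
    "gen": "resemblance",    # generalization
    "elab": "resemblance",   # elaboration
    "ce": "cause-effect",    # causation / explanation
    "expv": "cause-effect",  # violated expectation
    "cond": "cause-effect",  # condition
    "temp": "temporal",      # narration / occasion
    "attr": "attribution",   # attribution of source
}

def collapse(label: str) -> str:
    """Map a fine-grained relation label to its coarse class."""
    return COARSE_CLASS[label]
```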
<Section position="2" start_page="117" end_page="118" type="sub_section"> <SectionTitle> 3.2 Discussion </SectionTitle> <Paragraph position="0"> The difficulty of annotating coherence relations consistently has been previously discussed in the literature. In GraphBank, as in any corpus, there are inconsistencies that must be accommodated for learning purposes. As perhaps expected, annotation of attribution and temporal sequence relations was consistent if not entirely complete. The most serious concern we had from working with the corpus derives from the conflation of diverse and semantically contradictory relations among the cause-effect annotations. For canonical causation pairs (and their violations) such as those above, (3) and (4), the annotation was expectedly consistent and semantically appropriate. Problems arise, however, when examining the treatment of purpose clauses and rationale clauses. These are annotated, according to the guidelines, as cause-effect pairings. Consider (5) below.</Paragraph> <Paragraph position="1"> (5) cause: SEG1: to upgrade lab equipment in 1987.
effect: SEG2: The university spent $30,000</Paragraph> <Paragraph position="2"> This is both counter-intuitive and temporally false. The rationale clause is annotated as the cause, and the matrix sentence as the effect. Things are even worse with purpose clause annotation. Consider the following example discourse:3
(6) John pushed the door to open it, but it was locked.</Paragraph> <Paragraph position="3"> This would have the following annotation in GraphBank:
(7) cause: to open it
effect: John pushed the door.</Paragraph> <Paragraph position="4"> The guideline reflects the appropriate intuition that the intention expressed in the purpose or rationale clause must precede the implementation of the action carried out in the matrix sentence. In effect, this would be something like
(8) [INTENTION TO SEG1] CAUSES SEG2
The problem here is that the cause-effect relation conflates real event-causation with telos-directed explanations, that is, action directed towards a goal by virtue of an intention. Given that these are semantically disjoint relations, which are furthermore triggered by distinct grammatical constructions, we believe this conflation should be undone and characterized as two separate coherence relations. If the relations just discussed were annotated as telic-causation, the features encoded for subsequent training of a machine learning algorithm could benefit from the distinct syntactic environments. We would like to automatically generate temporal orderings from cause-effect relations between the events directly annotated in the text. [Footnote 3: This specific example was brought to our attention by Alex Lascarides (p.c.).]</Paragraph> <Paragraph position="5"> Splitting these classes would preserve the soundness of such a procedure, while keeping them lumped generates inconsistencies.</Paragraph> </Section> </Section> <Section position="6" start_page="118" end_page="120" type="metho"> <SectionTitle> 4 Data Preparation and Knowledge Sources </SectionTitle> <Paragraph position="0"> In this section we describe the various linguistic processing components used for classification and identification of GraphBank discourse relations.</Paragraph> <Section position="2" start_page="118" end_page="118" type="sub_section"> <SectionTitle> 4.1 Pre-Processing </SectionTitle> <Paragraph position="0"> We performed tokenization, sentence tagging, part-of-speech tagging, and shallow syntactic parsing (chunking) over the 135 GraphBank documents. Part-of-speech tagging and shallow parsing were carried out using the Carafe implementation of Conditional Random Fields for NLP (Wellner and Vilain, 2006) trained on various standard corpora. In addition, full sentence parses were obtained using the RASP parser (Briscoe and Carroll, 2002). Grammatical relations derived from a single top-ranked tree for each sentence (headword, modifier, and relation type) were used for feature construction.</Paragraph> </Section>
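To make the last point concrete, the sketch below shows one plausible way to turn cross-segment grammatical relations into feature strings of the kind listed later for the Syntax feature class (gr-ncmod, gr-ncmod-head1-equipment, etc.). The (relation, head, dependent) triple format and the assignment of head1/head2 to words from the first and second segment are assumptions for illustration; this is not the RASP parser's actual output format.

```python
def cross_segment_syntax_features(grs, seg1_tokens, seg2_tokens):
    """Build feature strings for dependencies linking the two segments.

    `grs` is assumed to be an iterable of (relation, head, dependent)
    triples for the sentence containing both segments, e.g.
    ("ncmod", "spent", "equipment").  Hypothetical representation; the
    real parser output would first need to be converted into this form.
    """
    seg1, seg2 = set(seg1_tokens), set(seg2_tokens)
    feats = set()
    for rel, head, dep in grs:
        # Keep only relations that span the two segments (either direction).
        if head in seg2 and dep in seg1:
            w1, w2 = dep, head
        elif head in seg1 and dep in seg2:
            w1, w2 = head, dep
        else:
            continue
        feats.add(f"gr-{rel}")              # relation type alone
        feats.add(f"gr-{rel}-head1-{w1}")   # conjoined with the segment-1 word
        feats.add(f"gr-{rel}-head2-{w2}")   # conjoined with the segment-2 word
    return feats

# Toy usage for the segment pair in Example (5):
grs = [("ncmod", "spent", "equipment")]
print(cross_segment_syntax_features(
    grs,
    ["to", "upgrade", "lab", "equipment", "in", "1987"],
    ["The", "university", "spent", "$", "30,000"]))
```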
<Section position="3" start_page="118" end_page="119" type="sub_section"> <SectionTitle> 4.2 Modal Parsing and Temporal Ordering of Events </SectionTitle> <Paragraph position="0"> We performed both modal parsing and temporal parsing over events. Identification of events was performed using EvITA (Saurí et al., 2006), an open-domain event tagger developed under the TARSQI research framework (Verhagen et al., 2005). EvITA locates and tags all event-referring expressions in the input text that can be temporally ordered. In addition, it identifies the grammatical features implicated in the temporal and modal information of events, namely tense, aspect, polarity, and modality, as well as the event class. Event annotation follows version 1.2.1 of the TimeML specifications.4 Modal parsing, in the form of identifying subordinating verb relations and their type, was performed using SlinkET (Saurí et al., 2006), another component of the TARSQI framework. SlinkET identifies subordination constructions introducing modality information in text; essentially, infinitival and that-clauses embedded by factive predicates (regret), reporting predicates (say), and predicates referring to events of attempting (try), volition (want), command (order), among others.</Paragraph> <Paragraph position="1"> SlinkET annotates these subordination contexts and classifies them according to the modality information introduced by the relation between the embedding and embedded predicates, which can be any of the following types:
- factive: The embedded event is presupposed or entailed as true (e.g., John managed to leave the party).</Paragraph> <Paragraph position="2"> - counter-factive: The embedded event is presupposed or entailed as false (e.g., John was unable to leave the party).</Paragraph> <Paragraph position="3"> - evidential: The subordination is introduced by a reporting or perception event (e.g., Mary saw/told that John left the party).</Paragraph> <Paragraph position="4"> - negative evidential: The subordination is a reporting event conveying negative polarity (e.g., Mary denied that John left the party).</Paragraph> <Paragraph position="5"> - modal: The subordination creates an intensional context (e.g., John wanted to leave the party).</Paragraph> <Paragraph position="6"> Temporal orderings between events were identified using a Maximum Entropy classifier trained on the TimeBank 1.2 and Opinion 1.0a corpora.</Paragraph> <Paragraph position="7"> These corpora provide annotated events along with temporal links between events. The link types included: before (e1 occurs before e2), includes (e2 occurs sometime during e1), simultaneous (e1 occurs over the same interval as e2), begins (e1 begins at the same time as e2), and ends (e1 ends at the same time as e2).</Paragraph> </Section>
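These pairwise links are what the Tlink feature class (Section 5.2) counts between two discourse segments. The sketch below shows one way such counts might be derived; the event and link representations are purely illustrative assumptions, not the actual output format of the TARSQI tools or of our temporal classifier.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class TLink:
    """A temporal link between two events (hypothetical representation)."""
    event1: int    # index of the first event
    event2: int    # index of the second event
    relation: str  # one of: before, includes, simultaneous, begins, ends

def tlink_features(links, seg1_events, seg2_events):
    """Count the temporal link types holding between events of two segments."""
    s1, s2 = set(seg1_events), set(seg2_events)
    counts = Counter()
    for link in links:
        spans_segments = (link.event1 in s1 and link.event2 in s2) or \
                         (link.event1 in s2 and link.event2 in s1)
        if spans_segments:
            counts[f"tlink-{link.relation}"] += 1
    return counts

# Toy usage: a single 'before' link between an event in each segment.
print(tlink_features([TLink(3, 12, "before")], {3}, {12}))
```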
<Section position="4" start_page="119" end_page="120" type="sub_section"> <SectionTitle> 4.3 Lexical Semantic Typing and Coherence </SectionTitle> <Paragraph position="0"> Lexical semantic types, as well as a measure of lexical similarity or coherence between words in two discourse segments, would appear to be useful for assigning an appropriate discourse relationship. Resemblance relations, in particular, require similar entities to be involved, and lexical similarity here serves as an approximation to definite nominal coreference. Identification of lexical relationships between words across segments appears especially useful for cause-effect relations.</Paragraph> <Paragraph position="1"> In example (3) above, determining a (potential) cause-effect relationship between crash and injury is necessary to identify the discourse relation.</Paragraph> <Paragraph position="2"> Lexical similarity was computed using the Word Sketch Engine (WSE) (Kilgarriff et al., 2004) similarity metric applied over the British National Corpus. The WSE similarity metric implements the word similarity measure based on grammatical relations defined in (Lin, 1998), with minor modifications.</Paragraph> <Paragraph position="3"> As a second source of lexical coherence, we used the Brandeis Semantic Ontology or BSO (Pustejovsky et al., 2006). The BSO is a lexically-based ontology in the Generative Lexicon tradition (Pustejovsky, 2001; Pustejovsky, 1995). It focuses on contextualizing the meanings of words and does this through a rich system of types and qualia structures. For example, if one were to look up the phrase RED WINE in the BSO, one would find that its type is WINE and its type's type is ALCOHOLIC BEVERAGE. The BSO also contains ontological qualia information. Using the BSO, one is able to find out where in the ontological type system WINE is located, what RED WINE's lexical neighbors are, and its full set of part-of-speech and grammatical attributes. Other words have a different configuration of annotated attributes depending on the type of the word.</Paragraph> <Paragraph position="4"> We used the BSO typing information to semantically tag individual words in order to compute lexical paths between word pairs. Such lexical associations are invoked when constructing cause-effect relations and other implicatures (e.g., between crash and injure in Example 3).</Paragraph> <Paragraph position="5"> The type system paths provide a measure of the connectedness between words. For every pair of head words in a GraphBank document, the shortest path between the two words within the BSO is computed. Currently, this metric only uses the type system relations (i.e., inheritance), but preliminary tests show that including qualia relations as connections is promising. We also computed the earliest common ancestor of the two words. These metrics are calculated for every possible sense of each word within the BSO.</Paragraph>
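A minimal sketch of this path computation is given below, assuming the type system has been exported as a simple adjacency map from each type to its directly connected types; this representation and the toy hierarchy are hypothetical stand-ins, since the BSO's actual data structures are not described here. Shortest paths over inheritance links are then found by breadth-first search.

```python
from collections import deque

def shortest_type_path(graph, src, dst, max_len=10):
    """Breadth-first search for the shortest path between two BSO types.

    `graph` is assumed to map a type name to the set of types it is
    directly linked to (parents and children) -- a hypothetical stand-in
    for the BSO type system.  Returns the path as a list of type names,
    or None if no path of at most `max_len` steps exists.
    """
    if src == dst:
        return [src]
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if len(path) > max_len:
            continue
        for nxt in graph.get(path[-1], ()):
            if nxt in seen:
                continue
            if nxt == dst:
                return path + [nxt]
            seen.add(nxt)
            queue.append(path + [nxt])
    return None

# Toy hierarchy echoing the BSO path example in Table 1:
toy_types = {
    "ResearchLab": {"EducationalActivity"},
    "EducationalActivity": {"ResearchLab", "University"},
    "University": {"EducationalActivity"},
}
print(shortest_type_path(toy_types, "ResearchLab", "University"))
# ['ResearchLab', 'EducationalActivity', 'University']
```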
<Paragraph position="6"> The use of the BSO is advantageous compared to other frameworks such as WordNet because it focuses on the connection between words and their semantic relationship to other items. These connections are captured in the qualia information and the type system. In WordNet, qualia-like information is only present in the glosses, and glosses do not provide a definite semantic path between any two lexical items. Although synonymous in some ways, synset members often behave differently in many situations, grammatical or otherwise.</Paragraph> </Section> </Section> <Section position="7" start_page="120" end_page="120" type="metho"> <SectionTitle> 5 Classification Methodology </SectionTitle> <Paragraph position="0"> This section describes in detail how we constructed features from the various knowledge sources described above and how they were encoded in a Maximum Entropy model.</Paragraph> <Section position="1" start_page="120" end_page="120" type="sub_section"> <SectionTitle> 5.1 Maximum Entropy Classification </SectionTitle> <Paragraph position="0"> For our experiments in classifying relation types, we used a Maximum Entropy classifier5 to assign labels to each pair of discourse segments connected by some relation. For each instance (i.e., a pair of segments), the classifier makes its decision based on a set of features. Each feature can query some arbitrary property of the two segments, possibly taking into account external information or knowledge sources. For example, a feature could query whether the two segments are adjacent to each other, whether one segment contains a discourse connective, whether they both share a particular word, whether a particular syntactic construction or lexical association is present, etc. We make strong use of this ability to include very many, highly interdependent features6 in our experiments. Besides binary-valued features, feature values can be real-valued and thus capture frequencies, similarity values, or other scalar quantities.</Paragraph> </Section> <Section position="2" start_page="120" end_page="120" type="sub_section"> <SectionTitle> 5.2 Feature Classes </SectionTitle> <Paragraph position="0"> We grouped the features together into various feature classes based roughly on the knowledge source from which they were derived. Table 1 describes the various feature classes in detail and provides some actual example features from each class for the segment pair described in Example 5 in Section 3.2.</Paragraph> </Section> </Section>
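As a concrete illustration of this setup, the sketch below encodes a mixture of binary and real-valued features for segment pairs and trains a multinomial logistic regression model, which is mathematically equivalent to the Maximum Entropy classifier described above. The use of scikit-learn and the tiny toy feature dictionaries are illustrative assumptions; this is not the implementation used in the experiments.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression

# One feature dict per segment pair: binary features take value 1.0,
# scalar features (similarities, counts) keep their real values.
# Feature names and values here are illustrative only.
train_feats = [
    {"adjacent": 1.0, "first1-is-to": 1.0, "wse-sent-sim": 0.12},
    {"dist-less-than-3": 1.0, "tlink-before": 2.0},
]
train_labels = ["ce", "temp"]

vec = DictVectorizer()
X = vec.fit_transform(train_feats)

# Multinomial logistic regression == maximum entropy classification.
clf = LogisticRegression(max_iter=1000)
clf.fit(X, train_labels)

test = vec.transform([{"adjacent": 1.0, "wse-sent-sim": 0.3}])
print(clf.predict(test))
```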
<Section position="8" start_page="120" end_page="122" type="metho"> <SectionTitle> 6 Experiments and Results </SectionTitle> <Paragraph position="0"> In this section we provide the results of a set of experiments focused on the task of discourse relation classification. We also report initial results on relation identification with the same set of features as used for classification.</Paragraph> <Section position="1" start_page="120" end_page="121" type="sub_section"> <SectionTitle> 6.1 Discourse Relation Classification </SectionTitle> <Paragraph position="0"> The task of discourse relation classification involves assigning the correct label to a pair of discourse segments.7 The pair of segments to assign a relation to is provided (from the annotated data).</Paragraph> <Paragraph position="1"> In addition, we assume, for asymmetric links, that the nucleus and satellite are provided (i.e., the direction of the relation). For the elaboration relations, we ignored the annotated subtypes (person, time, location, etc.). Experiments were carried out on the full set of relation types as well as the simpler set of coarse-grained relation categories described in Section 3.1.</Paragraph> <Paragraph position="2"> The GraphBank contains a total of 8755 annotated coherence relations.8 For all the experiments in this paper, we used 8-fold cross-validation with 12.5% of the data used for testing and the remainder used for training for each fold. Accuracy numbers reported are the average accuracies over the 8 folds. Variance was generally low, with a standard deviation typically in the range of 1.5 to 2.0. We note here also that the inter-annotator agreement between the two GraphBank annotators was 94.6% for relations when they agreed on the presence of a relation. The majority class baseline (i.e., the accuracy achieved by calling all relations elaboration) is 45.7% (and 66.57% with the collapsed categories). These represent, respectively, the upper and lower bounds against which the results below should be measured.</Paragraph> <Paragraph position="3"> To ascertain the utility of each of the various feature classes, we considered each feature class independently, using only features from a single class in addition to the Proximity feature class, which serves as a baseline. Table 2 illustrates the result of this experiment.</Paragraph>
[Table 2: accuracy for fine- and coarse-grained relation types with each feature class added to the Proximity feature class; the table values are not preserved in this extraction.]
<Paragraph position="4"> We performed a second set of experiments, shown in Table 3, that is essentially the converse of the previous batch: we take the union of all the feature classes and perform ablation experiments by removing one feature class at a time.</Paragraph>
[Table 3: accuracy with each feature class removed from the union of all feature classes; the table values are not preserved in this extraction.]
<Paragraph position="5"> Table 1: Feature classes, with descriptions and example features for the segment pair in Example (5) of Section 3.2.
C - Words appearing at the beginning and end of the two discourse segments; these are often important discourse cue words. Example features: first1-is-to; first2-is-The.
P - Proximity and direction between the two segments (in terms of segments). Binary features such as distance less than 3 or distance greater than 10 were used in addition to the distance value itself; the distance from the beginning of the document was binned in a similar way. Example features: adjacent; dist-less-than-3; dist-less-than-5; direction-reverse; same-sentence.
BSO - Paths in the BSO up to length 10 between non-function words in the two segments. Example feature: ResearchLab -> EducationalActivity -> University.
WSE - WSE word-pair similarities between words in the two segments, binned as (> 0.05, > 0.1, > 0.2). We also computed sentence similarity as the sum of the word similarities divided by the sum of the sentence lengths.
Syntax - Grammatical dependency relations between the two segments as identified by the RASP parser. We also conjoined the relation with one or both of the headwords associated with the grammatical relation. Example features: gr-ncmod; gr-ncmod-head1-equipment; gr-ncmod-head2-spent.
Tlink - Temporal links between events in the two segments. We included both the link types and the number of occurrences of those types between the segments.</Paragraph> </Section>
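The comparisons in Tables 2 and 3 amount to retraining the same model on different subsets of the feature classes. A sketch of the ablation loop is given below; the per-instance representation (a dict from feature-class name to that class's feature dict) and the scikit-learn pipeline are illustrative assumptions, not the actual experimental code.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

def ablation_accuracies(instances, labels, folds=8):
    """Mean cross-validation accuracy with each feature class removed in turn.

    `instances` is assumed to be a list of dicts mapping a feature-class
    name (e.g. "C", "P", "Tlink") to that class's feature dict for one
    segment pair -- an illustrative representation only.
    """
    class_names = sorted({fc for inst in instances for fc in inst})
    results = {}
    for dropped in class_names:
        feats = [
            {f"{fc}:{name}": value
             for fc, fdict in inst.items() if fc != dropped
             for name, value in fdict.items()}
            for inst in instances
        ]
        model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
        scores = cross_val_score(model, feats, labels, cv=folds)
        results[dropped] = scores.mean()
    return results
```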
<Section position="4" start_page="121" end_page="122" type="sub_section"> <SectionTitle> 6.2 Analysis </SectionTitle> <Paragraph position="0"> From the ablation results, it is clear that overall performance is most impacted by the cue-word features (C) and proximity (P). Syntax and SlinkET also have high impact, improving accuracy by roughly 10 and 9 percent respectively, as shown in Table 2. From the ablation results in Table 3, it is clear that the utility of most of the individual feature classes is lessened when all the other feature classes are taken into account. This indicates that multiple feature classes are responsible for providing evidence for any given discourse relation. Removing a single feature class degrades performance, but only slightly, as the others can compensate.</Paragraph> <Paragraph position="1"> Overall precision, recall and F-measure results for each of the different link types using the set of all feature classes are shown in Table 4, with the corresponding confusion matrix in Table A.1. Performance correlates roughly with the frequency of the various relation types. We might therefore expect some improvement in performance with more annotated data for those relations with low frequency in the GraphBank.</Paragraph> </Section> <Section position="5" start_page="122" end_page="122" type="sub_section"> <SectionTitle> 6.3 Coherence Relation Identification </SectionTitle> <Paragraph position="0"> The task of identifying the presence of a relation is complicated by the fact that we must consider all n(n-1)/2 potential relations, where n is the number of segments. This presents a troublesome, highly skewed binary classification problem with a high proportion of negative instances. Furthermore, some of the relations, particularly the resemblance relations, are transitive in nature (e.g., parallel(S1, S2) and parallel(S2, S3) imply parallel(S1, S3)). However, these transitive links are not provided in the GraphBank annotation; such segment pairs would therefore be presented, incorrectly, as negative instances to the learner, making this approach infeasible. An initial experiment considering all segment pairs, in fact, resulted in performance only slightly above the majority class baseline.</Paragraph> <Paragraph position="1"> Instead, we consider the task of identifying the presence of discourse relations between segments within the same sentence. Using the same set of all features used for relation classification, performance is at 70.04% accuracy. Simultaneous identification and classification resulted in an accuracy of 64.53%. For both tasks the baseline accuracy was 58%.</Paragraph> </Section>
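To make the combinatorics behind this restriction concrete, the sketch below enumerates candidate segment pairs for the identification task, contrasting all n(n-1)/2 document-level pairs with the within-sentence pairs actually used here. The representation (one sentence index per segment) is an assumption for illustration.

```python
from itertools import combinations

def candidate_pairs(segment_sentences, within_sentence_only=True):
    """Enumerate candidate segment pairs for relation identification.

    `segment_sentences[i]` is assumed to give the sentence index of
    segment i (an illustrative representation).  Without the restriction,
    a document with n segments yields n*(n-1)/2 candidates, the vast
    majority of which are negative instances.
    """
    n = len(segment_sentences)
    for i, j in combinations(range(n), 2):
        if within_sentence_only and segment_sentences[i] != segment_sentences[j]:
            continue
        yield (i, j)

# Toy document: 6 segments spread over 3 sentences.
sentence_of_segment = [0, 0, 1, 1, 1, 2]
print(len(list(candidate_pairs(sentence_of_segment, within_sentence_only=False))))  # 15
print(list(candidate_pairs(sentence_of_segment)))  # [(0, 1), (2, 3), (2, 4), (3, 4)]
```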
<Section position="6" start_page="122" end_page="122" type="sub_section"> <SectionTitle> 6.4 Modeling Inter-relation Dependencies </SectionTitle> <Paragraph position="0"> Casting the problem as a standard classification task in which each instance is classified independently, as we have done, is a potential drawback. In order to gain insight into how collective, dependent modeling might help, we introduced additional features that model such dependencies: for a pair of discourse segments S1 and S2 whose relation is to be classified, we included features based on the other relations involving the two segments (from the gold-standard annotations), of the form rel(S1, Sk) = ri and rel(S2, Sl) = rj. Adding these features improved classification accuracy to 82.3%. This improvement is fairly significant (a 6.3% reduction in error) given that this dependency information is only encoded weakly, as features, and not in the form of model constraints.</Paragraph> </Section> </Section> </Paper>