File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/04/c04-1061_metho.xml
Size: 14,797 bytes
Last Modified: 2025-10-06 14:08:41
<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1061"> <Title>Machine-Assisted Rhetorical Structure Annotation</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Approaches to rhetorical analysis </SectionTitle> <Paragraph position="0"> There are two different perspectives on the task of discourse parsing: an &quot;ideal&quot; one that aims at modelling a systematic, incremental process; and an &quot;empirical&quot; one that takes the experiences of human annotators into account. &quot;Ideally&quot;, discourse analysis proceeds incrementally from left to right, where for each new segment, an attachment point and a relation (or more than one of each, cf. SDRT) are computed and the discourse structure grows step by step. This view is taken for instance in SDRT (Asher, Lascarides, 2003), which places emphasis on the notion of 'right frontier' (also discussed recently by Webber et al., 2003).</Paragraph> <Paragraph position="1"> However, when we trained two (experienced) students to annotate the 171 newspaper commentaries of the Potsdam Commentary Corpus (Stede, 2004) and upon completion of the task asked them about their experiences, a very different picture emerged.
Both annotators agreed that a strict left-to-right approach is highly impractical, because the intended argumentative structure of the text often becomes clear only in retrospect, after reflecting on the possible contributions of the segments to the larger scheme.</Paragraph> <Paragraph position="2"> Footnote 3: This assessment of relative difficulty does not carry over to PDTB, where the annotations are more complex than in our step 1 but do not go as far as building rhetorical structures.</Paragraph> <Paragraph position="3"> Thus they very soon settled on a bottom-up approach: First, mark the transparent cases, in which a connective undoubtedly signals a relation between two segments (footnote 4). Then, see how the resulting pieces fit together into a structure that mirrors the argument presented.</Paragraph> <Paragraph position="4"> The annotators used RSTTool (O'Donnell, 1997), which worked reasonably well for the purpose. However, since we also have in our group an XML-based lexicon of German connectives at our disposal (Berger et al., 2002), why not use this resource to speed up the first phase of the annotation?</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Annotating connectives and their scopes </SectionTitle> <Paragraph position="0"> In our definition of 'connective', we largely follow Pasch et al. (2003) (a substantial catalogue and classification of German connectives), who require connectives to take two arguments that can potentially be full clauses and that semantically denote two-place relations between eventualities (but they need not always be spelled out as clauses). From the syntactic viewpoint, they are a rather inhomogeneous group consisting of subordinating and coordinating conjunctions, some prepositions, and a number of sentence adverbials. We refer to the two related units as an 'internal' and an 'external' one, where the 'internal' one is the unit of which the connective is actually a part.
For example, in 'Despite the heavy rain we had a great time', the noun phrase 'the heavy rain' is the internal unit, since it forms a syntactic phrase together with the preposition. Notice that this is a case where the eventuality (a state of weather) is not made explicit by a verb.</Paragraph> <Paragraph position="1"> As indicated, this step of annotating connectives and units is closely related to the idea of the PDTB project, which seeks to develop a large corpus annotated with information on discourse structure for English texts. For this purpose, annotators are provided with detailed annotation guidelines, which point out various challenges in the annotation process for explicit as well as empty connectives and their respective arguments. They include, among others: words/phrases that look like connectives, but prove not to take two propositional arguments; words/phrases as preposed predicate complements; pre- and post-modified connectives; co-occurring connectives; single and multiple clauses/sentences as arguments of connectives; annotation of discontinuous connective arguments. Footnote 4: The clearest cases are subjunctors, which always mark a relation between matrix clause and embedded clause.</Paragraph> <Paragraph position="2"> Annotators also have to make syntactic judgements, which is not the case in our approach (where syntax would be done on a different annotation layer, see Stede, 2004).</Paragraph> <Paragraph position="3"> In the following, we briefly explain the most important problematic issues with annotating German connectives and the way we deal with them, using our annotation scheme for ConAno.
</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Issues with German connectives </SectionTitle> <Paragraph position="0"> Connective or not: Some words can be used as connectives or in other functions, such as und ('and'), which can for example conjoin clauses (connective) or items in a list (no connective).</Paragraph> <Paragraph position="1"> Which relation: Some connectives can signal more than one relation, as the above-mentioned 'but' and its German counterpart aber. Complex connectives: Connectives can be phrasal (e.g., aus diesem Grund, 'for this reason') or even discontinuous (e.g., entweder ... oder, 'either ... or'). Moreover, some may be used in more than one order (wenn A, dann B / dann B, wenn A / dann, wenn A, B; 'if ... then ...').</Paragraph> <Paragraph position="2"> Multiple connectives/relations: Some connectives can be joined to form a complex one, which might then signal more than one relation (e.g., combinations with und and aber, such as aber dennoch, 'but still').</Paragraph> <Paragraph position="3"> Modified connectives: Some but not all connectives are subject to modification (e.g., nur dann, wenn, 'only then, if'; besonders weil, 'especially because').</Paragraph> <Paragraph position="4"> Embedded segments: The minimal units linked by the connective may be embedded rather than adjacent: Wir müssen, weil die Zeit drängt, uns Montag treffen ('We have to, because time is short, meet on Monday').</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 A DTD and an Example </SectionTitle> <Paragraph position="0"> As the first step toward an annotation tool, we defined an XML format for texts with connectives and their scopes.
Figure 1 shows the DTD, and Figure 2 a short sample annotation of a single, yet complex, sentence: Auch Berlin koennte, jedenfalls dann, wenn der Bund sich erkenntlich zeigt, um die Notlage seiner Hauptstadt zu lindern, davon profitieren. ('Berlin, too, could, at least if the federation shows some gratitude in order to alleviate the emergency of its capital, profit from it.') The DTD introduces XML tags for each of the connectives (&lt;connective&gt;), their possible modifiers (&lt;modifier&gt;) and respective discourse units (&lt;unit&gt;, where the type can be filled by int or ext), as well as the entire text (&lt;discourse&gt;).</Paragraph> <Paragraph position="1"> Henceforth, we will refer to the text unit containing the connective as the internal unit ('int-unit') and to the other, external one as the 'ext-unit'. Using this DTD, it is possible to represent the range of problematic phenomena discussed in the previous section.</Paragraph> <Paragraph position="2"> Connective or not: Only those words actually used as connectives will be marked with the &lt;connective&gt; tag, while others such as the frequently occurring und ('and') or oder ('or') will remain unmarked if they merely conjoin items in a list.</Paragraph> <Paragraph position="3"> Which relation: The &lt;connective&gt; tag includes a rel attribute for optional specification of the rhetorical relation that holds between the connected clauses.</Paragraph> <Paragraph position="4"> Complex connectives: Using an XML-based annotation scheme, we can easily mark phrasal connectives such as aus diesem Grund ('for this reason') using the &lt;connective&gt; tag.</Paragraph> <Paragraph position="5"> In order for discontinuous connectives to be annotated correctly, we introduce an id attribute that provides every connective with a distinct reference number. This way connectives such as entweder ... oder ('either ... or') can be represented as belonging together.
(See the &lt;connective id=&quot;4&quot; rel=&quot;condition&quot;&gt; tags in Figure 2.) Multiple connectives/relations: In our annotation scheme, complex connectives such as aber dennoch ('but still') are treated as two distinct connectives that indicate different relations holding between the same units.</Paragraph> <Paragraph position="6"> Modified connectives: Connective modifiers are marked with a special &lt;modifier&gt; tag, which is embedded inside the &lt;connective&gt; tag, as shown with jedenfalls modifying dann in our example. Hence an additional id attribute for this tag is not necessary.</Paragraph> <Paragraph position="7"> Embedded segments: Discourse units are marked using the &lt;unit&gt; tag, which also provides an id attribute. On the one hand, this is used for assigning discourse units to their respective connectives; on the other hand, it provides a way of dealing with discontinuous discourse units, as the example shows.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 The ConAno annotation tool </SectionTitle> <Paragraph position="0"> A range of relatively generic linguistic annotation tools are available today, but none of them turned out to be suitable for our purposes: We seek a very easy-to-use, platform-independent tool for mouse-marking ranges of text and having them associated with one another. Consequently, we decided to implement our own Java-based tool, ConAno, which is geared especially to connective/scope annotation and thus can have a very intuitive user interface.</Paragraph> <Paragraph position="1"> Just like discourse parsers do, ConAno exploits the fact that connectives are the most reliable source of information. Rather than attempting an automatic disambiguation, however, ConAno merely makes suggestions to the human analyst, which she might follow or discard.
In particular, ConAno loads a list of (potential) connectives, and when annotation of a text begins, highlights each candidate so that the user can either confirm it (by marking its scope) or discard it (by mouse click) if it is not used as a connective. Furthermore, the connective list optionally may contain associated coherence relations, which are then also offered to the user for selection. This annotation phase is thus purely data-driven: Attention is paid only to the connectives and their specific relation candidates.</Paragraph> <Paragraph position="2"> To elaborate a little, the annotation process proceeds as follows. The text is loaded into the annotation window, and the first potential connective is automatically highlighted. Potential preposed or postposed modifiers, if any, of the connective are also highlighted (in a different color). The user moves with the mouse from one connective to the next and can, with a mouse click: discard a highlighted item (it is not a connective or not a modifier); call up a help window explaining the syntactic behavior and the relations of this connective; call up a suggestion for the int-unit (i.e., the text portion is highlighted); analogously call up a suggestion for the ext-unit; choose from a menu of the relations associated with this connective.</Paragraph> <Paragraph position="3"> A screenshot is given in Figure 4. The suggestions for int-unit and ext-unit are made by ConAno on the basis of the syntactic category of the connective; we use simple rules like &quot;search up to the next comma&quot; to find the likely int-unit for a subjunctor, or &quot;search the preceding two full-stops&quot; to find the ext-unit for an adverbial (the preceding sentence). The suggestions may be wrong; then the user discards them and marks the units with the mouse herself.
The result of this annotation phase is an XML file like the (very short) one shown in Figure 2.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Overall annotation environment </SectionTitle> <Paragraph position="0"> A central design objective is to keep the environment neutral with respect to the languages of the text, the connectives to be annotated, and the coherence relations associated with them.</Paragraph> <Paragraph position="1"> Accordingly, the list of connectives is external and read into ConAno upon startup. In our case, we use an XSLT sheet to map our 'Discourse Marker Lexicon' (see below) to the input format of ConAno. The text to be annotated is expected in plain ASCII. When annotation is complete, the result (or an intermediate result) can be saved in our XML format introduced in section 3.2. Optionally, it can be exported to the 'rs3' format developed by O'Donnell (1997) for his RSTTool. This allows for a smooth transition: the RSTTool user can now open the file produced by ConAno, which amounts to a partial rhetorical analysis of the text, and which the user can now complete to a full tree.</Paragraph> <Paragraph position="2"> Our Discourse Marker Lexicon 'DiMLex' (Berger et al., 2002) assembles information on 140 German connectives, giving a range of syntactic, semantic, and pragmatic features, including the coherence relations along the lines of (Mann, Thompson, 1988). These are encoded in an application-neutral XML format (see Figure 5), which is mapped with XSLT sheets to various NLP applications. Our new proposal here is to use it also for interactive connective annotation. Hence, we wrote an XSLT sheet that maps DiMLex to a reduced list, where each connective is associated with one of the syntactic labels coordination, subordination or adverb and with &lt;coh-relation&gt; entries for its potential relations; see Figure 6 for the DTD and Figure 7 for an example.
The syn value, which is quite simple for these purposes, has been mapped from the more complex classification in DiMLex under kat (German for category). This format is the input to ConAno.</Paragraph> <Paragraph position="3"> As indicated above, we do not see the transition to RSTTool as a necessary step. Rather, the intermediate result of connective/scope annotation is useful in its own right, as it encodes those aspects of rhetorical structure that are independent of the chosen set of coherence relations and the conditions of assigning them.</Paragraph> </Section> </Paper>
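The candidate highlighting and the two unit-suggestion rules described in Section 4 ("search up to the next comma" for the int-unit of a subjunctor, and "search the preceding two full-stops" for the ext-unit of an adverbial) can be sketched as follows. This is only an illustrative Python sketch under stated assumptions, not ConAno's Java implementation: the function names, the two-entry connective list, and the use of character offsets are all hypothetical choices made for the example.

```python
import re

# Hypothetical mini-lexicon mapping a connective to its syntactic label.
# The label names follow the reduced list described in Section 5;
# the two entries are illustrative and not taken from DiMLex.
CONNECTIVES = {"weil": "subordination", "deshalb": "adverb"}


def find_candidates(text, lexicon=CONNECTIVES):
    """Return sorted (start, end, connective) triples for lexicon words in text."""
    hits = []
    for word in lexicon:
        pattern = r"\b" + re.escape(word) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.end(), word))
    return sorted(hits)


def suggest_int_unit(text, start, end):
    """Subjunctor rule: span from the connective up to the next comma.

    Falls back to the end of the text if no comma follows.
    """
    comma = text.find(",", end)
    stop = comma if comma != -1 else len(text)
    return text[start:stop].strip()


def suggest_ext_unit(text, start):
    """Adverbial rule: the preceding sentence, i.e. the text between
    the two full stops that precede the connective."""
    last = text.rfind(".", 0, start)
    if last == -1:
        return ""  # no preceding sentence to suggest
    prev = text.rfind(".", 0, last)  # -1 if the sentence is text-initial
    return text[prev + 1:last + 1].strip()


embedded = "Wir müssen, weil die Zeit drängt, uns Montag treffen."
start, end, _ = find_candidates(embedded)[0]
print(suggest_int_unit(embedded, start, end))

sequence = "Es regnet. Die Zeit drängt. Deshalb treffen wir uns Montag."
start, end, _ = find_candidates(sequence)[0]
print(suggest_ext_unit(sequence, start))
```

As in the tool described above, such suggestions are deliberately shallow and may be wrong (for instance when a comma occurs inside the int-unit), which is why the annotator can discard them and mark the units by hand.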