<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1061">
  <Title>Machine-Assisted Rhetorical Structure Annotation</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> A number of approaches tackling the difficult problem of automatic discourse parsing have been proposed in recent years (e.g., (Sumita et al., 1992), (Marcu, 1997), (Schilder, 2002)).</Paragraph>
    <Paragraph position="1"> They differ in their orientation toward symbolic or statistical information, but they all, quite naturally, share the assumption that the lexical connectives or discourse markers are the primary source of information for constructing a rhetorical tree automatically. The density of discourse markers in a text depends on its genre (e.g., commentaries tend to have more than narratives), but in general, it is clear that only a portion of the relations holding in a text is lexically signalled.1 Furthermore, it is well known that discourse markers are often ambiguous; for example, the English connective 'but' can, in terms of (Mann, Thompson, 1988), signal any of the relations Antithesis, Contrast, and Concession.</Paragraph>
    <Paragraph position="2"> Accordingly, automatic discourse parsing focusing on connectives is bound to have its limitations. [Footnote 1] In our corpus of newspaper commentaries (Stede, 2004), we found that 35% of the coherence relations are signalled by a connective.</Paragraph>
    <Paragraph position="3"> Our position is that progress in discourse parsing relies on the one hand on a more thorough understanding of the underlying issues, and on the other hand on the availability of human-annotated corpora, which can serve as a resource for in-depth studies of discourse-structural phenomena, and also for training statistical analysis programs. Two examples of such corpora are the RST Tree Corpus by (Marcu et al., 1999) for English and the Potsdam Commentary Corpus (Stede, 2004) for German. Producing such resources is a labour-intensive task that requires time, trained annotators, and clearly specified guidelines on what relation to choose under which circumstances.</Paragraph>
    <Paragraph position="4"> Nonetheless, rhetorical analysis remains in part a rather subjective process (see section 2). In order to eventually arrive at more objective, comparable results, our proposal is to split the annotation process into two parts:
      1. Annotation of connectives, their scopes (the two related textual units), and, optionally, the signalled relation
      2. Annotation of the remaining (unsignalled) relations between larger segments
    Step 1 is inspired by work done for English in the Penn Discourse TreeBank2 (Miltsakaki et al., 2004). In our two-step scenario, it is the easier part of the whole task in that connectives can be quite clearly identified, their scopes are often (but not always, see below) transparent, and the coherence relation is often clear. We see the result of step 1 as a corpus resource in its own right (it can be used for training statistical classifiers, for instance) and at the same time as the input for step 2, which &quot;fills the gaps&quot;: now annotators have to decide how the set of small trees produced in step 1 is best arranged in one complete tree, which involves assigning relations to instances without any lexical signals and also making more complicated scope judgements across larger spans of text; this is the more subjective and also more time-consuming step.3 Our approach is as follows. To speed up the annotation process in step 1, we have developed an XML format and a dedicated analysis tool called ConAno, which will be introduced in Section 4. ConAno can export the annotated text in the 'rs3' format that serves as input to O'Donnell's RST Tool (O'Donnell, 1997).
His original idea was that manual annotation be done completely with his tool; we opted, however, to use it only for step 2, and will motivate this overall architecture in Section  The net result is a modular, XML-based annotation environment for machine-assisted rhetorical analysis, which we see as on the one hand less ambitious than fully automatic discourse parsing and on the other hand as more efficient than completely 'manual' analysis.</Paragraph>
  </Section>
</Paper>