<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1605"> <Title>Building a Discourse-Tagged Corpus in the Framework of Rhetorical Structure Theory</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Framework </SectionTitle> <Paragraph position="0"> Two principal goals underpin the creation of this discourse-tagged corpus: 1) The corpus should be grounded in a particular theoretical approach, and 2) it should be sufficiently large to offer potential for wide-scale use - including linguistic analysis, training of statistical models of discourse, and other computational linguistic applications. These goals necessitated a number of constraints on our approach. The theoretical framework had to be practical and repeatable over a large set of documents in a reasonable amount of time, with a significant level of consistency across annotators. Thus, our approach contributes to the community quite differently from detailed analyses of specific discourse phenomena in depth, such as anaphoric relations (Garside et al., 1997) or style types (Leech et al., 1997); analysis of a single text from multiple perspectives (Mann and Thompson, 1992); or illustrations of a theoretical model on a single representative text (Britton and Black, 1985; Van Dijk and Kintsch, 1983).</Paragraph> <Paragraph position="1"> Our annotation work is grounded in the Rhetorical Structure Theory (RST) framework (Mann and Thompson, 1988). We decided to use RST for three reasons: * It is a framework that yields rich annotations that uniformly capture intentional, semantic, and textual features that are specific to a given text.</Paragraph> <Paragraph position="2"> * Previous research on annotating texts with rhetorical structure trees (Marcu et al., 1999) has shown that texts can be annotated by multiple judges at relatively high levels of agreement. 
We aimed to produce annotation protocols that would yield even higher agreement figures.</Paragraph> <Paragraph position="3"> * Previous research has shown that RST trees can play a crucial role in building natural language generation systems (Hovy, 1993; Moore and Paris, 1993; Moore, 1995) and text summarization systems (Marcu, 2000); can be used to increase the naturalness of machine translation outputs (Marcu et al., 2000);</Paragraph> <Paragraph position="4"> and can be used to build essay-scoring systems that provide students with discourse-based feedback (Burstein et al., 2001). We suspect that RST trees can be exploited successfully in the context of other applications as well.</Paragraph> <Paragraph position="5"> In the RST framework, the discourse structure of a text can be represented as a tree defined in terms of four aspects: * The leaves of the tree correspond to text fragments that represent the minimal units of the discourse, called elementary discourse units. * The internal nodes of the tree correspond to contiguous text spans. * Each node is characterized by its nuclearity - a nucleus indicates a more essential unit of information, while a satellite indicates a supporting or background unit of information.</Paragraph> <Paragraph position="6"> * Each node is characterized by a rhetorical relation that holds between two or more non-overlapping, adjacent text spans.</Paragraph> <Paragraph position="7"> Relations can be of intentional, semantic, or textual nature.</Paragraph> <Paragraph position="8"> Below, we describe the protocol that we used to build consistent RST annotations.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Segmenting Texts into Units </SectionTitle> <Paragraph position="0"> The first step in characterizing the discourse structure of a text in our protocol is to determine the elementary discourse units (EDUs), which are the minimal building blocks of a discourse tree. 
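The tree representation just described, with EDUs as its leaves, can be sketched as a small data structure. This is only an illustrative Python sketch under stated assumptions, not the corpus's actual storage format: the class name `RSTNode`, its fields, and the method `edus` are hypothetical, the relation is stored on the parent node as a simplification, and the label "consequence" and the nucleus/satellite assignment for the two sample sentences are assumed for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class RSTNode:
    """One node of a discourse tree (hypothetical representation)."""
    nuclearity: str                    # "nucleus", "satellite", or "root"
    relation: Optional[str] = None     # relation holding among this node's children
    text: Optional[str] = None         # set only on leaves (EDUs)
    children: List["RSTNode"] = field(default_factory=list)

    def is_edu(self) -> bool:
        # Leaves of the tree are the elementary discourse units.
        return not self.children

    def edus(self) -> List[str]:
        # Collect leaf texts left to right, so every internal node
        # covers a contiguous span of the text.
        if self.is_edu():
            return [self.text or ""]
        collected: List[str] = []
        for child in self.children:
            collected.extend(child.edus())
        return collected


# Two-EDU tree; relation label and nuclearity assignment are assumptions.
tree = RSTNode(
    nuclearity="root",
    relation="consequence",
    children=[
        RSTNode(
            nuclearity="nucleus",
            text="Xerox Corp.'s third-quarter net income grew 6.2% on 7.3% higher revenue.",
        ),
        RSTNode(
            nuclearity="satellite",
            text="This earned mixed reviews from Wall Street analysts.",
        ),
    ],
)

print(tree.edus())
```

Keeping the text only on leaves means the span of any internal node is derived from its children instead of being stored twice.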
Mann and Thompson (1988, p. 244) state that &quot;RST provides a general way to describe the relations among clauses in a text, whether or not they are grammatically or lexically signalled.&quot; Yet, applying this intuitive notion to the task of producing a large, consistently annotated corpus is extremely difficult, because the boundary between discourse and syntax can be very blurry. The examples below, which range from two distinct sentences to a single clause, all convey essentially the same meaning, packaged in different ways: 1. [Xerox Corp.'s third-quarter net income grew 6.2% on 7.3% higher revenue.] [This earned mixed reviews from Wall Street analysts.] 2. [Xerox Corp.'s third-quarter net income grew 6.2% on 7.3% higher revenue,] [which earned mixed reviews from Wall Street analysts.] 3. [Xerox Corp.'s third-quarter net income grew 6.2% on 7.3% higher revenue,] [earning mixed reviews from Wall Street analysts.] 4. [The 6.2% growth of Xerox Corp.'s third-quarter net income on 7.3% higher revenue earned mixed reviews from Wall Street analysts.] In Example 1, there is a consequential relation between the first and second sentences. Ideally, we would like to capture that kind of rhetorical information regardless of the syntactic form in which it is conveyed. However, as Examples 2-4 illustrate, separating rhetorical from syntactic analysis is not always easy. Any decision on how to bracket elementary discourse units inevitably involves some compromises.</Paragraph> <Paragraph position="1"> Researchers in the field have proposed a number of competing hypotheses about what constitutes an elementary discourse unit. 
While some take the elementary units to be clauses (Grimes, 1975; Givon, 1983; Longacre, 1983), others take them to be prosodic units (Hirschberg and Litman, 1993), turns of talk (Sacks, 1974), sentences (Polanyi, 1988), intentionally defined discourse segments (Grosz and Sidner, 1986), or the &quot;contextually indexed representation of information conveyed by a semiotic gesture, asserting a single state of affairs or partial state of affairs in a discourse world,&quot; (Polanyi, 1996, p.5). Regardless of their theoretical stance, all agree that the elementary discourse units are non-overlapping spans of text.</Paragraph> <Paragraph position="2"> Our goal was to find a balance between granularity of tagging and ability to identify units consistently on a large scale. In the end, we chose the clause as the elementary unit of discourse, using lexical and syntactic clues to help determine boundaries: and resources company, is selling many of its assets] [to reduce its debts.]</Paragraph> <Paragraph position="4"> However, clauses that are subjects, objects, or complements of a main verb are not treated as EDUs: 7. [Making computers smaller often means sacrificing memory.] 8. [Insurers could see claims totaling nearly $1 billion from the San Francisco</Paragraph> <Paragraph position="6"> Relative clauses, nominal postmodifiers, or clauses that break up other legitimate EDUs, are treated as embedded discourse units: 9. [The results underscore Sears's difficulties] [in implementing the &quot;everyday low pricing&quot; strategy...] wsj_1105 10. [The Bush Administration,] [trying to blunt growing demands from Western Europe for a relaxation of controls on exports to the Soviet bloc,] [is questioning...] wsj_2326 Finally, a small number of phrasal EDUs are allowed, provided that the phrase begins with a strong discourse marker, such as because, in spite of, as a result of, according to. 
We opted for consistency in segmenting, sacrificing some potentially discourse-relevant phrases in the process.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Building up the Discourse Structure </SectionTitle> <Paragraph position="0"> Once the elementary units of discourse have been determined, adjacent spans are linked together via rhetorical relations, creating a hierarchical structure. Relations may be mononuclear or multinuclear. Mononuclear relations hold between two spans and reflect the situation in which one span, the nucleus, is more salient to the discourse structure, while the other span, the satellite, represents supporting information. Multinuclear relations hold among two or more spans of equal weight in the discourse structure. A total of 53 mononuclear and 25 multinuclear relations were used for the tagging of the RST Corpus. The final inventory of rhetorical relations is data-driven, and is based on extensive analysis of the corpus.</Paragraph> <Paragraph position="1"> Although this inventory is highly detailed, annotators strongly preferred keeping a higher level of granularity in the selections available to them during the tagging process. More extensive analysis of the final tagged corpus will demonstrate the extent to which individual relations that are similar in semantic content were distinguished consistently during the tagging process.</Paragraph> <Paragraph position="2"> The 78 relations used in annotating the corpus can be partitioned into 16 classes that share some type of rhetorical meaning: Change. For example, the class Explanation includes the relations evidence, explanation-argumentative, and reason, while Topic-Comment includes problem-solution, question-answer, statement-response, topic-comment, and comment-topic. 
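The partition of relations into classes amounts to a many-to-one mapping from relation labels to class names. The following is a minimal Python sketch, assuming a plain dictionary representation; the names `RELATION_CLASSES`, `CLASS_OF`, and `class_of` are hypothetical, and only the two classes whose members are listed above are included. The remaining 14 classes are omitted, not guessed.

```python
from typing import Dict, List, Optional

# Only the two classes whose members the text lists; the other 14 are omitted.
RELATION_CLASSES: Dict[str, List[str]] = {
    "Explanation": ["evidence", "explanation-argumentative", "reason"],
    "Topic-Comment": [
        "problem-solution",
        "question-answer",
        "statement-response",
        "topic-comment",
        "comment-topic",
    ],
}

# Invert the partition: each relation belongs to exactly one class.
CLASS_OF: Dict[str, str] = {
    relation: cls
    for cls, relations in RELATION_CLASSES.items()
    for relation in relations
}


def class_of(relation: str) -> Optional[str]:
    """Return the class of a relation, or None if this partial table lacks it."""
    return CLASS_OF.get(relation)


# Inventory size reported in the text: 53 mononuclear plus 25 multinuclear.
MONONUCLEAR, MULTINUCLEAR = 53, 25
TOTAL_RELATIONS = MONONUCLEAR + MULTINUCLEAR

print(class_of("evidence"))
print(class_of("question-answer"))
print(TOTAL_RELATIONS)
```

Collapsing fine-grained labels to their class in this way is one option for comparing annotations at a coarser granularity than the full inventory.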
In addition, three relations are used to impose structure on the tree: textual-organization, span, and same-unit (used to link parts of units separated by embedded units or spans).</Paragraph> </Section> </Section> </Paper>