<?xml version="1.0" standalone="yes"?> <Paper uid="P02-1047"> <Title>An Unsupervised Approach to Recognizing Discourse Relations</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In the field of discourse research, it is now widely agreed that sentences/clauses are usually not understood in isolation, but in relation to other sentences/clauses. Given the high level of interest in explaining the nature of these relations and in providing definitions for them (Mann and Thompson, 1988; Hobbs, 1990; Martin, 1992; Lascarides and Asher, 1993; Hovy and Maier, 1993; Knott and Sanders, 1998), it is surprising that there are no robust programs capable of identifying discourse relations that hold between arbitrary spans of text. Consider, for example, the sentence/clause pairs below.</Paragraph> <Paragraph position="1"> a. Such standards would preclude arms sales to states like Libya, which is also currently subject to a U.N. embargo.</Paragraph> <Paragraph position="2"> b. But states like Rwanda before its present crisis would still be able to legally buy arms. (1)</Paragraph> <Paragraph position="3"> a. South Africa can afford to forgo sales of guns and grenades b. because it actually makes most of its profits from the sale of expensive, high-technology systems like laser-designated missiles, aircraft electronic warfare systems, tactical radios, anti-radiation bombs and battlefield mobility systems. (2)</Paragraph> <Paragraph position="4"> In these examples, the discourse markers But and because help us figure out that a CONTRAST relation holds between the text spans in (1) and an EXPLANATION-EVIDENCE relation holds between the spans in (2). Unfortunately, cue phrases do not signal all relations in a text. In the corpus of Rhetorical Structure trees (www.isi.edu/~marcu/discourse/) built by Carlson et al.
(2001), for example, we have observed that only 61 out of 238 CONTRAST relations and 79 out of 307 EXPLANATION-EVIDENCE relations that hold between two adjacent clauses were marked by a cue phrase.</Paragraph> <Paragraph position="5"> So what shall we do when no discourse markers are used? If we had access to robust semantic interpreters, we could, for example, infer from sentence 1.a that &quot;cannot_buy_arms_legally(libya)&quot;, infer from sentence 1.b that &quot;can_buy_arms_legally(rwanda)&quot;, use our background knowledge in order to infer that &quot;similar(libya,rwanda)&quot;, and apply Hobbs's (1990) definitions of discourse relations to arrive at the conclusion that a CONTRAST relation holds between the sentences in (1). Unfortunately, the state of the art in NLP does not provide us access to semantic interpreters and general purpose knowledge bases that would support these kinds of inferences.</Paragraph> <Paragraph position="6"> The discourse relation definitions proposed by others (Mann and Thompson, 1988; Lascarides and Asher, 1993; Knott and Sanders, 1998) are no easier to apply, because they assume the ability to automatically derive, in addition to the semantics of the text spans, the intentions and illocutions associated with them. (Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, July 2002, pp. 368-375.)</Paragraph> <Paragraph position="7"> In spite of the difficulty of determining the discourse relations that hold between arbitrary text spans, it is clear that such an ability is important in many applications. First, a discourse relation recognizer would enable the development of improved discourse parsers and, consequently, of high performance single document summarizers (Marcu, 2000).
In multidocument summarization (DUC, 2002), it would enable the development of summarization programs capable of identifying contradictory statements both within and across documents and of producing summaries that reflect not only the similarities between various documents, but also their differences. In question-answering, it would enable the development of systems capable of answering sophisticated, non-factoid queries, such as &quot;what were the causes of X?&quot; or &quot;what contradicts Y?&quot;, which are beyond the state of the art of current systems (TREC, 2001).</Paragraph> <Paragraph position="8"> In this paper, we describe experiments aimed at building robust discourse-relation classification systems. To build such systems, we train a family of Naive Bayes classifiers on a large set of examples that are generated automatically from two corpora: a corpus of 41,147,805 English sentences that have no annotations, and BLIPP, a corpus of 1,796,386 automatically parsed English sentences (Charniak, 2000), which is available from the Linguistic Data Consortium (www.ldc.upenn.edu). We study empirically the adequacy of various features for the task of discourse relation classification and we show that some discourse relations can be correctly recognized with accuracies as high as 93%.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Discourse relation definitions and generation of training data </SectionTitle> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Background </SectionTitle> <Paragraph position="0"> In order to build a discourse relation classifier, one first needs to decide what relation definitions one is going to use.
In Section 1, we simply relied on the reader's intuition when we claimed that a CONTRAST relation holds between the sentences in (1).</Paragraph> <Paragraph position="1"> In reality, though, associating a discourse relation with a text span pair is a choice that is clearly influenced by the theoretical framework one is willing to adopt.</Paragraph> <Paragraph position="2"> If we adopt, for example, Knott and Sanders's (1998) account, we would say that the relation between sentences 1.a and 1.b is ADDITIVE, because no causal connection exists between the two sentences, PRAGMATIC, because the relation pertains to illocutionary force and not to the propositional content of the sentences, and NEGATIVE, because the relation involves a CONTRAST between the two sentences. In the same framework, the relation between clauses 2.a and 2.b will be labeled as CAUSAL-SEMANTIC-POSITIVE-NONBASIC. In Lascarides and Asher's theory (1993), we would label the relation between 2.a and 2.b as EXPLANATION because the event in 2.b explains why the event in 2.a happened (perhaps by CAUSING it). In Hobbs's theory (1990), we would also label the relation between 2.a and 2.b as EXPLANATION because the event asserted by 2.b CAUSED or could CAUSE the event asserted in 2.a. And in Mann and Thompson's theory (1988), we would label the sentence pair 1.a, 1.b as CONTRAST because the situations presented in them are the same in many respects (the purchase of arms), because the situations are different in some respects (Libya cannot buy arms legally while Rwanda can), and because these situations are compared with respect to these differences. By a similar line of reasoning, we would label the relation between 2.a and 2.b as EVIDENCE.</Paragraph> <Paragraph position="3"> The discussion above illustrates two points. First, it is clear that although current discourse theories are built on fundamentally different principles, they all share some common intuitions.
Sure, some theories talk about &quot;negative polarity&quot; while others talk about &quot;contrast&quot;. Some theories refer to &quot;causes&quot;, some to &quot;potential causes&quot;, and some to &quot;explanations&quot;. But ultimately, all these theories acknowledge that there are such things as CONTRAST, CAUSE, and EXPLANATION relations. Second, given the complexity of the definitions these theories propose, it is clear why it is difficult to build programs that recognize such relations in unrestricted texts. Current NLP techniques do not enable us to reliably infer from sentence 1.a that &quot;cannot_buy_arms_legally(libya)&quot; and do not give us access to general purpose knowledge bases that assert that &quot;similar(libya,rwanda)&quot;. The approach we advocate in this paper is in some respects less ambitious than current approaches to discourse relations because it relies upon a much smaller set of relations than those used by Mann and Thompson (1988) or Martin (1992). In our work, we focus on only four types of relations, which we call CONTRAST, CAUSE-EXPLANATION-EVIDENCE (CEV), CONDITION, and ELABORATION. (We define these relations in Section 2.2.) In other respects, though, our approach is more ambitious because it focuses on the problem of recognizing such discourse relations in unrestricted texts. In other words, given as input sentence pairs such as those shown in (1)-(2), we develop techniques and programs that label the relations that hold between these sentence pairs as CONTRAST, CAUSE-EXPLANATION-EVIDENCE, CONDITION, ELABORATION or NONE-OF-THE-ABOVE, even when the discourse relations are not explicitly signalled by discourse markers.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Discourse relation definitions </SectionTitle> <Paragraph position="0"> The discourse relations we focus on are defined at a much coarser level of granularity than in most discourse theories.
For example, we consider that a CONTRAST relation holds between two text spans if one of the following relations holds: CONTRAST, ANTITHESIS, CONCESSION, or OTHERWISE, as defined by Mann and Thompson (1988), CONTRAST or VIOLATED EXPECTATION, as defined by Hobbs (1990), or any of the relations characterized by this regular expression of cognitive primitives, as defined by Knott and Sanders (1998): (CAUSAL | ADDITIVE) - (SEMANTIC | PRAGMATIC) - NEGATIVE. In other words, in our approach, we do not distinguish between contrasts of semantic and pragmatic nature, contrasts specific to violated expectations, etc. Table 1 shows the definitions of the relations we considered.</Paragraph> <Paragraph position="1"> The advantage of operating with coarsely defined discourse relations is that it enables us to automatically construct relatively low-noise datasets that can be used for learning. For example, by extracting sentence pairs that have the keyword &quot;But&quot; at the beginning of the second sentence, as in the sentence pair shown in (1), we can automatically collect many examples of CONTRAST relations. And by extracting sentences that contain the keyword &quot;because&quot;, we can automatically collect many examples of CAUSE-EXPLANATION-EVIDENCE relations. As previous research in linguistics (Halliday and Hasan, 1976; Schiffrin, 1987) and computational linguistics (Marcu, 2000) shows, some occurrences of &quot;but&quot; and &quot;because&quot; do not have a discourse function, and others signal relations other than CONTRAST and CAUSE-EXPLANATION. So we can expect the examples we extract to be noisy. However, empirical work of Marcu (2000) and Carlson et al. (2001) suggests that the majority of occurrences of &quot;but&quot;, for example, do signal CONTRAST relations. (In the RST corpus built by Carlson et al.
(2001), 89 out of the 106 occurrences of &quot;but&quot; that occur at the beginning of a sentence signal a CONTRAST relation that holds between the sentence that contains the word &quot;but&quot; and the sentence that precedes it.) Our hope is that simple extraction methods are sufficient for collecting low-noise training corpora.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Generation of training data </SectionTitle> <Paragraph position="0"> In order to collect training cases, we mined two corpora in an unsupervised manner. The first corpus, which we call Raw, is a corpus of 1 billion words of unannotated English (41,147,805 sentences) that we created by concatenating various corpora made available over the years by the Linguistic Data Consortium. The second, called BLIPP, is a corpus of only 1,796,386 sentences that were parsed automatically by Charniak (2000). We extracted from both corpora all adjacent sentence pairs that contained the cue phrase &quot;But&quot; at the beginning of the second sentence and we automatically labeled the relation between the two sentences in each pair as CONTRAST. We also extracted all the sentences that contained the word &quot;but&quot; in the middle of a sentence; we split each extracted sentence into two spans, one containing the words from the beginning of the sentence to the occurrence of the keyword &quot;but&quot; and one containing the words from the occurrence of &quot;but&quot; to the end of the sentence; and we labeled the relation between the two resulting text spans as CONTRAST as well.</Paragraph> <Paragraph position="1"> Table 2 lists some of the cue phrases we used in order to extract CONTRAST, CAUSE-EXPLANATION-EVIDENCE, ELABORATION, and CONDITION relations and the number of examples extracted from the Raw corpus for each type of discourse relation. [Table 2 caption fragment: ... corpus of text span pairs labeled with discourse relations.]
In the patterns in Table 2, the symbols BOS and EOS denote BeginningOfSentence and EndOfSentence boundaries, the &quot;...&quot; stands for occurrences of any words and punctuation marks, the square brackets stand for text span boundaries, and the other words and punctuation marks stand for the cue phrases that we used in order to extract discourse relation examples. For example, the pattern [BOS Although ... ,] [... EOS] is used in order to extract examples of CONTRAST relations that hold between a span of text delimited to the left by the cue phrase &quot;Although&quot; occurring in the beginning of a sentence and to the right by the first occurrence of a comma, and a span of text that contains the rest of the sentence to which &quot;Although&quot; belongs.</Paragraph> <Paragraph position="2"> We also extracted automatically 1,000,000 examples of what we hypothesize to be non-relations, by randomly selecting non-adjacent sentence pairs that are at least 3 sentences apart in a given text. We label such examples NO-RELATION-SAME-TEXT. And we extracted automatically 1,000,000 examples of what we hypothesize to be cross-document non-relations, by randomly selecting two sentences from distinct documents.
As in the case of CONTRAST and CONDITION, the NO-RELATION examples are also noisy because long distance relations are common in well-written texts.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Determining discourse relations using Naive Bayes classifiers </SectionTitle> <Paragraph position="0"> We hypothesize that we can determine that a CONTRAST relation holds between the sentences in (3) even if we cannot semantically interpret the two sentences, simply because our background knowledge tells us that good and fails are good indicators of contrastive statements.</Paragraph> <Paragraph position="1"> Similarly, we hypothesize that we can determine that a CONTRAST relation holds between the sentences in (1), because our background knowledge tells us that embargo and legally are likely to occur in contexts of opposite polarity. In general, we hypothesize that lexical item pairs can provide clues about the discourse relations that hold between the text spans in which the lexical items occur.</Paragraph> <Paragraph position="2"> To test this hypothesis, we need to solve two problems. First, we need a means to acquire vast amounts of background knowledge from which we can derive, for example, that the word pairs good - fails and embargo - legally are good indicators of CONTRAST relations. The extraction patterns described in Table 2 enable us to solve this problem.1 Second, given vast amounts of training material, we need a means to learn which pairs of lexical items are likely to co-occur in conjunction with each discourse relation and a means to apply the learned parameters to any pair of text spans in order to determine the discourse relation that holds between them.
We solve the second problem in a Bayesian probabilistic framework.</Paragraph> <Paragraph position="3"> We assume that a discourse relation $r_k$ that holds between two text spans, $W_1$ and $W_2$, is determined by the word pairs in the cartesian product defined over the words in the two text spans, $(w_i, w_j) \in W_1 \times W_2$. In general, a word pair $(w_i, w_j) \in W_1 \times W_2$ can &quot;signal&quot; any relation $r_k$. We determine the most likely discourse relation that holds between two text spans $W_1$ and $W_2$ by computing $\arg\max_{r_k} P(r_k \mid W_1, W_2)$, which, according to Bayes' rule, amounts to computing $\arg\max_{r_k} [\log P(W_1, W_2 \mid r_k) + \log P(r_k)]$. If we assume that the word pairs in the cartesian product are independent, $P(W_1, W_2 \mid r_k)$ is equivalent to $\prod_{(w_i, w_j) \in W_1 \times W_2} P((w_i, w_j) \mid r_k)$. The values $P((w_i, w_j) \mid r_k)$ are computed using maximum likelihood estimators, which are smoothed using the Laplace method (Manning and Schütze, 1999).</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> </SectionTitle> <Paragraph position="0"> For each discourse relation pair $r_i, r_j$, we train a word-pair-based classifier using the automatically derived training examples in the Raw corpus, from which we first removed the cue phrases used for extracting the examples. This ensures that our classifiers do not learn, for example, that the word pair if - then is a good indicator of a CONDITION relation, which would simply amount to learning to distinguish between the extraction patterns used to construct the corpus. We test each classifier on a test corpus of 5000 examples labeled with $r_i$ and 5000 examples labeled with $r_j$, which ensures that the baseline is the same for all combinations of $r_i$ and $r_j$, namely 50%.</Paragraph> <Paragraph position="1"> 1 Note that relying on the list of antonyms provided by Wordnet (Fellbaum, 1998) is not enough because the semantic relations in Wordnet are not defined across word class boundaries. For example, Wordnet does not list the &quot;antonymy&quot;-like relation between embargo and legally.</Paragraph>
<Paragraph position="2"> Table 3 shows the performance of all discourse relation classifiers. As one can see, each classifier outperforms the 50% baseline, with some classifiers being as accurate as the one that distinguishes between CAUSE-EXPLANATION-EVIDENCE and ELABORATION relations, which has an accuracy of 93%. We have also built a six-way classifier to distinguish between all six relation types. This classifier has a performance of 49.7%, with a baseline of 16.67%, which is achieved by labeling all relations as CONTRASTS. We also examined the learning curves of various classifiers and noticed that, for some of them, the addition of training examples does not appear to have a significant impact on their performance. For example, the classifier that distinguishes between CONTRAST and CAUSE-EXPLANATION-EVIDENCE relations has an accuracy of 87.1% when trained on 2,000,000 examples and an accuracy of 87.3% when trained on 4,771,534 examples.
We hypothesized that the flattening of the learning curve is explained by the noise in our training data and the vast number of word pairs that are not likely to be good predictors of discourse relations.</Paragraph> <Paragraph position="3"> To test this hypothesis, we decided to carry out a second experiment that used as predictors only a subset of the word pairs in the cartesian product defined over the words in two given text spans.</Paragraph> <Paragraph position="4"> To achieve this, we used the patterns in Table 2 to extract examples of discourse relations from the BLIPP corpus. As expected, the BLIPP corpus yielded far fewer learning cases: 185,846 CON- [...] a simple program that extracted the nouns, verbs, and cue phrases in each sentence/clause. We call these the most representative words of a sentence/discourse unit. For example, the most representative words of the sentence in example (4) are those shown in italics.</Paragraph> <Paragraph position="5"> Italy's unadjusted industrial production fell in January 3.4% from a year earlier but rose 0.4% from December, the government said. (4) We repeated the experiment we carried out in conjunction with the Raw corpus on the data derived from the BLIPP corpus as well. Table 4 summarizes the results.</Paragraph> <Paragraph position="6"> Overall, the performance of the systems trained on the most representative word pairs in the BLIPP corpus is clearly lower than the performance of the systems trained on all the word pairs in the Raw corpus. But a direct comparison between two classifiers trained on different corpora is not fair because with just 100,000 examples per relation, the systems trained on the Raw corpus are much worse than those trained on the BLIPP data.
The learning curves in Figure 1 are illuminating as they show that if one uses as features only the most representative word pairs, one needs only about 100,000 training examples to achieve the same level of performance one achieves using 1,000,000 training examples and features defined over all word pairs. [Figure 1 caption fragment: ... trained on the Raw and BLIPP corpora.]</Paragraph> <Paragraph position="7"> Also, the fact that the learning curve for the BLIPP corpus is steeper than the learning curve for the Raw corpus suggests that discourse relation classifiers trained on most representative word pairs and millions of training examples can achieve higher levels of performance than classifiers trained on all word pairs (unannotated data).</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Relevance to RST </SectionTitle> <Paragraph position="0"> The results in Section 3 indicate clearly that massive amounts of automatically generated data can be used to distinguish between discourse relations defined as discussed in Section 2.2. What the experiments in Section 3 do not show is whether the classifiers built in this manner can be of any use in conjunction with some established discourse theory. To test this, we used the corpus of discourse trees built in the style of RST by Carlson et al. (2001). We automatically extracted from this manually annotated corpus all CONTRAST, CAUSE-EXPLANATION-EVIDENCE, CONDITION and ELABORATION relations that hold between two adjacent elementary discourse units. [Table 5 caption fragment: ... manually labeled RST relations that hold between elementary discourse units. Performance results are shown in bold; baselines are shown in normal fonts.]</Paragraph> <Paragraph position="1"> Since RST (Mann and Thompson, 1988) employs a finer grained taxonomy of relations than we used, we applied the definitions shown in Table 1.
That is, we considered that a CONTRAST relation held between two text spans if a human annotator labeled the relation between those spans as ANTITHESIS, CONCESSION, OTHERWISE or CONTRAST. We then retrained all classifiers on the Raw corpus, but this time without removing from the corpus the cue phrases that were used to generate the training examples. We did this because when trying to determine whether a CONTRAST relation holds between two spans of text separated by the cue phrase &quot;but&quot;, for example, we want to take advantage of the cue phrase occurrence as well. We applied our classifiers to the manually labeled examples extracted from Carlson et al.'s corpus (2001). Table 5 displays the performance of our two-way classifiers for relations defined over elementary discourse units. The table displays in the second row, for each discourse relation, the number of examples extracted from the RST corpus. For each binary classifier, the table lists in bold the accuracy of our classifier and in non-bold font the majority baseline associated with it.</Paragraph> <Paragraph position="2"> The results in Table 5 show that the classifiers learned from automatically generated training data can be used to distinguish between certain types of RST relations. For example, the results show that the classifiers can be used to distinguish between</Paragraph> </Section> </Paper>