<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1125">
  <Title>Discourse Parsing: A Decision Tree Approach</Title>
  <Section position="6" start_page="220" end_page="221" type="evalu">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="220" end_page="220" type="sub_section">
      <SectionTitle>
3.1 Data
</SectionTitle>
      <Paragraph position="0"> To evaluate our method, we have done a set of experiments, using data from a Japanese economics daily (Nihon-Keizai-Shimbun-Sha, 1995). They consist of 645 articles of diverse text types (prose, narrative, news report, expository text, editorial, etc.), which are randomly drawn from the entire set of articles published during the year. Sentences and paragraphs contained in the data set totalled 12,770 and 5,352, respectively. We had, on the average, 984.5 characters, 19.2 sentences, and 8.2 paragraphs, for one article in the data. Each sentence in an article was annotated with a link to its associated modifyee sentence. Annotations were given manually by the first author. Each sentence was associated with exactly one sentence.</Paragraph>
      <Paragraph position="1"> In assigning a link tag to a sentence, we did not follow any specific discourse theories such as Rhetorical Structure Theory (Mann and Thompson, 1987).</Paragraph>
      <Paragraph position="2"> This was because they often do not provide information on discourse relations detailed enough to serve as tagging guidelines. In the face of this, we fell back on our intuition to determine which sentence links with which. Nonetheless, we followed an informal rule, motivated by a linguistic theory of cohesion by Halliday and Hasan (1990): which says that we relate a sentence to one that is contextually most relevant to it, or one that has a cohesive link with it. This included not only rhetorical relationships such as 'reason', 'cause-result', 'elaboration', 'justification' or 'background' (Mann and Thompson, 1987), but also communicative relationships such as 'question-answer' and those of the 'initiativeresponse' sort (Fox, 1987; Levinson, 1994; Carletta et al., 1997).</Paragraph>
      <Paragraph position="3"> Since the amount of data available at the time of the experiments was rather moderate (645 articles), we decided to resort to a test procedure known as cross-validation. The following is a quote from Quinlan (1993).</Paragraph>
      <Paragraph position="4"> &amp;quot;In this procedure, the available data is divided into N blocks so as to make each block's number of cases and class distribution as uniform as possible. N different classification models are then built, in each of which one block is omitted from the training data, and the resulting model is tested on the cases in that omitted block.&amp;quot; The average performance over the N tests is supposed to be a good predictor of the performance of a model built from all the data. It is common to set N=IO.</Paragraph>
      <Paragraph position="5"> However, we are concerned here with the accuracy of dependency parses and not with that of class decisions by decision tree models. This requires some modification to the way the validation procedure is applied to the data. What we did was to apply the procedure not on the set of cases as in C4.5, but on the set of articles. We divided the set of articles into 10 blocks in such a way that each block contains as uniform a number of sentences as possible. The procedure would make each block contain a uniform number of correct dependencies. (Recall that every sentence in an article is manually annotated with exactly one link. So the number of correct links equals that of sentences.) The number of sentences in each block ranged from 1,256 to 1,306.</Paragraph>
      <Paragraph position="6"> The performance is rated for each article in the test set by using a metric: number of correct dependencies retrieved precision = total number of dependencies retrieved At each validation step, we took an average performance score for articles in the test set as a precision of that step's model. Results from 10 parsing models were then averaged to give a summary figure.</Paragraph>
    </Section>
    <Section position="2" start_page="220" end_page="221" type="sub_section">
      <SectionTitle>
3.2 Results and Analyses
</SectionTitle>
      <Paragraph position="0"> We list major results of the experiments in Table 3, The results show that clues are not of much help to improve performance. Indeed we get the best result of 0.642 when N = 0, i.e., the model does not use clues at all. We even find that an overall performance tends to decline as models use more Of the words in the corpus as clues. It is somewhat tempting to take the results as indicating that clues have bad effects on the performance (more discussion on this later). This, however, appears to run counter to what we expect from results reported in prior work on discourse(Kurohashi and Nagao, 1994; Litman and Passonneau, 1995; Grosz and Sidner, 1986; Marcu, 1997), where the notion of clues or cue phrases forms an important part of identifying a structure of discourse7 Table 4 shows how the confidence value (CF) affects the performance of discourse models. The CF 7 One problem with earlier work is that evaluations are done on very small data; 9 sections from a scientific writing (approx. 300 sentences) (Kurohashi and Nagao, 1994): 15 narrathes (I113 clauses) (Lhman and Passonneau. 1995): 3 texts (Marcu, 1997). It is not clear how reliable estimates of performance obtained there would be.</Paragraph>
      <Paragraph position="1">  parentheses represent the ratio of improvements against a model with N = 0.  represents the extent to which a decision tree is pruned; A small CF leads to a heavy pruning of a tree. The tree pruning is a technique by which to prevent a decision tree from fitting training data too closely. The problem of a model fitting data too closely or overfitting usually causes an increase of errors on unseen data. Thus a heavier pruning of a tree would result in a more general tree.</Paragraph>
      <Paragraph position="2"> While Haruno (1997) reports that a less pruning produces a better performance for Japanese sentence parsing with a decision tree, results we got in Table 4 show that this is not true with discourse parsing. In Haruno (1997), the performance improves by 1.8% from 82.01% (CF = 25%) to 83.35% (CF = 95%).</Paragraph>
      <Paragraph position="3"> 25% is the default value for CF in C4.5, which is generally known to be the best CF level in machine learning. Table 4 shows that this is indeed the case: we get a best performance at around CF = 25% for all the values of N.</Paragraph>
      <Paragraph position="4"> Let us turn to effects that each feature might have on the model's performance. For each feature, we removed it from the model and trained and tested the model on the same set of data as before the removal.</Paragraph>
      <Paragraph position="5"> Results are summarized in Table 5. It was found that, of the features considered, DistSen, which encodes a distance between two sentences, contributes most to the performance; at N = 0, its removal caused as much as an 8.62% decline in performance.</Paragraph>
      <Paragraph position="6"> On the other hand, lexical features Sire and Sire2 made little contribution to the overall performance; their removal even led to a small improvement in some cases, which seems consistent with the earlier observation that lexical features are a poor class predictor. null To further study effects of lexical clues, we have run another experiment where clues are limited to sentence connectives (as identified by a tokenizer program). Clues included any connective that has an occurrence in the corpus, which is listed in Table 6. Since a sentence connective is relevant to establishing inter-sententiaI relationships, it was expected that restricting clues to connectives would improve performance. As with earlier experiments, we have run a 10-fold cross validation experiment on the corpus, with 52 attributes for lexical clues. We found that the accuracy was 0.642. So it turned out that using connectives is no better than when we do not use clues at all.</Paragraph>
      <Paragraph position="7"> Figure 5 gives a graphical summary of the significance of features in terms of the ratio of improvement after their removal (given as parenthetical figures in Table 5). Curiously, while the absence of the DistSen feature caused a largest decline, the significance of a feature tends to diminish with the growth of N. The reason, we suspect, may have to do with the susceptibility of a decision tree model to irrelevant features, particularly when their number is large. But some more work needs to be done before we can say anything about how irrelevancy affects a parser's performance.</Paragraph>
      <Paragraph position="8"> One caveat before leaving the section; the experiments so far did not establish any correlation, either positive or negative, between the use of lexical information and the performance on discourse parsing.</Paragraph>
      <Paragraph position="9"> To say anything definite would probably require experiments on a corpus much larger than is currently available. However, it would be safe to say that distance and length features are more prominent than lexical features when a corpus is relatively small.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>