<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0412"> <Title>An Evaluation of Anaphor Generation in Chinese</Title> <Section position="6" start_page="113" end_page="118" type="evalu"> <SectionTitle> 4 Results </SectionTitle>

<Paragraph position="0"> In this section, we examine the results of the comparisons made in the last section. The comparison results are shown in Table 1. The average matching rates over all test texts are 72%, 74% and 76% for TR1, TR2 and TR3 respectively.</Paragraph>

<Paragraph position="1"> These average matching rates, however, are lower than the matching rate of about 92% that we obtained in the empirical studies described previously (YM94; Yeh95). The difference is partly because the test texts used in the earlier comparison were human-created, while the test texts used here are machine-generated. The grammatical structures of the machine-generated texts are simplified; they are not as sophisticated as those of human texts. When the speakers were asked to decide their preferences for anaphors in the machine-generated texts, they may have found less complete information in the test texts than they are used to when creating their own texts, and hence it may have been difficult for them to make their decisions. In the empirical study, the human-created texts perhaps provided more complete information for the hypothetical machine to decide on an appropriate anaphoric form.</Paragraph>

<Paragraph position="4"> A more important reason why the matching rates are lower than before could be that in some circumstances there is more than one acceptable solution, and the speakers may not always choose the same one as the machine.</Paragraph>

<Paragraph position="5"> This hypothesis can be investigated by looking at the extent to which the speakers agree among themselves. To see this, we further compared the speakers' annotations with one another, as summarised below.</Paragraph>

<Paragraph position="6">
    for each speaker i:
        for each text j:
            compare i's annotations for text j with those of every other speaker
            and note down the average number of matches

The comparison result is shown in Table 2. For each speaker, the number for each test text is the average number of matches with the other eleven speakers. For example, Speaker 1 receives, on average, 8 matches for Text 2. At the end of the table are the averages for the speakers' agreement among themselves. The figures in the table show that the speakers do not reach agreement among themselves on the use of anaphors in this test. These figures are further supported by the kappa statistic, a standard measure of agreement between a set of judges (SC88). The overall kappa value for all speakers is about 0.41, whereas a value of 0.8 or over would normally be required for good evidence of agreement. The measure of agreement gets worse if only the zero/pronoun/nominal distinction is considered, or if zero and non-zero pronouns are lumped together. Only two speakers agree with one another with a kappa value of more than 0.7 (none with a value greater than 0.8). The speakers as a whole agreed with kappa greater than 0.7 on 30 out of the 80 anaphors, with complete agreement only 14 times. To get an overall agreement of greater than 0.8 would require reducing the set of speakers from 12 to a carefully selected 3.</Paragraph>
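<Paragraph> As a concrete illustration of the comparison loop above and of a multi-rater kappa computation, the following is a minimal Python sketch. It assumes that, for a given test text, the annotations are held as one list of anaphoric-form labels per speaker, one label per anaphor; the data layout, function names and labels are illustrative assumptions, not taken from the paper. The kappa variant sketched is the Fleiss-style statistic for a fixed set of judges, in the spirit of Siegel and Castellan (SC88).

    from collections import Counter

    def average_pairwise_matches(annotations):
        """For each speaker, the average number of anaphors on which their
        labels match each other speaker's (the quantity reported in Table 2)."""
        n = len(annotations)
        result = []
        for i, mine in enumerate(annotations):
            matches = sum(
                sum(a == b for a, b in zip(mine, other))
                for j, other in enumerate(annotations) if j != i
            )
            result.append(matches / (n - 1))
        return result

    def multi_rater_kappa(annotations):
        """Kappa for a fixed set of raters labelling the same items."""
        n_raters, n_items = len(annotations), len(annotations[0])
        per_item = [Counter(s[j] for s in annotations) for j in range(n_items)]
        # Observed agreement: average pairwise agreement per item.
        p_obs = sum(
            (sum(c * c for c in cnt.values()) - n_raters)
            / (n_raters * (n_raters - 1))
            for cnt in per_item
        ) / n_items
        # Chance agreement from the overall proportion of each category.
        totals = Counter()
        for cnt in per_item:
            totals.update(cnt)
        p_exp = sum((v / (n_raters * n_items)) ** 2 for v in totals.values())
        return (p_obs - p_exp) / (1 - p_exp)

With twelve speakers and the 80 anaphors pooled, annotations would be a 12-by-80 table of labels such as zero, pronoun and nominal, and multi_rater_kappa(annotations) would correspond to the overall figure of about 0.41 reported above.</Paragraph>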
<Paragraph position="7"> As shown in Fig. 4, the anaphors in Text 1 form a &quot;topic chain&quot;4 within a single &quot;sentence&quot;. These anaphors are all zeroed according to the locality conditions and syntactic constraints in the three test rules. All three systems produce the same result for Text 1 and hence, unsurprisingly, all three have the same matching rate, 90%, as shown in Table 1.</Paragraph>

4 A &quot;topic chain&quot; is a situation where a referent is referred to in the first clause, and then several more clauses follow talking about the same referent, namely, the topic.

<Paragraph position="8"> Text 2 similarly contains a single &quot;sentence&quot;, but has three topic shifts in addition to &quot;topic chains&quot; within the &quot;sentence&quot;, as shown in Fig. 4. Since no discourse segment boundaries occur within the &quot;sentence&quot;, the discourse segment boundary constraint in TR2 has no effect on this test text, which means that TR1 and TR2 produce the same output. However, there are three topic shifts within the &quot;sentence&quot;, namely between clauses 5 and 6, 8 and 9, and 10 and 11, as shown in Fig. 4. These shifts make the rule containing the salience constraint, TR3, produce different output from the rules without this constraint. TR1 and TR2 obtain the same matching rate, 70%, while TR3 obtains a higher matching rate, 79%, which shows the effectiveness of its salience constraint.</Paragraph>

<Paragraph position="9"> We then examine another middle-sized test text, Text 3, which is broken into three &quot;sentences&quot;, as shown in Fig. 4. In our implementation, the beginning of a &quot;sentence&quot; is the beginning of a discourse segment (Yeh95). Furthermore, three topic shifts occur in Text 3, namely between clauses 8 and 9, 10 and 11, and 11 and 12. The constraint on discourse segment beginnings in TR2 and TR3, and the salience constraint in TR3, would therefore have some effect on the output texts. The matching rate, as shown in Table 1, increases from 62% for TR1 to 66% for TR2, which shows that the constraint on discourse segment beginnings in TR2 is effective. TR3 obtains a 65% matching rate on average, which is 1% lower than its predecessor TR2. However, this decrease in average matching rate does not negate the effectiveness of the salience constraint in TR3. TR2's text differs from TR3's at the three topic shifts: TR2 generates zero anaphors for these shifts, while TR3 generates full descriptions. The speakers varied greatly in choosing anaphoric forms for these topic shifts: among the twelve speakers, four chose all full descriptions, three used all zero anaphors, and the other five chose a mixture of zero, pronominal and nominal anaphors. Thus, four of the twelve speakers completely agree with TR3, while three agree with TR2. This shows that the salience constraint in TR3 is still effective. We then examine the more complicated texts, Texts 4 and 5. As shown in Table 1, the increases in matching rates show the effectiveness of the constraint on discourse segment beginnings in TR2. Again, the average matching rates of TR3 are slightly lower than those of TR2 for these two texts. However, as with Text 3, the speakers agree to varying degrees on the choice of anaphors for the topic shifts in these two texts. For Text 4, three speakers completely agree with TR2 and one with TR3. As for Text 5, two speakers completely agree with TR2, while the others partly agree with TR2 and TR3.</Paragraph>
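<Paragraph> The layering of the three test rules discussed above can be made concrete with a short sketch. The Clause representation and the tests below are hypothetical stand-ins for the locality, syntactic, discourse-segment and salience conditions defined earlier in the paper, not the actual implementation; adjacency is used as a crude proxy for the locality and syntactic constraints.

    from dataclasses import dataclass

    @dataclass
    class Clause:
        index: int            # position of the clause in the text
        topic: str            # the referent the clause is about
        segment_start: bool   # does a new discourse segment begin here?

    def choose_anaphor(clause: Clause, antecedent: Clause, rule: int) -> str:
        """Choose an anaphoric form under TR1 (rule=1), TR2 (2) or TR3 (3)."""
        # All three rules: a zero anaphor is allowed only when the antecedent
        # is local (here, in the immediately preceding clause).
        if clause.index - antecedent.index > 1:
            return "nominal"
        # TR2 and TR3 add: no zero anaphor at the beginning of a discourse
        # segment (segments begin at "sentence" boundaries).
        if rule >= 2 and clause.segment_start:
            return "nominal"
        # TR3 adds: after a topic shift the antecedent loses salience, so a
        # full description is generated instead of a zero anaphor.
        if rule >= 3 and clause.topic != antecedent.topic:
            return "full description"
        return "zero"

Under this reading, at a topic shift such as the one between clauses 5 and 6 of Text 2, TR1 and TR2 would still generate a zero anaphor while TR3 would generate a full description, matching the behaviour described above.</Paragraph>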
<Paragraph position="10"> The above discussion shows that the salience constraint in TR3 is sometimes effective in producing small improvements in the output texts.</Paragraph>

<Paragraph position="11"> This variation also reflects a difference between the speakers' concept of salience and the one used by TR3. In brief, the more sophisticated the constraints a rule contains, the better it performs. Both TR2 and TR3 perform better than TR1. TR3 performs better than TR2 for texts with a simple discourse segment structure, while for texts with complicated discourse segment structures TR2 is slightly better than TR3 in average matching rate. Adding the results of the rules to those of the speakers leads to a slight decrease in kappa for TR1, but progressively better (though only from 0.41 to 0.43) kappa values for TR2 and TR3. This indicates that the better rules seem to disagree with the speakers no more than the speakers disagree among themselves. There are 9 anaphors for which the kappa score including TR3 is less than that for the speakers alone (in many other cases the results are better).</Paragraph>

<Paragraph position="12"> These seem to involve places where the speakers were more willing to use a zero pronoun (where the system used a reduced nominal anaphor), and places where the speakers reduced nominal anaphors less than the system did.</Paragraph>
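<Paragraph> The rule-as-extra-judge comparison described above can be expressed directly with the earlier kappa sketch; the variable names here are illustrative assumptions.

    # Hypothetical usage of multi_rater_kappa from the earlier sketch:
    # a rule's output for the anaphors is appended as a thirteenth annotator.
    kappa_speakers = multi_rater_kappa(speaker_annotations)                 # about 0.41
    kappa_with_tr3 = multi_rater_kappa(speaker_annotations + [tr3_output])  # about 0.43
</Paragraph> </Section> </Paper>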