<?xml version="1.0" standalone="yes"?> <Paper uid="C04-1034"> <Title>Resolving Individual and Abstract Anaphora in Texts and Dialogues</Title> <Section position="5" start_page="7" end_page="8" type="evalu"> <SectionTitle> 4 Tests and Evaluation </SectionTitle> <Paragraph position="0"> We have manually tested dar on randomly chosen texts and dialogues from our collections.</Paragraph> <Paragraph position="1"> The performance of dar on dialogues has been compared with that of es00. The function for resolving IPAs (ResolveIpa) has similarly been tested on texts from which APAs were excluded. We have compared its results with those obtained by testing bfp (Brennan et al., 1987) and str98 (Strube, 1998).</Paragraph> <Paragraph position="2"> In all tests the intrasentential anaphors were manually resolved, and expletive and cataphoric uses of pronouns were marked and excluded from the test. Dialogue act units were marked and classified by three annotators following Eckert and Strube (2000). The reliability of the two annotation tasks (kappa statistic; Carletta, 1996) was 0.94 and 0.90, respectively. Pronominal anaphors were marked, classified and resolved by two annotators. The kappa statistic for the pronoun classification was 0.86. In the few cases (one in the texts and two in the dialogues) where the annotators did not agree on the resolution, the pronouns were marked as ambiguous and excluded from the test. The results obtained for bfp and str98 are given in table 1, while the results of dar's ResolveIpa are given in table 2. In the tables, CR stands for &quot;correctly resolved&quot;, HR for &quot;resolved by humans&quot;, RA for &quot;resolved over all&quot;, P for precision and R for recall.</Paragraph> <Paragraph position="3"> Because dar both classifies and resolves anaphors, both precision and recall (with respect to human resolution) are given in table 2. 
The results indicate that ResolveIpa performs significantly better than bfp and str98 on the Danish texts. The better performance of dar is due to its account of focal and parallelism preferences and of the different reference mechanisms of personal and demonstrative pronouns, and to its enlarged resolution scope. Furthermore, dar recognises some generic and inferable pronouns and excludes them from resolution, but it often fails to recognise antecedentless and inferable plural pronouns, because it finds a plural nominal in the preceding discourse and proposes it as the antecedent. The lack of commonsense knowledge explains many incorrectly resolved anaphors. The results of testing the dar algorithm on written texts are in table 3. These results are good compared with the results of the function ResolveIpa (table 2).</Paragraph> <Paragraph position="4"> The discriminating rules correctly identify IPAs and APAs in the large majority of cases.</Paragraph> <Paragraph position="5"> Recognition failures often involve pronouns in contexts that are not covered by the discriminating rules. In particular, dar fails to resolve singular neuter gender pronouns with distant antecedents and to identify vague anaphors, because it always &quot;finds&quot; an antecedent in the context ranking. Correct resolution in these cases requires a deep analysis of the context. The results of applying dar and es00 to Danish dialogues are reported in table 4.</Paragraph> <Paragraph position="6"> The results of the tests indicate that dar resolves IPAs significantly better than es00 (which uses str98). We extended es00 with the Danish-specific identification rules before applying it.</Paragraph> <Paragraph position="7"> dar correctly resolves more Danish demonstrative pronouns than es00, because it accounts for language-specific particularities. 
In general, however, the resolution results for APAs are similar to those obtained for es00. This is not surprising, because dar uses the same resolution strategy on these pronouns. dar performs better on texts than on dialogues, which reflects the more complex nature of dialogues. The results indicate that the IPA/APA discriminating rules also work well on dialogues. The cases of resolution failure were the same as for the texts. As an experiment, we applied dar to the dialogues without relying on the predefined dialogue structure. In this test the recognition of IPAs and APAs was still good; however, the success rate was 60.1% for IPAs and only 39.3% for APAs. Many errors were due to the fact that antecedents were searched for in the preceding discourse in linear order and that ungrounded utterances were included in the discourse model.</Paragraph> </Section> </Paper>