File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/06/p06-1005_evalu.xml

Size: 4,486 bytes

Last Modified: 2025-10-06 13:59:39

<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1005">
  <Title>Bootstrapping Path-Based Pronoun Resolution</Title>
  <Section position="8" start_page="38" end_page="39" type="evalu">
    <SectionTitle>
6 Results and Discussion
</SectionTitle>
    <Paragraph position="0"> We compare the accuracy of various configurations of our system on the ANC, AQT and MUC datasets (Table 5). We include the score from picking the noun immediately preceding the pronoun (after our hard filters are applied). Due to the hard filters and limited search window, it is not possible for our system to resolve every noun to a correct antecedent. We thus provide the performance upper bound (i.e. the proportion of cases with a correct answer in the filtered candidate list). On ANC and AQT, each of the probabilistic features results in a statistically significant gain in performance over a model trained and tested with that feature absent. On the smaller MUC set, none of the differences in 3-6 are statistically significant; however, the relative contribution of the various features remains reassuringly constant.</Paragraph>
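The previous-noun baseline and the performance upper bound described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `Instance` record, its field names, and the toy data are hypothetical, and candidates are assumed to be listed nearest-first after the hard filters have been applied.

```python
from dataclasses import dataclass
from typing import List, Set


@dataclass
class Instance:
    """One pronoun with its filtered candidate list (hypothetical layout)."""
    candidates: List[str]  # candidates surviving the hard filters, nearest first
    correct: Set[str]      # candidates that are true antecedents


def previous_noun_accuracy(instances):
    """Fraction of pronouns resolved correctly by picking the nearest candidate."""
    hits = sum(1 for inst in instances
               if inst.candidates and inst.candidates[0] in inst.correct)
    return hits / len(instances)


def upper_bound(instances):
    """Fraction of pronouns with a correct candidate left after filtering."""
    reachable = sum(1 for inst in instances
                    if any(c in inst.correct for c in inst.candidates))
    return reachable / len(instances)


# Toy data (invented for illustration only).
toy = [
    Instance(["board", "chairman"], {"chairman"}),
    Instance(["Mary"], {"Mary"}),
    Instance(["report"], {"John"}),  # true antecedent was filtered out
    Instance([], {"company"}),       # nothing survived the filters
]
print(previous_noun_accuracy(toy))  # 0.25
print(upper_bound(toy))             # 0.5
```

No system restricted to the filtered candidate list can exceed the upper bound, which is why it is a useful reference point next to the raw accuracy figures.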
    <Paragraph position="1"> Aside from missing antecedents due to the hard filters, the main sources of error include inaccurate statistical data and a classifier bias toward preceding pronouns of the same gender/number. It would be interesting to see whether performance could be improved by adding WordNet and web-mined features. Path coreference itself could conceivably be determined with a search engine.</Paragraph>
    <Paragraph position="2"> Gender is our most powerful probabilistic feature. In fact, inspecting our system's decisions, gender often rules out coreference regardless of path coreference. This is not surprising, since we based the acquisition of C(p) on gender.</Paragraph>
    <Paragraph position="3"> That is, our bootstrapping assumption was that the majority of times these paths occur, gender indicates coreference or lack thereof. Thus when they occur in our test sets, gender should often sufficiently indicate coreference. Improving the orthogonality of our features remains a future challenge.</Paragraph>
    <Paragraph position="4"> Nevertheless, note the decrease in performance on each of the datasets when C(p) is excluded (#5). This is compelling evidence that path coreference is valuable in its own right, beyond its ability to bootstrap extensive and reliable gender data. Finally, we can add ourselves to the camp of people claiming semantic compatibility is useful for pronoun resolution. Both the MI from the pronoun in the antecedent's context and vice versa result in improvement. Building a model from enough text may be the key.</Paragraph>
    <Paragraph position="5"> The primary goal of our evaluation was to assess the benefit of path coreference within a competitive pronoun resolution system. Our system does, however, outperform previously published results on these datasets. Direct comparison of our scoring system to other current top approaches is made difficult by differences in preprocessing.</Paragraph>
    <Paragraph position="6"> Ideally we would assess the benefit of our probabilistic features using the same state-of-the-art preprocessing modules employed by others such as (Yang et al., 2005) (who additionally use a search engine for compatibility scoring). Clearly, promoting competitive evaluation of pronoun resolution scoring systems by giving competitors equivalent real-world preprocessing output along the lines of (Barbu and Mitkov, 2001) remains the best way to isolate areas for system improvement.</Paragraph>
    <Paragraph position="7"> Our pronoun resolution system is part of a larger information retrieval project where resolution accuracy is not necessarily the most pertinent measure of classifier performance. More than one candidate can be useful in ambiguous cases, and not every resolution need be used. Since the SVM ranks antecedent candidates, we can test this ranking by selecting more than the top candidate (Top-n) and evaluating coverage of the true antecedents. We can also resolve only those instances where the most likely candidate is above a certain distance from the SVM threshold. Varying this distance varies the precision-recall (PR) of the overall resolution. A representative PR curve for the Top-n classifiers is provided (Figure 2). The corresponding information retrieval performance can now be evaluated along the Top-n / PR configurations.</Paragraph>
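The Top-n and precision-recall evaluation described above might be computed along the following lines. This is a sketch under stated assumptions: each instance is taken to be a list of (SVM score, is-correct) pairs, the decision threshold is assumed to be 0, recall is assumed to be measured over all pronouns, and the function names and toy scores are hypothetical rather than taken from the paper.

```python
def top_n_coverage(instances, n):
    """Fraction of pronouns whose n best-scoring candidates include a true antecedent."""
    covered = 0
    for cands in instances:
        ranked = sorted(cands, key=lambda c: c[0], reverse=True)
        if any(is_correct for _, is_correct in ranked[:n]):
            covered += 1
    return covered / len(instances)


def precision_recall(instances, margin):
    """Resolve a pronoun only if its best score is at least `margin` above 0."""
    resolved = correct = 0
    for cands in instances:
        if not cands:
            continue  # no candidates, so the pronoun is left unresolved
        score, is_correct = max(cands, key=lambda c: c[0])
        if score >= margin:
            resolved += 1
            correct += int(is_correct)
    # Convention (an assumption): precision is 1.0 when nothing is resolved.
    precision = correct / resolved if resolved else 1.0
    recall = correct / len(instances)
    return precision, recall


# Toy scores (invented for illustration only).
toy = [
    [(1.2, True), (0.3, False)],
    [(0.4, False), (0.1, True)],
    [(0.45, True)],
    [(0.2, False), (-0.5, False)],
]
print(top_n_coverage(toy, 1), top_n_coverage(toy, 2))  # 0.5 0.75
print(precision_recall(toy, 0.0))  # (0.5, 0.5)
print(precision_recall(toy, 0.5))  # (1.0, 0.25)
```

Sweeping `margin` over a range of values traces out one PR curve, and repeating the sweep with a Top-n notion of correctness would give one curve per n.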
  </Section>
</Paper>