<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0612">
  <Title>An Expectation Maximization Approach to Pronoun Resolution</Title>
  <Section position="8" start_page="92" end_page="94" type="evalu">
    <SectionTitle>
5 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="92" end_page="93" type="sub_section">
      <SectionTitle>
5.1 Validation of unsupervised method
</SectionTitle>
      <Paragraph position="0"> The key concern of our work is whether enough useful information is present in the pronoun's category, context, and candidate list for unsupervised learning of antecedents to occur. To that end, our first set of experiments compares the pronoun resolution accuracy of our EM-based solutions to that of a previous-noun baseline on our test key. The results are shown in Table 2. The columns split the results into three cases: all pronouns with no exceptions; all cases where the pronoun was found in a sentence containing no quotation marks (and therefore resembling the training data provided to EM); and finally all pronouns excluded by the second case. We compare the following methods:  1. Previous noun: Pick the candidate from the filtered list with the lowest j value.</Paragraph>
      <Paragraph position="1"> 2. EM, no initializer: The EM algorithm trained on the test set, starting from a uniform E-step.</Paragraph>
      <Paragraph position="2"> 3. Initializer, no EM: A model that ranks candidates using only a pronoun model built from unambiguous cases (Section 3.5).</Paragraph>
      <Paragraph position="3"> 4. EM w/ initializer: As in (2), but using the initializer in (3) for the first E-step.</Paragraph>
      <Paragraph position="4"> 5. Maxent extension: The models produced by (4) are used as features in a log-linear model trained on the development key (Section 3.6).</Paragraph>
      <Paragraph position="5"> 6. Upper bound: The percentage of cases with a  correct answer in the filtered candidate list.</Paragraph>
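The previous-noun baseline (method 1) can be sketched in a few lines. This is an illustrative reconstruction, not the paper's implementation; the representation of a candidate as a (j, noun) pair, where j is the candidate's distance rank from the pronoun, is an assumption made for the sketch.

```python
# Hypothetical sketch of the previous-noun baseline: from the filtered
# candidate list, pick the candidate with the lowest distance index j,
# i.e. the noun nearest the pronoun. Data shapes are invented.

def previous_noun(candidates):
    """candidates: list of (j, noun) pairs; j = distance rank from the pronoun."""
    if not candidates:
        return None
    # min over j selects the closest preceding candidate
    return min(candidates, key=lambda c: c[0])[1]

print(previous_noun([(3, "company"), (1, "report"), (7, "analyst")]))  # -> report
```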
      <Paragraph position="6"> For a reference point, picking the previous noun before applying any of our candidate filters receives an accuracy score of 0.281 on the &amp;quot;All&amp;quot; task. Looking at the &amp;quot;All&amp;quot; column in Table 2, we see that EM can indeed learn in this situation. Starting from uniform parameters, it climbs from a 40% baseline to a 60% accurate model. However, the initializer can do slightly better with precise but sparse gender/number information alone. As we hoped, combining the initializer and EM results in a statistically significant improvement over EM with a uniform starting point, but it is not significantly better than the initializer alone. The advantage of the EM process is that it produces multiple models, which can be re-weighted with maximum entropy to reach our highest accuracy, roughly 67%. The λ weights that achieve this score are shown in Table 3.</Paragraph>
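The maximum-entropy re-weighting of the component models (method 5) amounts to a log-linear combination of their probabilities. The sketch below is a minimal illustration of that scoring rule; the model names, probabilities, and weights are invented for the example and are not the values in Table 3.

```python
import math

# Illustrative log-linear (maxent) combination: each component model m
# assigns a probability to a candidate, and learned weights lambda_m
# scale the log-probabilities. All numbers here are hypothetical.

def loglinear_score(model_probs, weights):
    """model_probs, weights: dicts keyed by component-model name."""
    return sum(weights[m] * math.log(p) for m, p in model_probs.items())

def best_candidate(candidates, weights):
    """candidates: list of (name, model_probs) pairs; returns the top name."""
    return max(candidates, key=lambda c: loglinear_score(c[1], weights))[0]

weights = {"pronoun": 1.0, "context": 0.1}  # invented weights
cands = [
    ("report",  {"pronoun": 0.7, "context": 0.2}),
    ("company", {"pronoun": 0.4, "context": 0.5}),
]
print(best_candidate(cands, weights))  # -> report
```

With a large weight on the pronoun model and a small one on the context model, the pronoun model dominates the ranking, mirroring the pattern the paper reports in Table 3.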
      <Paragraph position="7"> Maximum entropy leaves the pronoun model Pr(p|l) nearly untouched and drastically reduces the  influence of all other models (Table 3). This, combined with the success of the initializer alone, leads us to believe that a strong notion of gender/number is very important in this task. Therefore, we implemented EM with several models that used only pronoun category, but none were able to surpass the initializer in accuracy on the test key. One factor that might help explain the initializer's success is that despite using only a PrU(p|l) model, the initializer also has an implicit factor resembling a Pr(l) model: when two candidates agree with the category of the pronoun, add-1 smoothing ensures the more frequent candidate receives a higher probability.</Paragraph>
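The implicit frequency effect of add-1 smoothing described above can be verified with a small calculation. This is a sketch under assumed counts: with C pronoun categories, the smoothed estimate is Pr(p|l) = (c(p,l) + 1) / (c(l) + C), so when two candidates always match the pronoun's category, the more frequent one sits closer to 1. The value of C and the counts below are illustrative.

```python
# Add-1 (Laplace) smoothed conditional probability of pronoun category p
# given lexeme l. C is the number of pronoun categories (assumed value).

C = 4  # illustrative category count

def smoothed_prob(count_pl, count_l):
    """count_pl = c(p, l); count_l = c(l)."""
    return (count_pl + 1) / (count_l + C)

# Both candidates always agree with the pronoun's category, but one is
# 50x more frequent; smoothing favors the frequent candidate.
frequent = smoothed_prob(100, 100)  # ~0.971
rare = smoothed_prob(2, 2)          # 0.5
print(frequent > rare)  # -> True
```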
      <Paragraph position="8"> As was stated in Section 3.4, sentences with quotations were excluded from the learning process because the presence of a correct antecedent in the candidate list was less frequent in these cases. This is validated by the low upper bound of 0.754 in the only-quote portion of the test key. We can see that all methods except for the previous noun heuristic score noticeably better when ignoring those sentences that contain quotation marks. In particular, the differences between our three unsupervised solutions ((2), (3) and (4)) are more pronounced. Much of the performance improvement that corresponds to our model refinements is masked in the overall task because adding the initializer to EM does not improve EM's performance on quotes at all. Developing a method to construct more robust candidate lists for quotations could improve our performance on these cases and greatly increase the percentage of pronouns we are training on for a given corpus.</Paragraph>
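The quotation filter used to select EM training sentences is simple to state: any sentence containing quotation marks is dropped. A minimal sketch, assuming plain-text sentences and checking both ASCII and curly quote characters (the exact character set used by the paper is not specified):

```python
# Hypothetical version of the Section 3.4 training filter: exclude any
# sentence containing quotation marks, since the true antecedent is less
# often in the candidate list for such sentences.

QUOTE_CHARS = ('"', '\u201c', '\u201d')  # assumed character set

def contains_quotation(sentence):
    return any(q in sentence for q in QUOTE_CHARS)

sentences = [
    'The company filed its report.',
    'He said "the report is late."',
]
training = [s for s in sentences if not contains_quotation(s)]
print(len(training))  # -> 1
```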
    </Section>
    <Section position="2" start_page="93" end_page="93" type="sub_section">
      <SectionTitle>
5.2 Comparison to supervised system
</SectionTitle>
      <Paragraph position="0"> We put our results in context by comparing our methods to a recent supervised system. The comparison system is an SVM that uses 52 linguistically-motivated features, including probabilistic gender/number information obtained through web queries (Bergsma, 2005a). The SVM is trained with 1398 separate labeled pronouns, the same training set used in (Bergsma, 2005a). This data is also drawn from the news domain. Note the supervised system was not constructed to handle all pronoun cases, so non-anaphoric pronouns were removed from the test key and from the candidate lists in the test key to ensure a fair comparison. As expected, this removal of difficult cases increases the performance of our system on the test key (Table 4).</Paragraph>
      <Paragraph position="1"> Also note that there is no significant difference in performance between our supervised extension and the SVM. The completely unsupervised EM system performs worse, but with only a 7% relative reduction in performance compared to the SVM; the previous noun heuristic shows a 44% reduction.</Paragraph>
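The relative-reduction figures quoted above follow the usual convention of measuring the accuracy gap as a fraction of the reference system's accuracy. A one-line sketch, with illustrative accuracy values rather than the paper's exact Table 4 numbers:

```python
# Relative reduction in performance of a system versus a reference:
# (reference - system) / reference. The accuracies below are invented
# stand-ins, chosen only to make the arithmetic concrete.

def relative_reduction(reference_acc, system_acc):
    return (reference_acc - system_acc) / reference_acc

print(round(relative_reduction(0.80, 0.744), 2))  # -> 0.07, i.e. a 7% relative reduction
```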
    </Section>
    <Section position="3" start_page="93" end_page="94" type="sub_section">
      <SectionTitle>
5.3 Analysis of upper bound
</SectionTitle>
      <Paragraph position="0"> If one accounts for the upper bound in Table 2, our methods do very well on those cases where a correct answer actually appears in the candidate list: the best EM solution scores 0.754, and the supervised extension scores 0.800. A variety of factors result in the 196 candidate lists that do not contain a true antecedent. 21% of these errors arise from our limited candidate window (Section 3.1). Incorrect pleonastic detection accounts for another 31%, while non-noun referential pronouns cause 25% (Section 3.3).</Paragraph>
      <Paragraph position="1"> Linguistic filters (Section 3.4) account for most of the remainder. An improvement in any of these components would result in not only higher final scores, but cleaner EM training data.</Paragraph>
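"Accounting for the upper bound" here means conditioning accuracy on the cases where a correct answer is present in the filtered candidate list, i.e. dividing raw accuracy by the upper bound. A minimal sketch of that adjustment; the raw accuracy and upper bound below are illustrative values, not the paper's:

```python
# Upper-bound-adjusted accuracy: raw accuracy divided by the fraction of
# cases whose filtered candidate list contains a correct answer.

def adjusted_accuracy(raw_accuracy, upper_bound):
    return raw_accuracy / upper_bound

print(adjusted_accuracy(0.6, 0.8))  # -> 0.75
```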
    </Section>
  </Section>
</Paper>