<?xml version="1.0" standalone="yes"?>
<Paper uid="P05-1020">
  <Title>Machine Learning for Coreference Resolution: From Local Classification to Global Ranking</Title>
  <Section position="5" start_page="160" end_page="161" type="evalu">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="160" end_page="160" type="sub_section">
      <SectionTitle>
4.1 Experimental Setup
</SectionTitle>
      <Paragraph position="0"> For evaluation purposes, we use the ACE (Automatic Content Extraction) coreference corpus, which is composed of three data sets created from three different news sources, namely, broadcast news (BNEWS), newspaper (NPAPER), and newswire (NWIRE).5 Statistics of these data sets are shown in Table 2. In our experiments, we use the training texts to acquire coreference classifiers and evaluate the resulting systems on the test texts with respect to two commonly-used coreference scoring programs: the MUC scorer (Vilain et al., 1995) and the B-CUBED scorer (Bagga and Baldwin, 1998).</Paragraph>
    </Section>
    <Section position="2" start_page="160" end_page="161" type="sub_section">
      <SectionTitle>
4.2 Results Using the MUC Scorer
</SectionTitle>
      <Paragraph position="0"> Baseline systems. We employ as our baseline systems two existing coreference resolvers: our duplication of the Soon et al. (2001) system and the Ng and Cardie (2002b) system. Both resolvers adopt the standard machine learning approach and therefore can be characterized using the four elements discussed in Section 3.1. Specifically, Soon et al.'s system employs a decision tree learner to train a coreference classifier on instances created by Soon's method and represented by Soon's feature set, coordinating the classification decisions via closest-first clustering. Ng and Cardie's system, on the other hand, employs RIPPER to train a coreference classifier on instances created by N&amp;C's method and represented by N&amp;C's feature set, inducing a partition on the given NPs via best-first clustering.</Paragraph>
      <Paragraph position="1"> The baseline results are shown in rows 1 and 2 of Table 3, where performance is reported in terms of recall, precision, and F-measure. As we can see, the N&amp;C system outperforms the Duplicated Soon system by about 2-6% on the three ACE data sets.</Paragraph>
      <Paragraph position="2">  Our approach. Recall that our approach uses labeled data to train both the coreference classifiers and the ranking model. To ensure a fair comparison of our approach with the baselines, we do not rely on additional labeled data for learning the ranker; instead, we use half of the training texts for training classifiers and the other half for ranking purposes.</Paragraph>
      <Paragraph position="3"> Results using our approach are shown in row 3 of Table 3. Our ranking model, when trained to optimize for F-measure using both partition-based features and method-based features, consistently provides substantial gains in F-measure over both baselines. In comparison to the stronger baseline (i.e., N&amp;C), F-measure increases by 7.4, 7.2, and 4.6 for the BNEWS, NPAPER, and NWIRE data sets, respectively. Perhaps more encouragingly, gains in F-measure are accompanied by simultaneous increase in recall and precision for all three data sets.</Paragraph>
      <Paragraph position="4"> Feature contribution. In an attempt to gain additional insight into the contribution of partition-based features and method-based features, we train our ranking model using each type of features in isolation. Results are shown in rows 4 and 5 of Table 3. For the NPAPER and NWIRE data sets, we still see gains in F-measure over both baseline systems when the model is trained using either type of features. The gains, however, are smaller than those observed when the two types of features are applied in combination. Perhaps surprisingly, the results for BNEWS do not exhibit the same trend as those for the other two data sets. Here, the method-based features alone are strongly predictive of good candidate partitions, yielding even slightly better performance than when both types of features are applied. Overall, however, these results seem to suggest that both partition-based and method-based features are important to learning a good ranking model.</Paragraph>
      <Paragraph position="5"> Random ranking. An interesting question is: how much does supervised ranking help? If all of our candidate partitions are of very high quality, then ranking will not be particularly important because choosing any of these partitions may yield good results. To investigate this question, we apply a random ranking model, which randomly selects a candidate partition for each test text. Row 6 of Table 3 shows the results (averaged over five runs) when the random ranker is used in place of the supervised  ranker. In comparison to the results in row 3, we see that the supervised ranker surpasses its random counterpart by about 9-13% in F-measure, implying that ranking plays an important role in our approach.</Paragraph>
      <Paragraph position="6"> Perfect ranking. It would be informative to see whether our ranking model is performing at its upper limit, because further performance improvement beyond this point would require enlarging our set of candidate partitions. So, we apply a perfect ranking model, which uses an oracle to choose the best candidate partition for each test text. Results in row 7 of Table 3 indicate that our ranking model performs at about 1-3% below the perfect ranker, suggesting that we can further improve coreference performance by improving the ranking model.</Paragraph>
    </Section>
    <Section position="3" start_page="161" end_page="161" type="sub_section">
      <SectionTitle>
4.3 Results Using the B-CUBED Scorer
</SectionTitle>
      <Paragraph position="0"> Baseline systems. In contrast to the MUC results, the B-CUBED results for the two baseline systems are mixed (see rows 1 and 2 of Table 4). Specifically, while there is no clear winner for the NWIRE data set, N&amp;C performs better on BNEWS but worse on NPAPER than the Duplicated Soon system.</Paragraph>
      <Paragraph position="1"> Our approach. From row 3 of Table 4, we see that our approach achieves small but consistent improvements in F-measure over both baseline systems. In comparison to the better baseline, F-measure increases by 0.1, 1.1, and 2.0 for the BNEWS, NPA-PER, and NWIRE data sets, respectively.</Paragraph>
      <Paragraph position="2"> Feature contribution. Unlike the MUC results, using more features to train the ranking model does not always yield better performance with respect to the B-CUBED scorer (see rows 3-5 of Table 4). In particular, the best result for BNEWS is achieved using only method-based features, whereas the best result for NPAPER is obtained using only partition-based features. Nevertheless, since neither type of features offers consistently better performance than the other, it still seems desirable to apply the two types of features in combination to train the ranker.</Paragraph>
      <Paragraph position="3"> Random ranking. Comparing rows 3 and 6 of Table 4, we see that the supervised ranker yields a non-trivial improvement of 2-3% in F-measure over the random ranker for the three data sets. Hence, ranking still plays an important role in our approach with respect to the B-CUBED scorer despite its modest performance gains over the two baseline systems.</Paragraph>
      <Paragraph position="4"> Perfect ranking. Results in rows 3 and 7 of Table 4 indicate that the supervised ranker underperforms the perfect ranker by about 5% for BNEWS and 3% for both NPAPER and NWIRE in terms of F-measure, suggesting that the supervised ranker still has room for improvement. Moreover, by comparing rows 1-2 and 7 of Table 4, we can see that the perfect ranker outperforms the baselines by less than 5%. This is essentially an upper limit on how much our approach can improve upon the baselines given the current set of candidate partitions. In other words, the performance of our approach is limited in part by the quality of the candidate partitions, more so with B-CUBED than with the MUC scorer.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>