<?xml version="1.0" standalone="yes"?>
<Paper uid="W99-0634">
  <Title>Corpus-Based Learning for Noun Phrase Coreference Resolution</Title>
  <Section position="7" start_page="287" end_page="289" type="evalu">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> In order to evaluate the performance of our learning approach to coreference resolution on a common data set, we utilized the annotated corpus and scoring program from MUC-6, which assembled a set of newswire documents annotated with coreference chains. Although we did not participate in MUC-6, we were able to obtain the MUC-6 training and test corpus from the MUC organizers for research purpose. 1 30 dry-run documents annotated with coreference information were used as the training documents for our coreference engine. After training the engine, we tested its accuracy on the 30 formal test documents in MUC-6. These 30 test documents are exactly those used to evaluate the systems that participated in the MUC-6 evaluation.</Paragraph>
    <Paragraph position="1"> Our implemented system runs on a Pentium II 400MHz PC. The total size of the 30 training documents is close to 13,000 words. It took less than five minutes to generate the training examples from these training documents. The training time for the C4.5 algorithm to generate a decision tree from all the training examples was about 30 seconds. The decision tree classifier learned (using a pruning confidence level of 25%) is shown in Figure 1.</Paragraph>
    <Paragraph position="2"> One advantage of using a decision tree learning algorithm is that the resulting decision tree classifier built can be interpreted by human.</Paragraph>
    <Paragraph position="3"> The decision tree in Figure 1 seems to encapsulate a reasonable rule-of-thumb that matches our intuitive linguistic notion of when two noun phrases can co-refer. It is also interesting to note that only five out of the ten available features in the training examples are actually used in the final decision tree built.</Paragraph>
    <Paragraph position="4">  When given new test documents, the output of the coreference engine is in the form of SGML files with the coreference chains properly annotated according to the MUC-6 guidelines. The time taken to generate the coreference chains for 30 test documents of close to 14,000 words was less than three minutes. We then used the scorer program of MUC-6 to generate the recall and precision score for our coreference engine.</Paragraph>
    <Paragraph position="5"> Our coreference engine achieves a recall of 52% and a precision of 68%, yielding a balanced F-measure of 58.9%. We plotted the score of our coreference engine (square-shaped) against the other official test scores of MUC-6 systems (cross-shaped) in Figure 2. We also plotted the learning curve of our coreference engine in Figure 3, showing its accuracy averaged over five random trials when trained on 5, 10, ..., 30 training documents.</Paragraph>
    <Paragraph position="6"> Our score is in the upper region of the MUC-6 systems. We performed a simple two-tailed, paired t-test at p = 0.05 to determine whether the difference between our system's F-measure scores and each of the other MUC-6 systems' F-measure scores on the 30 formal test documents is statistically significant. We found that at the 95% significance level, our system performed worse than one, better than two, and as well as the rest of the MUC-6 systems. Our result is encouraging as it indicates that a learning approach using relatively shallow features and a small number of training documents can lead to scores that are comparable to systems built us- null ing non-learning approaches.</Paragraph>
    <Paragraph position="7"> It should be noted that the accuracy of our coreference resolution engine depends to a large extent on the performance of the NLP modules that are executed before the coreference engine. Our current learning-based, HMM named entity recognition module is trained on 318 documents (a disjoint set from the 30 formal test documents) tagged with named entities, and its score on the MUC-6 named entity task for the 30 formal test documents is only 88.9%, which is not considered very high by MUC-6 standard.</Paragraph>
    <Paragraph position="8"> For example, our named entity recognizer could  not identify the two named entities &amp;quot;USAir&amp;quot; and &amp;quot;Piedmont&amp;quot; in the expression &amp;quot;USAir and Piedmont&amp;quot; but instead treat it as one single named entity. Also, some of the features such as number agreement, gender agreement and semantic class agreement are difficult to determine at times. For example, &amp;quot;they&amp;quot; is sometimes used to refer to &amp;quot;the government&amp;quot; even though superficially both do not seem to agree in number. All these problems hurt the performance of the coreference engine.</Paragraph>
  </Section>
class="xml-element"></Paper>