<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-1020">
  <Title>Automatic Evaluation of Summaries Using N-gram Co-Occurrence Statistics</Title>
  <Section position="6" start_page="400" end_page="400" type="evalu">
    <SectionTitle>
5 Conclusions
</SectionTitle>
    <Paragraph position="0"> In this paper, we gave a brief introduction of the manual summary evaluation protocol used in the Document Understanding Conference. We then discussed the IBM BLEU MT evaluation metric, its application to summary evaluation, and the difference between precision-based BLEU translation evaluation and recall-based DUC summary evaluation. The discrepancy led us to examine the effectiveness of individual n-gram co-occurrence statistics as a substitute for expensive and error-prone manual evaluation of summaries. To evaluate the performance of automatic scoring metrics, we proposed two test criteria. One was to make sure system rankings produced by automatic scoring metrics were similar to human rankings. This was quantified by Spearmans rank order correlation coefficient and three other parametric correlation coefficients. Another was to compare the statistical significance test results between automatic scoring metrics and human assessments. We used recall and precision of the agreement between the test statistics results to identify good automatic scoring metrics.</Paragraph>
    <Paragraph position="1"> According to our experiments, we found that unigram co-occurrence statistics is a good automatic scoring metric. It consistently correlated highly with human assessments and had high recall and precision in significance test with manual evaluation results. In contrast, the weighted average of variable length n-gram matches derived from IBM BLEU did not always give good correlation and high recall and precision. We surmise that a reason for the difference between summarization and machine translation might be that extraction-based summaries do not really suffer from grammar problems, while translations do. Longer n-grams tend to score for grammaticality rather than content.</Paragraph>
    <Paragraph position="2"> It is encouraging to know that the simple unigram co-occurrence metric works in the DUC 2001 setup. The reason for this might be that most of the systems participating in DUC generate summaries by sentence extraction. We plan to run similar experiments on DUC 2002 data to see if unigram does as well. If it does, we will make available our code available via a website to the summarization community.</Paragraph>
    <Paragraph position="3"> Although this study shows that unigram co-occurrence statistics exhibit some good properties in summary evaluation, it still does not correlate to human assessment 100% of the time. There is more to be desired in the recall and precision of significance test agreement with manual evaluation. We are starting to explore various metrics suggested in Donaway et al. (2000). For example, weight n-gram matches differently according to their information content measured by tf, tfidf, or SVD. In fact, NIST MT automatic scoring metric (NIST 2002) already integrates such modifications.</Paragraph>
    <Paragraph position="4"> One future direction includes using an automatic question answer test as demonstrated in the pilot study in SUMMAC (Mani et al. 1998). In that study, an automatic scoring script developed by Chris Buckley showed high correlation with human evaluations, although the experiment was only tested on a small set of</Paragraph>
  </Section>
</Paper>