<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0901">
  <Title>A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate?</Title>
  <Section position="9" start_page="6" end_page="7" type="concl">
    <SectionTitle>
7 Conclusion
</SectionTitle>
    <Paragraph position="0"> We have shown that two types of human summaries, HEAD and HUM, can be useful for relevance assessment in that they help a user achieve 70-85% agreement in relevance judgments. We observed a 65% reduction in judgment time between full texts and summaries. These findings are important in that they establish the usefulness of summarization and they support research and development of additional summarization methods, including automatic methods.</Paragraph>
    <Paragraph position="1"> We introduced a new method for measuring agreement, Relevance-Prediction, which takes a subject's full-text judgment as the standard against which the same subject's summary judgment is measured. Because Relevance-Prediction was more reliable than LDC-Agreement judgments, we encourage others to use this measure in future summarization evaluations.</Paragraph>
    <Paragraph position="2"> Using this new method, we were able to find positive correlations between relevance assessments and ROUGE scores for HUM and HEAD surrogates, where only  negative correlations were found using LDC-Agreement scores. We found that both the Relevance-Prediction and the ROUGE-1 scores were higher for human-generated summaries than for the original headlines. It appears that most of the difference is induced by surrogates that are eye-catchers (rather than true summaries), where both agreement and ROUGE scores are low.</Paragraph>
    <Paragraph position="3"> Our future work will include further experimentation with automatic summarization methods to determine the level of Relevance-Prediction. We aim to determine how well automatic summarizers help users complete tasks, and to investigate which automatic summarizers perform better than others. We also plan to test for correlations between ROUGE and human task performance with automatic summaries, to further investigate whether ROUGE is a good predictor of human task performance.</Paragraph>
  </Section>
class="xml-element"></Paper>