<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-0901">
  <Title>A Methodology for Extrinsic Evaluation of Text Summarization: Does ROUGE Correlate?</Title>
  <Section position="8" start_page="6" end_page="6" type="evalu">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> Our results suggest that ROUGE may be sensitive to the style of summarization that is used. As we observed above, many of the HEAD surrogates were not actually summaries of the full text, but were eye-catchers. Often, these surrogates did not allow the subject to judge relevance correctly, resulting in lower agreement. In addition, these same surrogates often did not use a high percentage of words that were actually from the story, resulting in low ROUGE scores. (We noticed that most words in the HUM surrogates appeared in the corresponding stories.) There were three consequences of this difference between HEAD and HUM: (1) The rate of agreement was lower for HEAD than for HUM; (2) The average ROUGE score was lower for HEAD than for HUM; and (3) The correlation of ROUGE scores with agreement was higher for HEAD than for HUM.</Paragraph>
    <Paragraph position="1"> A further analysis supports the (somewhat counterintuitive) third point above. Although the ROUGE scores of true positives (and true negatives) were significantly lower for HEAD surrogates (0.2127 and 0.2162) than for HUM surrogates (0.2696 and 0.2715), the number of false negatives was substantially higher for HEAD surrogates than for HUM surrogates. These cases corresponded to much lower ROUGE scores for HEAD surrogates (0.1996) than for HUM (0.2586) surrogates.</Paragraph>
    <Paragraph position="2"> A summary of this analysis is given in Table 6, where true positives and negatives are indicated by Rel/Rel and NonRel/NonRel, respectively, and false positives and negatives are indicated by Rel/NonRel and NonRel/Rel, respectively.10 The numbers in parentheses after each ROUGE score refer to the standard deviation for that 10We also included (average) elapsed times for summary judgments in each of the four categories. One might expect a &amp;quot;relevant&amp;quot; judgment to be much quicker than a &amp;quot;non-relevant&amp;quot; judgment (since the latter might require reading the full summary). However, it turned out non-relevant judgments did not always take longer. In fact, the NonRel/NonRel cases took considerably less time than the Rel/Rel and Rel/NonRel cases. On the other hand, the NonRel/Rel cases took considerably more time--almost as much time as reading the full text documents-an indication that the subjects may have re-read the summary a number of times, perhaps vacillating back and forth. Still, the overall time savings was significant, given that the vast majority of the non-relevant judgments were in the NonRel/NonRel category.</Paragraph>
    <Paragraph position="3"> score. This was computed as follows:</Paragraph>
    <Paragraph position="5"> where N is the number of surrogates in a particular judgment category (e.g., N = 245 for the HEAD-based Non-Rel/Rel judgments), xi is the ROUGE score for the ith surrogate, and -r is the average of all ROUGE scores in that category.</Paragraph>
    <Paragraph position="6"> Although there were very few false positives (less than 6% for both HEAD and HUM), the number of false negatives (NonRel/Rel) was particularly high for HEAD (50% higher than for HUM). This difference was statistically significant at p&lt;0.01 using the t-test. The large number of false negatives with HEAD may be attributed to the eye-catching nature of these surrogates. A subject may be misled into thinking that this surrogate is not related to an event because the surrogate does not contain words from the event description and is too broad for the subject to extract definitive information (e.g., the surrogate There he goes again!). Because the false negatives were associated with the lowest average ROUGE score (0.1996), we speculate that, if a correlation exists between Relevance-Prediction and ROUGE, the false negatives may be a major contributing factor.</Paragraph>
    <Paragraph position="7"> Based on this experiment, we conjecture that ROUGE may not be a good method for measuring the usefulness of summaries when the summaries are not extractive. That is, if someone intentionally writes summaries that contain different words than the story, the summaries will also likely contain different words than a reference summary, resulting in low ROUGE scores. However, the summaries, if well-written, could still result in high agreement with the judgments made on the full text.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML