<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0406"> <Title>Manual and Automatic Evaluation of Summaries</Title> <Section position="7" start_page="31" end_page="31" type="concl"> <SectionTitle> 6 Conclusions </SectionTitle> <Paragraph position="0"> We described manual and automatic evaluation of single and multi-document summarization in DUC-2001. We showed the instability of We thank Kishore Papineni for sending us BLEU 1.0.</Paragraph> <Paragraph position="1"> Table 2. Pairwise relative system performance (multi-document summarization task).</Paragraph> <Paragraph position="3"> human evaluations and the need to consider this factor when comparing system performances.</Paragraph> <Paragraph position="4"> As we factored in the instability, systems tended to form separate performance groups. One should treat with caution any interpretation of performance figures that ignores this instability. Automatic evaluation of summaries using accumulative n-gram matching scores seems promising. System rankings using NAMS and retention ranking had a Spearman rank-order correlation coefficient above 97%. Using stemmers improved the correlation. However, satisfactory correlation is still elusive. The main problem we ascribe to automated summary evaluation is the large expressive range of English since human summarizers tend to create fresh text. No n-gram matching evaluation procedure can overcome the paraphrase or synonym problem unless (many) model summaries are available.</Paragraph> <Paragraph position="5"> We conclude the following: (1) We need more than one model summary although we cannot estimate how many model summaries are required to achieve reliable automated summary evaluation.</Paragraph> <Paragraph position="6"> (2) We need more than one evaluation for each summary against each model summary.</Paragraph> <Paragraph position="7"> (3) We need to ensure a single rating for each system unit.</Paragraph> </Section> class="xml-element"></Paper>