<?xml version="1.0" standalone="yes"?>
<Paper uid="P02-1058">
  <Title>From Single to Multi-document Summarization: A Prototype System and its Evaluation</Title>
  <Section position="7" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
5 Results and Discussion
</SectionTitle>
    <Paragraph position="0"> We present the performance of NeATS in DUC-2001 in content and quality measures.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Content
</SectionTitle>
      <Paragraph position="0"> With respect to content, we computed Retention1, Retention w, and Precisionp using the formulas defined in the previous section.</Paragraph>
      <Paragraph position="1"> The scores are shown in Table 1 (overall average and per size). Analyzing all systems' results according to these, we made the following observations.</Paragraph>
      <Paragraph position="2">  (1) NeATS (system N) is consistently ranked among the top 3 in average and per size Retention1 and Retention w.</Paragraph>
      <Paragraph position="3"> (2) NeATS's performance for averaged pseudo precision equals human's at about 58% (Pp all). (3) The performance in weighted retention is really low. Even humans6 score only 29% (Rw  all). This indicates low inter-human agreement (which we take to reflect the undefinedness of the 'generic summary' task). However, the unweighted retention of humans is 53%. This suggests assessors did write something similar in their summaries but not exactly the same; once again illustrating the difficulty of summarization evaluation.</Paragraph>
      <Paragraph position="4"> (4) Despite the low inter -human agreement, humans score better than any system. They outscore the nearest system by about 11% in averaged unweighted retention (R1 all : 53% vs. 42%) and weighted retention (Rw all : 29% vs. 18%). There is obviously still considerable room for systems to improve.</Paragraph>
      <Paragraph position="5"> (5) System performances are separated into two major groups by baseline 2 (B2: coverage baseline) in averaged weighted retention. This confirms that lead sentences are good summary sentence candidates and that one does need to cover all documents in a topic to achieve reasonable performance in multi-document summarization. NeATS's strategies of filtering sent ences by position and adding lead sentences to set context are proved  effective.</Paragraph>
      <Paragraph position="6"> (6) Different metrics result in different  performance rankings. This is demonstrated by the top 3 systems T, N, and Y. If we use the averaged unweighted retention (R1 all), Y is 6 NIST assessors wrote two separate summaries per topic. One was used to judge all system summaries and the two baselines. The other was used to determine the (potential) upper bound.</Paragraph>
      <Paragraph position="7">  the best, followed by N, and then T; if we choose averaged weighted retention (Rw all), T is the best, followed by N, and then Y. The reversal of T and Y due to different metrics demonstrates the importance of common agreed upon metrics. We believe that metrics have to take coverage score (C, Section 4.1.1) into consideration to be reasonable since most of the content sharing among system units and model units is partial. The recall at threshold t, Recallt (Section 4.1.1), proposed by (McKeown et al. 2001), is a good example. In their evaluation, NeATS ranked second at t=1, 3, 4 and first at t=2.</Paragraph>
      <Paragraph position="8"> (7) According to Table 1, NeATS performed better on longer summaries (400 and 200 words) based on weighted retention than it did on shorter ones. This is the result of the sentence extraction-based nature of NeATS.</Paragraph>
      <Paragraph position="9"> We expect that systems that use syntax-based algorithms to compress their output will thereby gain more space to include additional important material. For example, System Y was the best in shorter summaries. Its 100and 50-word summaries contain only important headlines. The results confirm this is a very effective strategy in composing short summaries. However, the quality of the summaries suffered because of the unconventional syntactic structure of news headlines (Table 2).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Quality
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the macro-averaged scores for the humans, two baselines, and 12 systems.</Paragraph>
      <Paragraph position="1"> We assign a score of 4 to all, 3 to most, 2 to some, 1 to hardly any, and 0 to none. The value assignment is for convenience of computing averages, since it is more appropriate to treat these measures as stepped values instead of continuous ones. With this in mind, we have the following observations.</Paragraph>
      <Paragraph position="2"> (1) Most systems scored well in grammaticality. This is not a surprise since most of the participants extracted sentences as summaries.</Paragraph>
      <Paragraph position="3"> But no system or human scored perfect in grammaticality. This might be due to the artifact of cutting sentences at the 50, 100, 200, and 400 words boundaries. Only system Y scored lower than 3, which reflects its headline inclusion strategy.</Paragraph>
      <Paragraph position="4"> (2) When it came to the measure for cohesion the results are confusing. If even the human-made summaries score only 2.74 out of 4, it is unclear what this category means, or how the assessors arrived at these scores. However, the humans and baseline 1 (lead baseline) did score in the upper range of 2 to 3 and all others had scores lower than 2.5. Some of the systems (including B2) fell into the range of 1 to 2 meaning some or hardly any cohesion.</Paragraph>
      <Paragraph position="5"> The lead baseline (B1), taking the first 50, 100, 200, 400 words from the last document of a topic, did well. On the contrary, the coverage baseline (B2) did poorly. This indicates the difficulty of fitting sentences from different documents together. Even selecting continuous sentences from the same document (B1) seems not to work well. We need to define this metric more clearly and improve the capabilities of systems in this respect.</Paragraph>
      <Paragraph position="6"> (3) Coherence scores roughly track cohesion scores. Most systems did better in coherence than in cohesion. The human is the only one scoring above 3. Again the room for improvement is abundant.</Paragraph>
      <Paragraph position="7"> (4) NeATS did not fare badly in quality measures. It was in the same categories as other top performers: grammaticality is between most and all, cohesion, some and most , and coherence, some and most. This indicates the strategies employed by NeATS (stigma word filtering, adding lead sentence, and time annotation) worked to some extent but left room for improvement.</Paragraph>
    </Section>
  </Section>
</Paper>