
<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1108">
  <Title>Improving Chronological Sentence Ordering by Precedence Relation</Title>
  <Section position="4" start_page="0" end_page="0" type="evalu">
    <SectionTitle>
3 Evaluation
</SectionTitle>
    <Paragraph position="0"> In this section we describe our experiment to test the effectiveness of the proposed method.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Experiment and evaluation metrics
</SectionTitle>
      <Paragraph position="0"> We conducted an experiment of sentence ordering through multi-document summarization to test the effectiveness of the proposed method.</Paragraph>
      <Paragraph position="1"> We utilized the TSC-3 (Hirao et al., to appear in 2004) test collection, which consists of 30 sets of multi-document summarization tasks. For more information about TSC-3 task, see the workshop proceedings. Performing an important sentence extraction (Okazaki et al., to appear in 2004) up to the specified number of sentences (approximately 10% of summarization rate), we made a material for a summary (i.e., extracted sentences) for each task. We order the sentences by six methods: human-made ordering (HO) as the highest anchor; random ordering (RO) as the lowest anchor; chronological ordering (CO) (i.e., phase 2 only); chronological ordering with topical segmentation (COT) (i.e., phases 1 and 2); proposed method without topical segmentation (PO) (i.e., phases 2 and 3); and proposed method with topical segmentation (POT)). We asked human judges to evaluate sentence ordering of these summaries.</Paragraph>
      <Paragraph position="2"> The first evaluation task is a subjective grading where a human judge marks an ordering of summary sentences on a scale of 4: 4 (perfect), 3 (acceptable), 2 (poor), and 1 (unacceptable). We give a clear criterion of scoring to the judges as follows. A perfect summary is a text that we cannot improve any further by re-ordering. An acceptable summary is a one that makes sense and is unnecessary to be revised even though there may be some room for improvement in terms of readability. A poor summary is a one that loses a thread of the story at some places and requires minor amendment to bring it up to the acceptable level. An unacceptable summary is a one that leaves much to be improved and requires overall restructuring rather than partial revision. Additionally, we inform the judges that summaries were made of the same set of extracted sentences and only sentence ordering made differences between the summaries in order to avoid any disturbance in rating.</Paragraph>
      <Paragraph position="3"> In addition to the rating, it is useful that we examine how close an ordering is to an acceptable one when the ordering is regarded as poor. Considering that several sentence-ordering patterns are acceptable for a given summary, we An ordering to evaluate: The corrected ordering: s5, s6, s7, s8, s1, s2, s9, s3, s4 s5, s6, s7, s9, s2, s8, s1, s3, s4  think that it is valuable to measure the degree of correction because this metric virtually requires a human corrector to prepare a correct answer for each ordering in his or her mind. Therefore, a human judge is supposed to illustrate how to improve an ordering of a summary when he or she marks the summary with poor in the rating task. We restrict applicable operations of correction to move operation so as to keep the minimum correction of the ordering. We define a move operation here as removing a sentence and inserting the sentence into an appropriate place (see Figure 5).</Paragraph>
      <Paragraph position="4"> Supposing a sentence ordering to be a rank, we can calculate rank correlation coefficient of a permutation of an ordering pi and a permutation of the reference ordering s. Let {s1,...,sn} be a set of summary sentences identified with index numbers from 1 to n. We define a permutation pi [?] Sn to denote an ordering of sentences where pi(i) represents an order of sentence si. Similarly, we define a permutation s [?] Sn to denote the corrected ordering. For example, the pi and s in Figure 5 will be:</Paragraph>
      <Paragraph position="6"> Spearman's rank correlation ts(pi,s) and Kendall's rank correlation tk(pi,s) are known as famous rank correlation metrics.</Paragraph>
      <Paragraph position="8"> ings in percent figures.</Paragraph>
      <Paragraph position="9"> where sgn(x) = 1 for x &gt; 0 and [?]1 otherwise. These metrics range from [?]1 (an inverse rank) to 1 (an identical rank) via 0 (a non-correlated rank). In the example shown in Equations 2 and 3 we obtain ts(pi,s) = 0.85 and tk(pi,s) = 0.72. We propose another metric to assess the degree of sentence continuity in reading, tc(pi,s):</Paragraph>
      <Paragraph position="11"> from 0 (no continuity) to 1 (identical). The summary in Figure 5 may interrupt judge's reading after sentence S7, S1, S2 and S9 as he or she searches a next sentence to read. Hence, we observe four discontinuities in the ordering and calculate sentence continuity tc(pi,s) = (9[?]4)/9 = 0.56.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Results
</SectionTitle>
      <Paragraph position="0"> Table 1 shows distribution of rating score of each method in percent figures. Judges marked about 75% of human-made ordering (HO) as either perfect or acceptable while they rejected as many as 95% of random ordering (RO). Chronological ordering (CO) did not yield satisfactory result losing a thread of 63% summaries although CO performed much better than RO.</Paragraph>
      <Paragraph position="1"> Topical segmentation could not contribute to ordering improvement of CO as well: COT is slightly worse than CO. After taking an in-depth look at the failure orderings, we found the topical clustering did not perform well during this test. We suppose the topical clustering could not prove the merits with this test collection because the collection consists of relevant articles retrieved by some query and polished well by a human so as not to include unrelated articles to a topic.</Paragraph>
      <Paragraph position="2"> On the other hand, the proposed method (PO) improved chronological ordering much better than topical segmentation. Note that the sum of perfect and acceptable ratio jumped up from 36% (CO) to 55% (PO). This shows the ordering refinement by precedence relation improves chronological ordering by pushing poor ordering to an acceptable level.</Paragraph>
      <Paragraph position="3"> Table 2 reports closeness of orderings to the corrected ones with average scores (AVG) and the standard deviations (SD) of the three metrics ts, tk and tc. It appears that average figures shows similar tendency to the rating task with three measures: HO is the best; PO is better than CO; and RO is definitely the worst. We applied one-way analysis of variance (ANOVA) to test the effect of four different methods (RO, CO, PO and HO). ANOVA proved the effect of the different methods (p &lt; 0.01) for three metrics. We also applied Tukey test to compare the difference between these methods. Tukey test revealed that RO was definitely the worst with all metrics. However, Spearman's rank correlation tS and Kendall's rank correlation tk failed to prove the significant difference between CO, PO and HO. Only sentence continuity tc proved PO is better than CO; and HO is better than CO (a = 0.05). The Tukey test proved that sentence continuity has better conformity to the rating results and higher discrimination to make a comparison.</Paragraph>
      <Paragraph position="4"> Table 3 shows closeness of orderings to ones made by human (all results of HO should be 1 by necessity). Although we found RO is clearly the worst as well as other results, we cannot find the significant difference between CO, PO, and HO with all metrics. This result presents to the difficulty of automatic evaluation by preparing one correct ordering.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>