
<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0502">
  <Title>Sub-event based multi-document summarization</Title>
  <Section position="4" start_page="3" end_page="3" type="metho">
    <SectionTitle>
3. Article Corpus
</SectionTitle>
    <Paragraph position="0"> Our study involves two experiments carried out on one corpus of news articles. The article corpus was selected from a cluster of eleven articles describing the 2000 crash of Gulf Air flight 072. From these articles we chose a corpus of five articles, containing a total of 159 sentences.</Paragraph>
    <Paragraph position="1"> All the articles cover a single news event, the plane crash and its aftermath. The articles were gathered on the web from sources reporting on the event as it unfolded, and come from various news agencies, such as ABC News, Fox News, and the BBC. All of the articles give some discussion of the events leading up to and following the crash, with particular articles focusing on areas of special interest, such as the toll on Egypt, from where many of the passengers had come. The article titles in Table 1, below, illustrate the range of sub-events that are covered under the crash topic.</Paragraph>
  </Section>
  <Section position="5" start_page="3" end_page="3" type="metho">
    <SectionTitle>
4. Experiment 1: Sub-Event Analysis
</SectionTitle>
    <Paragraph position="0"> Our first experiment involved having human judges analyze the sentences in our corpus for degree of saliency to a series of sub-events comprising the topic.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
4.1 Description of Sub-Event User Study
</SectionTitle>
      <Paragraph position="0"> The goal of this experiment was to study the effectiveness of breaking a news topic down into sub-events, in order to capture not simply salience, but also diversity (Goldstein, 1998).</Paragraph>
      <Paragraph position="1"> The sub-events were chosen to cover all of the material in the reports and to represent the most significant aspects of the news topic. For the Gulf Air crash, we determined that the sub-events were:  1. The plane takes off 2. Something goes wrong 3. The plane crashes 4. Rescue and recovery effort 5. Gulf Air releases information 6. Government agencies react 7. Friends, relatives and nations mourn 8. Black box(es) are searched for 9. Black box(es) are recovered 10. Black box(es) are sent for analysis  We instructed judges to rank the degree of sentence relevance to each sub-event. Judges were instructed to use a scale, such that a score of ten indicated that the sentence was critical to the sub-event, and a score of 0 indicated that the sentence was irrelevant. Thus, the judges processed the 159 sentences from 5 documents ten times, once pertaining to each sub-event. This experiment produced for each judge 1590 data points which were analyzed according to the methods described in the next section.</Paragraph>
      <Paragraph position="2"> We used the data on the relevance of the sentences to the sub-events to calculate inter-judge agreement. In this manner, we determined which sentences had the overall highest relevance to each subevent. We used this ranking to produce summaries at different levels of compression.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="3" end_page="3" type="metho">
    <SectionTitle>
5. Methods for Producing Summaries
</SectionTitle>
    <Paragraph position="0"> To gather data about the effectiveness of dividing news topics into their sub-events for creating summaries, we utilized data from human judges, upon which we manually performed three algorithms. These algorithms and their application are described in detail below. We were interested to determine if the Round Robin method (described below,) which has been used by McKeown et al. (1999), Boros et al. (2001) and by Hatzivassiloglou et al. (2001), was the most effective.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.1 Sub-Event-Based Algorithms
</SectionTitle>
      <Paragraph position="0"> After collecting judges' scores of relevance for each sentence for each subtopic, we then ranked the sentences according to three different algorithms to create multiple-document summaries. From this data, we created summary extracts using three algorithms, as follows: Algorithm 1) Highest Score Anywhere - pick the sentence which is most relevant to any subevent, no matter the subevent; pick the next sentence which is most relevant to any subevent, etc.</Paragraph>
      <Paragraph position="1"> Algorithm 2) Sum of All Scores - for each sentence, sum its relevance score for each cluster, pick the sentence with the highest sum; then pick the sentence with the second highest sum, etc.</Paragraph>
      <Paragraph position="2"> Algorithm 3) Round Robin - pick the sentence which has the most relevance for subevent 1, pick the sentence with the most relevance for subevent 2, etc. After picking 1 sentence from each subevent, pick the sentence with the 2nd best relevance to subevent 1, etc.</Paragraph>
      <Paragraph position="3"> Judge 1 Judge 2 Judge 3 Judge 1 Judge 2 Judge 3 Judge 1 Judge 2 Judge 3</Paragraph>
      <Paragraph position="5"> on the degree of sentence relevancy. Some sentences are used in more than one sub-event.</Paragraph>
      <Paragraph position="6"> Algorithm 1 - Highest Score Anywhere (HSA): This algorithm was produced by summing the data across all judges to produce a total inter-judge score and keeping sub-events distinct, to see the inter-judge utility scores given to sub-events. We ordered the sentences by ranking these scores in descending order and omitting duplicates, to produce the ten and twenty percent extracts. For example, with data from seven judges on ten sub-events, the highest possible score for each sentence was seventy. Thus seventy was the highest score.</Paragraph>
      <Paragraph position="7"> In the case that there was a tie between sentences, we ordered them by sub-event number (first sub-event first and tenth sub-event last).</Paragraph>
      <Paragraph position="8"> Algorithm 2 - Sum of All Scores (SAS): This algorithm was produced by summing the data across all judges to produce a total inter-judge score, and combining events so that we could see the utility scores given across sub-events. We ordered the sentences by ranking these cross-event inter-judge utility scores in descending order and omitting duplicates, to produce the ten and twenty percent extracts.</Paragraph>
      <Paragraph position="9"> Algorithm 3 - Round Robin (RR): This algorithm was produced by summing the data across all judges to produce a total inter-judge score and keeping sub-events distinct, to see the inter-judge utility scores given to sub-events. We ordered the sentences by ranking the inter-judge utility scores in descending order within each sub-event. We then chose the top sentence from each sub-event (one through ten), the second highest sentence from each sub-event, and so on, omitting duplicates, until we had produced the ten and twenty percent extracts.</Paragraph>
      <Paragraph position="10"> In this manner, we created thirty-six sub-event-based summary extracts - six clusters, three algorithms, two compression rates - which we then analyzed.</Paragraph>
      <Paragraph position="11"> The Sum of All Scores algorithm most closely replicates a centroid-based summary by combining the ten sub-event scores into one pan-topic score for each sentence. Further, the Sum of All Scores algorithm is the sub-event algorithm most likely to pick sentences with a high &amp;quot;general relevance,&amp;quot; which is what the baseline relative utility scores are meant to capture. In contrast, the Highest Score Anywhere algorithm maintains the structure of the sub-event breakdown, preferring the highest score in any sub-event. Likewise, the Round Robin algorithm maintains the sub-event breakdown, but rather than preferring the highest score in any event, it selects the highest score from each sub-event, serially; this algorithm most closely resembles the Lead-based automatic summarizer, and is at the heart of Hatzivassiloglou et al.'s (2001) SimFinder.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
5.2 Automatic Multi-Document
Summaries
</SectionTitle>
      <Paragraph position="0"> The three automatic summarization methods that we used in our comparison have already been established.</Paragraph>
      <Paragraph position="1"> We compared our manual summaries to these established automatic multiple-document summarization methods: Centroid-based (MEAD), Lead-based and Random.</Paragraph>
      <Paragraph position="2"> MEAD: First, we produced summaries using the MEAD system. MEAD produces a centroid (vector) for all of the sentences and then selects those sentences which are closest to the centroid. MEAD measures similarity with the cosine measurement and TF*IDF weighting. Mead also adjusts a sentence's score based on its length, its position in the original document and its similarity to sentences already selected for the extract. (Radev et al, 2000).</Paragraph>
      <Paragraph position="3"> Lead-Based: We also produced summaries by the Lead-based method. This method involves assigning the highest score to the first sentence in each article, then the second sentence in each article, and so on.</Paragraph>
      <Paragraph position="4"> Random: We created summaries with every possible combination of sentences for each summary length. This allowed us to compute the average random relative utility score.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="3" end_page="3" type="metho">
    <SectionTitle>
6. Relative Utility
</SectionTitle>
    <Paragraph position="0"> Following (Radev et al., 2000), we used relative utility as our metric. Relative utility was chosen for advantages in a couple of areas.</Paragraph>
    <Paragraph position="1"> Relative utility is a metric which measures sentence relevance. It allows us to distinguish the degree of importance between sentences, providing a more flexible model for evaluating sentence utility (Radev et al., 2000). Studies involving sentence extraction have often been predicated upon determining the usefulness of sentences as either useful or non-useful (Allan et al.</Paragraph>
    <Paragraph position="2"> 2001b). However, determining the usefulness of sentences is more complex than a simple a binary choice can account for. We employ a relative utility metric to account for subtleties in determining the saliency of sentences.</Paragraph>
    <Paragraph position="3"> Another advantage of the relative utility metric is that, although human judges have often agree very little on which sentences belong in a summary, they tend to agree on how important sentences are to a topic or event; thus, relative utility makes it possible to leverage this agreement.</Paragraph>
    <Paragraph position="4"> To calculate relative utility, we had human subjects assign a score to each sentence in a corpus of articles. The score reflects the subject's perception of a sentence's relevance to the overall topic of the corpus. The scale our judges were instructed to use ranged from zero to ten. A score of zero indicated that the sentence was irrelevant; whereas a score of ten indicated that the sentence was crucial to the understanding of the topic. So that judges' scores can be fairly compared, each judge's scores are normalized by the highest score and lowest score which that judge gives any sentence.</Paragraph>
    <Paragraph position="5"> Relative utility is determined by first adding together the utility scores given to each sentence by each judge. Each sentence in a summary is then awarded the total of the judges' scores for that sentence. Finally, the summary's total score is divided by the best possible score, given the size of the summary.</Paragraph>
    <Paragraph position="6"> For example, let us assume that a cluster has three sentences (A, B and C) which have been judged by two judges in the following way: A 10, 9, B 8, 6 and C 6, 5. That is, judge 1 gives sentence A a 10, while judge 2 gives sentence A a 9, and so on. In the first step, we sum the judges' scores for each sentence, yielding (A 19, B 14, C 11). If a summarizer has to pick a 2 sentence summary, and it picks A and C, its utility score is 30.</Paragraph>
    <Paragraph position="7"> We then divide this score by the best possible 2 sentence summary, in this case A and B, whose utility is 33, yielding a final relative utility of .91.</Paragraph>
  </Section>
  <Section position="8" start_page="3" end_page="3" type="metho">
    <SectionTitle>
7. Extract Creation
</SectionTitle>
    <Paragraph position="0"> Summaries can be created by abstracting or extracting [Mani, 2001]. For purposes of comparison with MEAD, an extractive summarizer, we used an extractive method to create all six summary types: sum of all scores, highest score anywhere, round robin, MEAD, lead-based, and random.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
7.1 Clusters
</SectionTitle>
      <Paragraph position="0"> Each of the summarization methods was employed at both ten and twenty percent compression rates. We used the summaries thus produced to consider how compression rates could influence the effectiveness of the six summarization methods. In our first experiment, we additionally looked at varying combinations of the five articles, such that we examined the corpus in six clusters, as shown in the figure below. We selected these article combinations to maximize the diversity of sources in each cluster, and to achieve a variable number of articles in a cluster.</Paragraph>
      <Paragraph position="1">  8. Results from the first experiment Some of our results met our expectations, while others surprised us (see Table 3). The Sum of All Scores manual algorithm produces the best summaries at the twenty percent compression rate. At the ten percent compression rate, data shows Lead-based summaries performing best, with the Sum of All Scores algorithm coming in right behind. Mead scores in the mid-range as expected, for both compression rates, just behind the Round Robin Algorithm. In contrast, the random method leads in low scores, with the Highest Score Anywhere algorithm coming in only slightly higher. Random sets the lower bound. Here, we discuss the details of our findings and their significance in more detail.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
8.1 Manual Algorithms
</SectionTitle>
      <Paragraph position="0"> Both the Sum of All Scores, and Round Robin algorithms performed better than MEAD, with the highest score anywhere algorithm performing less well. This result is reasonable, based upon the characteristics of the algorithms. Algorithm 2 (SAS), the best performer among the manual summaries, used the sum of all scores across events and judges; thus, it tapped into which sentences were most popular overall. Algorithm 3 (RR), also better than MEAD, used a round robin technique, which, similarly to the Lead-based results, tapped into the pyramid quality of news journalism. Algorithm 1 (HSA), poorest performer second to Random, used the highest score in any event by inter-judge score; its weakness was in negating both the benefits of the pyramid structure of the judges' sentence rankings, as well as the popularity of sentences across events.</Paragraph>
    </Section>
    <Section position="3" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
8.2 Compression Rate
</SectionTitle>
      <Paragraph position="0"> For extracts at the ten percent compression rate, Lead-based sets the upper, and random the lower, bound.</Paragraph>
      <Paragraph position="1"> However, the Sum of All Scores algorithm performed better at the twenty percent compression rate, beating Lead-based for best summaries. Each method produced better summaries overall at ten percent compression rate, except for Algorithm 2, which performed better at the twenty percent compression rate.</Paragraph>
      <Paragraph position="2"> We believe that SAS performed better at the twenty percent compression rate as a result of two characteristics: as the sum of scores across sub-events, this algorithm preferred both sentences that received higher scores, as well as sentences which were highly ranked most frequently. Therefore, it is weighted toward those sentences that carry information essential to several subevents. Because of these sentences' relevancy to more than one sub-event, they are most likely to be important to the majority of readers, regardless of the user's particular information task. This can also be seen as popularity weighting, with those sentences getting the most and best scores from judges producing the most useful summaries.</Paragraph>
      <Paragraph position="3"> The patterns uncovered by this result should be leveraged for future improvements to automatic summarizers.</Paragraph>
    </Section>
    <Section position="4" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
8.3 Lead-Based Summaries
</SectionTitle>
      <Paragraph position="0"> We were not extremely surprised to find that Lead-based summaries produced better summaries at the 10% summary rate. This result may be explained by the pyramid structure of news journalism, which, in a sense, pre-ranks document sentences in order of importance, in order to convey the most critical information first. As our corpus was comprised entirely of news articles, this effect could be exaggerated in our results. As expected, though, the Random summarizer set the lower bound.</Paragraph>
    </Section>
    <Section position="5" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
8.4 Manual Summaries and MEAD
</SectionTitle>
      <Paragraph position="0"> Most significantly, among the mid-range performers, the data demonstrates what we expected to find: Two of the three new sub-event-based algorithms perform better than MEAD. Identifying sub-events in news topic coverage is one method that we have shown can be utilized to help create better summaries.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="3" end_page="3" type="metho">
    <SectionTitle>
9. Automatic Clustering and Extraction
</SectionTitle>
    <Paragraph position="0"> In our second experiment, we were interested to see how the different strategies would work with a simple clustering-based multi-document summarizer. We did not expect our clustering algorithm to neatly partition the data according to the subevents we identified in our first experiment, but we did want to see if our findings about SAS would hold true for automatically partitioned data.</Paragraph>
    <Paragraph position="1"> And so we turned to sentence clustering. While Boros et al. (2001) report poor performance but some promise to this method, Hatzivassiloglou et al. (2001) have exploited clustering with very good results in SimFinder. Both rely on the RR method, although SimFinder considers several other important factors in sentence selection.</Paragraph>
    <Section position="1" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
9.1 Automatic Clustering
</SectionTitle>
      <Paragraph position="0"> Because of the vast number of variables associated with designing a cluster-based summarization algorithm, we chose to limit our system so that we could focus on RR, HSA and SAS. To give a sense of our performance, we also ran a purely centroid-based summarization algorithm.</Paragraph>
      <Paragraph position="1"> We used K-means clustering, and obtained results for K = 2-20, at both the 10% and 20% summary levels. By this process, we created K clusters, seeded them as discussed below, and then for each sentence, we found that cluster to which the sentence was closest. After filling the clusters, we checked again to see if each sentence was in its best cluster. We kept doing this until equilibrium was reached (usually no more than 6 cycles).</Paragraph>
      <Paragraph position="2"> For our similarity metric we used the cosine measure with inverse document frequency (IDF), inverse sentence frequency (ISF) (following Neto et al. (2000) and no term-weighting. We ran all of these permutations twice, once ignoring sentences with 9 words or fewer (as is MEAD's default) and once ignoring sentences with 2 words or just 1. We did not use stop words, stemming, or syntactic parsing. Further, we did not factor in the location of the sentences in their original documents, although both MEAD and SimFinder do this.</Paragraph>
      <Paragraph position="3"> Initially, we used a method of randomly seeding the clusters, but we found this method extremely unstable. We then devised the following method: 1) for the first cluster, find the sentence which is closest to the centroid of the document cluster, 2) for each sentence after that, find the sentence which is maximally different from those sentences already picked as seeds.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="3" type="sub_section">
      <SectionTitle>
9.2 Automatic Extraction
</SectionTitle>
      <Paragraph position="0"> After creating the clusters by this method, we extracted sentences with the same three methods of interest, HSA, SAS, and RR. For this experiment, we also added a simple Centroid policy. Under this policy, a centroid vector was created for all of the sentences, and then for each sentence the cosine measure was computed against the centroid. The sentences were then sorted by their cosine scores with the centroid. The top 10% or 20% were selected for the summary.</Paragraph>
      <Paragraph position="1"> For all policies, the extraction algorithm would not select a sentence which had a cosine of 0.99 or higher with any sentence already in the summary. For comparison, MEAD's default is 0.7. In the future, we would like to study the effect of this parameter on information diversity.</Paragraph>
    </Section>
  </Section>
  <Section position="10" start_page="3" end_page="3" type="metho">
    <SectionTitle>
10. Results for Automatic Clustering
</SectionTitle>
    <Paragraph position="0"> In Table 4, we report our findings from the second experiment. This table presents the average of the performances across all of the clustering options (2 clusters to 20 clusters) for the specified parameters. In general for a 10% summary, the SAS method outperforms the other methods, leading Centroid by only a small amount. At the 20% level, the Centroid policy beats all other algorithms, although SAS with ISF and a 2-word sentence minimum comes close.</Paragraph>
    <Paragraph position="1"> Some other interesting findings emerge from this table as well, namely term-weighting seems beneficial for all methods except for HSA, and ISF seems generally more beneficial for SAS and Centroid than for RR or HSA.</Paragraph>
    <Paragraph position="2"> There was marked variation in results depending on how many clusters were initially selected. In Table 5, we present our findings for the overall best parameter settings (the top 10 performers at the 10% and 20% summary levels, scored for the SAS, RR, HSA, and Centroid policies). As can be seen, SAS is the most common policy: it appears in 22 of the top 25 combinations at the 10% compression level and in 20 of 25 at the 20% compression level.</Paragraph>
    <Paragraph position="3"> Tables 4 and 5, taken together, suggest that SAS should be leveraged to improve performance over the pure centroid method. More work needs to be done to determine the appropriate number of clusters to begin with, but it is interesting that there appears to be an inverse relationship: the smaller summary seems to benefit from small, tightly packed clusters, while the larger summary benefits from a few noisy clusters.</Paragraph>
  </Section>
class="xml-element"></Paper>