<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1077">
  <Title>Corpus and Evaluation Measures for Multiple Document Summarization with Multiple Sources</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 TSC3 Corpus
2.1 Guidelines for Corpus Construction
</SectionTitle>
    <Paragraph position="0"> Multiple document summarization from multiple sources, i.e., several newspapers concerned with the same topic but with different publishers, is more difficult than single document summarization since it must deal with more text (in terms of numbers of characters and sentences). Moreover, it is peculiar to multiple document summarization that the summarization system must decide how much redundant information should be deleted3.</Paragraph>
    <Paragraph position="1"> In a single document, there will be few sentences with the same content. In contrast, in multiple documents with multiple sources, there will be many sentences that convey the same content with different words and phrases, or even identical sentences. Thus, a text summarization system needs to recognize such redundant sentences and reduce the redundancy in the output summary.</Paragraph>
    <Paragraph position="2"> However, we have no way of measuring the effectiveness of such redundancy in the corpora for DUC and TSC2. Key data in TSC2 was given as abstracts (free summaries) whose number of characters was less than a fixed number and, thus, it is difficult to use for repeated or automatic evaluation, and for the extraction of important sentences. Moreover, in DUC, where most of the key data were abstracts whose number of words was less than a 3It is true that we need other important techniques such as those for maintaining the consistency of words and phrases that refer to the same object, and for making the results more readable; however, they are not included here.</Paragraph>
    <Paragraph position="3"> fixed number, the situation was the same as TSC2.</Paragraph>
    <Paragraph position="4"> At DUC 2002, extracts (important sentences) were used, and this allowed us to evaluate sentence extraction. However, it is not possible to measure the effectiveness of redundant sentences reduction since the corpus was not annotated to show sentence with same content. In addition, this is the same even if we use the SummBank corpus (Radev et al., 2003).</Paragraph>
    <Paragraph position="5"> In any case, because many of the current summarization systems for multiple documents are based on sentence extraction, we believe these corpora to be unsuitable as sets of documents for evaluation.</Paragraph>
    <Paragraph position="6"> On this basis, in TSC3, we assumed that the process of multiple document summarization consists of the following three steps, and we produce a corpus for the evaluation of the system at each of the three steps4.</Paragraph>
    <Paragraph position="7"> Step 1 Extract important sentences from a given set of documents Step 2 Minimize redundant sentences from the result of Step 1 Step 3 Rewrite the result of Step 2 to reduce the size of the summary to the specified number of characters or less.</Paragraph>
    <Paragraph position="8"> We have annotated not only the important sentences in the document set, but also those among them that have the same content. These are the corpora for steps 1 and 2. We have prepared human-produced free summaries (abstracts) for step 3. In TSC3, since we have key data (a set of correct important sentences) for steps 1 and 2, we conducted automatic evaluation using a scoring program. We adopted an intrinsic evaluation by human judges for step 3, which is currently under evaluation. We provide details of the extracts prepared for steps 1 and 2 and their evaluation measures in the following sections. We do not report the overall evaluation results for TSC3.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Data Preparation for Sentence Extraction
</SectionTitle>
      <Paragraph position="0"> We begin with guidelines for annotating important sentences (extracts). We think that there are two kinds of extract.</Paragraph>
      <Paragraph position="1">  2. A set of sentences that are suitable as a source  for producing an abstract, i.e., a set of sentences in the original documents that correspond to the sentences in the abstracts(Kupiec et al., 1995; Teufel and Moens, 1997; Marcu, 1999; Jing and McKeown, 1999).</Paragraph>
      <Paragraph position="2"> When we consider how summaries are produced, it seems more natural to identify important segments in the document set and then produce summaries by combining and rephrasing such information than to select important sentences and revise them as summaries. Therefore, we believe that second type of extract is superior and thus we prepared the extracts in that way.</Paragraph>
      <Paragraph position="3"> However, as stated in the previous section, with multiple document summarization, there may be more than one sentence with the same content, and thus we may have more than one set of sentences in the original document that corresponds to a given sentence in the abstract; that is to say, there may be more than one key datum for a given sentence in the abstract5.</Paragraph>
      <Paragraph position="4"> we have two sets of sentences that correspond to sentence a0 in the abstract.</Paragraph>
      <Paragraph position="5">  (1) a1a3a2 of document a4 , or (2) a combination of a1a6a5 and a1a8a7 of document a9  This means that a1a3a2 alone is able to produce a0 , and a0 can also be produced by combining a1 a5 and a1 a7 (Figure 1).</Paragraph>
      <Paragraph position="6"> We marked all the sentences in the original documents that were suitable sources for producing the sentences of the abstract, and this made it possible for us to determine whether or not a summarization system deleted redundant sentences correctly at Step 2. If the system outputs the sentences in the original documents that are annotated as corresponding to the same sentence in the abstract, it  has redundancy. If not, it has no redundancy. Returning to the above example, if the system outputs a1a32a2 ,a1a20a5 ,and a1a8a7 , they all correspond to sentence a0 in the abstract, and thus it is redundant.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Evaluation Metrics
</SectionTitle>
    <Paragraph position="0"> We use both intrinsic and extrinsic evaluation. The intrinsic metrics are &amp;quot;Precision&amp;quot;, &amp;quot;Coverage&amp;quot; and &amp;quot;Weighted Coverage.&amp;quot; The extrinsic metric is</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Intrinsic Metrics
3.1.1 Number of Sentences the System Should Extract
</SectionTitle>
      <Paragraph position="0"> Precision and Recall are generally used as evaluation matrices for sentence extraction, and we used the PR Breaking Point (Precision = Recall) for the evaluation of extracts in TSC1 (Fukusima and Okumura, 2001). This means that we evaluate systems when the number of sentences in the correct extract is given. Moreover, in TSC3 we assume that the number of sentences to be extracted is known and we evaluate the system output that has the same number of sentences.</Paragraph>
      <Paragraph position="1"> However, it is not as easy to decide the number of sentences to be extracted in TSC3 as in TSC1. We assume that there are correspondences between sentences in original documents and their abstract as in  the sets of corresponding sentences in the table. As shown in the table, we often see several sets of sentences that correspond to a sentence in the abstract in multiple document summarization.</Paragraph>
      <Paragraph position="2"> An 'extract' here is a set of sentences needed to produce the abstract. For instance, we can obtain 'extracts' such as &amp;quot;a1a32a2 ,a1a20a7 ,a1a20a33 ,a1a20a34 ,a1a8a7a20a35 ,a1a20a34a8a35 &amp;quot;, and &amp;quot;a1a32a2a36a35 ,a1a37a2a8a2 ,a1a20a7 ,a1a20a33 ,a1a20a34 ,a1a20a5a8a35 ,a1a20a5a32a2 ,a1a20a5a8a7 &amp;quot; from Table 1 6. Often there are several 'extracts' and we must determine which of these is the best. In such cases, we define the 'correct extract' as the set with the least number of sentences needed to produce the abstract because it is desirable to convey the maximum amount of information with the least number of sentences.</Paragraph>
      <Paragraph position="3"> Finding the minimum set of sentences to produce the abstract amounts to solving the constraint sat6In fact, it is possible to produce the abstract with other sentence combinations.</Paragraph>
      <Paragraph position="4"> isfaction problem. In the example in Table 1, we obtain the following constraints from each sentence in the abstract:</Paragraph>
      <Paragraph position="6"> With these conditions, we now find the minimum set that makes all the conjunctions true. We need to find the minimum set that makes a39 a2a57a48 a39 a5a47a48  a1a3a2a6a66a32a1a8a7a67a66a37a1a6a33a55a66a37a1a8a34a67a66a37a1a6a7a20a35a68a66a37a1a8a34a8a35a67a69 , and so the system should extract six sentences.</Paragraph>
      <Paragraph position="7"> In TSC3, we computed the number of sentences that the system should extract and then evaluated the system outputs, which must have the same number of sentences, with the following precision and coverage. null  Precision is the ratio of how many sentences in the system output are included in the set of the corresponding sentences. It is defined by the following equation.</Paragraph>
      <Paragraph position="8"> Precision a70a72a71a73a75a74 (1) where a76 is the least number of sentences needed to produce the abstract by solving the constraint satisfaction problem and a77 is the number of 'correct' sentences in the system output, i.e., the sentences that are included in the set of corresponding sentences. For example, the sentences listed in Table 1 are 'correct.' If the system output is &amp;quot;a1a32a2a30a35a67a66a32a1a32a2a8a2a14a66a37a1a6a33a3a66a32a1a32a2a8a78a14a66a37a1a6a34a20a35a68a66a37a1a8a34a32a2 &amp;quot;, then the Precision is as follows:</Paragraph>
      <Paragraph position="10"> for &amp;quot;a1a32a2a88a66a32a1a32a2a30a35a68a66a37a1a3a2a20a2a14a66a32a1a8a7a67a66a37a1a6a33a3a66a32a1a8a34a20a35 &amp;quot;, the Precision is as follows: null</Paragraph>
      <Paragraph position="12"> Coverage is an evaluation metric for measuring how close the system output is to the abstract taking into account the redundancy found in the set of sentences in the output.</Paragraph>
      <Paragraph position="13"> The set of sentences in the original documents that corresponds correctly to the a92 -th sentence of the human-produced abstract is denoted here as</Paragraph>
      <Paragraph position="15"> a93a95a94a97a96a101 . In this case, we have a102 sets of corresponding sentences. Here, a93a95a94a97a96a100 indicates a set of elements each of which corresponds to the sentence number in the original documents, denoted as a93a95a94a97a96a100 a41 a65a55a103 a94a22a96a100a20a96a2a14a66 a103 a94a97a96a100a20a96a5a55a66a37a99a32a99a37a99a28a66 a103 a94a22a96a100a20a96a104 a66a32a99a37a99a32a99a69 . For instance, from Table 1, a93 a2</Paragraph>
      <Paragraph position="17"> Then, we define the evaluation score a63a32a46a36a92a14a49 for the a92 -th sentence in the abstract as equation (1).</Paragraph>
      <Paragraph position="18">  Function a63 returns 1 (one) when any a93a98a94a97a96a100 is outputed completely. Otherwise it returns a partial score according to the number of sentences a145a93a95a94a110a100 a145. Given function a63 and the number of sentences in the abstract a146 , Coverage is defined as follows:  Now we define 'Weighted Coverage' since each sentence in TSC3 is ranked A, B or C, where &amp;quot;A&amp;quot; is the best. This is similar to &amp;quot;Relative Utility&amp;quot; (Radev et al., 2003). We only use three ranks in order to limit the ranking cost. The definition is obtained by modifying equation (6).</Paragraph>
      <Paragraph position="19"> W.C. a70</Paragraph>
      <Paragraph position="21"> where a61a158a46a30a92a12a49 denotes the ranking of the a92 -th sentence of the abstract and a162a163a46a36a61a158a46a36a92a14a49a8a49 is its weight. a146a164a46a36a61 a0 a146a164a165a158a49 is the number of sentences whose ranking is a61</Paragraph>
      <Paragraph position="23"> the abstract. Suppose the first sentence is ranked A, the second B, and the third C in Table 1, and their weights are given as a162a57a46 a93 a49a155a41a168a167 ,a162a163a46a20a169a170a49a171a41a168a172a55a64a174a173 and</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Extrinsic Metrics
</SectionTitle>
      <Paragraph position="0"> Sometimes question-answering (QA) by human subjects is used for evaluation (Morris et al., 1992; Hirao et al., 2001). That is, human subjects judge whether predefined questions can be answered by reading only a machine generated summary. However, the cost of this evaluation is huge. Therefore, we employ a pseudo question-answering evaluation, i.e., whether a summary has an 'answer' to the question or not. The background to this evaluation is inspired by TIPSTER SUMMAC's QA track (Mani et al., 2002).</Paragraph>
      <Paragraph position="1"> For each document set, there are about five questions for a short summary and about ten questions for long summary. Note that the questions for the short summary are included in the questions for the long summary. Examples of questions for the topic &amp;quot;Release of SONY's AIBO&amp;quot; are as follows: &amp;quot;How much is AIBO?&amp;quot;, &amp;quot;When was AIBO sold?&amp;quot;, and &amp;quot;How many AIBO are sold?&amp;quot;.</Paragraph>
      <Paragraph position="2"> Now, we evaluate the summary from the 'exact match' and 'edit distance' for each question. 'Exact match' is a scoring function that returns one when the summary includes the answer to the question. 'Edit distance' measures whether the system's summary has strings that are similar to the answer strings. The score a179 e a180 based on the edit distance is normalized with the length of the sentence and the answer string so that the range of the score is [0,1]: Sed a70 length of the sentence a181 edit distancelength of the answer strings a84 (9) The score for a summary is the maximum value of the scores for sentences in the summary. The  score is 1 if the summary has a sentence that includes the whole answer string.</Paragraph>
      <Paragraph position="3"> It should be noted that the presence of answer strings in the summary does not mean that a human subject can necessarily answer the question.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Preliminary Experiment
</SectionTitle>
    <Paragraph position="0"> In order to examine whether our corpus is suitable for summarization evaluation, our evaluation measures significant information and redundancies in the system summaries.</Paragraph>
    <Paragraph position="1"> Below we provide the details of the corpus, evaluation results and effectiveness of the minimization of redundant sentences.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Description of Corpus
</SectionTitle>
      <Paragraph position="0"> According to the guidelines described in section two, we constructed extracts and abstracts of thirty sets of documents drawn from the Mainichi and Yomiuri newspapers published between 1998 to 1999, each of which was related to a certain topic.</Paragraph>
      <Paragraph position="1"> First, we prepared abstracts (their sizes were 5% and 10% of the total number of the characters in the document set), then produced extracts using the abstracts. Table 2 shows the statistics.</Paragraph>
      <Paragraph position="2"> One document set consists of about 10 articles on average, and the almost same number of articles were taken from the Mainichi newspaper and the Yomiuri newspaper. Most of the topics are classified into a single-event according to McKeown (2001).</Paragraph>
      <Paragraph position="3"> The following list contains all the topics.</Paragraph>
      <Paragraph position="4">  0310 Two-and-half-million-year old new hominid species found in Ethiopia.</Paragraph>
      <Paragraph position="5"> 0320 Acquisition of IDC by NTT (and C&amp;W).</Paragraph>
      <Paragraph position="6"> 0340 Remarketing of game software judged legal by Tokyo District Court.</Paragraph>
      <Paragraph position="7"> 0350 Night landing practice of carrier-based aircrafts of the Independence.</Paragraph>
      <Paragraph position="8"> 0360 Simultaneous bombing of the US Embassies in Tanzania and Kenya.</Paragraph>
      <Paragraph position="9"> 0370 Resignation of President Suharto.</Paragraph>
      <Paragraph position="10"> 0380 Nomination of Mr. Putin as Russian prime minister. 0400 Osama bin Laden provided shelter by Taliban regime in Afghanistan.</Paragraph>
      <Paragraph position="11"> 0410 Transfer of Nakata to A.C. Perugia.</Paragraph>
      <Paragraph position="12"> 0420 Release of Dreamcast.</Paragraph>
      <Paragraph position="13"> 0440 Existence of Japanese otter confirmed.</Paragraph>
      <Paragraph position="14"> 0450 Kyocera Corporation makes Mita Co. Ltd. its subsidiary. 0460 Five-story pagoda at Muroji Temple damaged by typhoon. null 0470 Retirement of aircraft YS-11.</Paragraph>
      <Paragraph position="15"> 0480 Test observation of astronomical telescope 'Subaru' started.</Paragraph>
      <Paragraph position="16"> 0500 Dolly the cloned sheep.</Paragraph>
      <Paragraph position="17"> 0510 Mass of neutrinos.</Paragraph>
      <Paragraph position="18"> 0520 Human Genome Project finishes decoding of the 22nd chromosome.</Paragraph>
      <Paragraph position="19"> 0530 Peace talks in Northern Ireland at the end of 1999. 0540 Debut of new model of bullet train (700 family). 0550 Mr. Yukio Aoshima decides not to run for gubernatorial election.</Paragraph>
      <Paragraph position="20"> 0560 Mistakes in entrance examination of Kansai University. 0570 Space shuttle Endeavour, from its launch to return. 0580 40 million-year-old fossil of new monkey species found by research group at Kyoto University.</Paragraph>
      <Paragraph position="21"> 0590 Dead body of George Mallory found on Mt. Everest. 0600 Release of SONY's AIBO.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Compared Extraction Methods
</SectionTitle>
      <Paragraph position="0"> We used the lead-based method, the TFa99 IDF-based method (Zechner, 1996) and the sequential pattern-based method (Hirao et al., 2003), and compared performance of these summarization methods on the TSC3 corpus.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Lead-based Method
</SectionTitle>
      <Paragraph position="0"> The documents in a test set were sorted in chronological and ascending order. Then, we extracted a sentence at a time from the beginning of each document and collected them to form a summary.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
TF·IDF-based Method
</SectionTitle>
    <Paragraph position="0"> The score of a sentence is the sum of the significant scores of each content word in the sentence. We therefore extracted sentences in descending order of importance score. The sentence score Stfidfa46a20a1 a94 a49 is defined by the following.</Paragraph>
    <Paragraph position="2"> a59a91a206a207a46a36a59a88a66a37a196a171a179a98a49 is the frequency of word a59 in the document set, a208a23a206a140a46a36a59a88a49 is the document frequency of a59 , and a145a196a155a169a52a145 is the total number of documents in the set. In fact, we computed these using all the articles published in the Mainichi and Yomiuri newspapers for the years 1998 and 1999.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Sequential Pattern-based Method
</SectionTitle>
      <Paragraph position="0"> The score of a sentence is the sum of the significant scores of each sequential pattern in the sentence. The patterns used for scoring were decided</Paragraph>
      <Paragraph position="2"> where a162a163a46a36a213a207a49 is defined as follows:</Paragraph>
      <Paragraph position="4"> a206a140a46a30a213a207a66a3a196a155a179a95a49 is the sentence frequency of pattern a213 in the document set and a206a140a46a30a213a207a66 a93 a179a98a49 is the sentence frequency of patterna213 in all topics. a145 a93 a179a114a145 is the number of sentences in all topics and a219a36a63a20a146a166a46a30a213a207a49 is the pattern length.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Evaluation Result
</SectionTitle>
      <Paragraph position="0"> Table 3 shows the intrinsic evaluation result. All methods have lower Coverage and Weighted Coverage scores than Precision scores. This means that the extracted sentences include redundant ones. In particular, the difference between &amp;quot;Precision&amp;quot; and &amp;quot;Coverage&amp;quot; is large in &amp;quot;Pattern.&amp;quot; Although both &amp;quot;Pattern&amp;quot; and &amp;quot;TFa99 IDF&amp;quot; outperform &amp;quot;Lead,&amp;quot; the difference between them is small. In addition, we know that &amp;quot;Lead&amp;quot; is a good extraction method for newspaper articles; however, this is not true for the TSC3 corpus.</Paragraph>
      <Paragraph position="1"> Table 4 shows the extrinsic evaluation results.</Paragraph>
      <Paragraph position="2"> Again, both &amp;quot;Pattern&amp;quot; and &amp;quot;TFa99 IDF&amp;quot; outperform &amp;quot;Lead&amp;quot;, but the difference between them is small.</Paragraph>
      <Paragraph position="3"> We found a correlation between the intrinsic and ex-</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 Effect of Redundant Sentence Minimization
</SectionTitle>
      <Paragraph position="0"> The experiment described in the previous section shows that a group of sentences extracted in a simple way includes many redundant sentences. To examine the effectiveness of minimizing redundant sentences, we compare the Maximal Marginal Relevance (MMR) based approach (Carbonell and Goldstein, 1998) with the clustering approach (Nomoto and Matsumoto, 2001). We use 'cosine similarity' with a bag-of-words representation for the similarity measure between sentences.</Paragraph>
      <Paragraph position="1"> Clustering-based Approach After computing importance scores using equations (10) and (12), we conducted hierarchical clustering using Ward's method until we reached a76 (see Section 3.1.1) clusters for the first a175a55a76 sentences. Then, we extracted the sentence with the highest score from each cluster.</Paragraph>
      <Paragraph position="2"> Table 5 shows the results of the intrinsic evaluation and Table 6 shows the results of the extrinsic evaluation. By comparison with Table 3, the clustering-based approach resulted in TFa99 IDF and Pattern scoring low in Precision, but high in Coverage. When comparing Table 4 with Table 6, the score is improved in most cases. These results imply that redundancy minimization is effective for improving the quality of summaries.</Paragraph>
      <Paragraph position="3"> MMR-based Approach After computing importance scores using equations (10) and (12), we re-ranked the first a175a68a76 sentences by MMR and extracted the first a76 sentences.</Paragraph>
      <Paragraph position="4"> Table 7 and 8 show the intrinsic and extrinsic evaluation results, respectively. We can see the effectiveness of redundancy minimization by MMR.</Paragraph>
      <Paragraph position="5"> Notably, in most cases, there is a large improvement in both the intrinsic and extrinsic evaluation results as compared with clustering.</Paragraph>
      <Paragraph position="6">  These results show that redundancy minimization has a significant effect on multiple document summarization. null</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>