<?xml version="1.0" standalone="yes"?> <Paper uid="W00-0403"> <Title>Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies</Title>
<Section position="4" start_page="21" end_page="22" type="metho"> <SectionTitle> 2 Informational content of sentences </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 2.1 Cluster-based sentence utility (CBSU) </SectionTitle> <Paragraph position="0"> Cluster-based sentence utility (CBSU, or utility) refers to the degree of relevance (from 0 to 10) of a particular sentence to the general topic of the entire cluster (for a discussion of what constitutes a topic, see [Allan et al., 1998]). A utility of 0 means that the sentence is not relevant to the cluster, and a 10 marks an essential sentence.</Paragraph> </Section>
<Section position="2" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 2.2 Cross-sentence informational subsumption (CSIS) </SectionTitle> <Paragraph position="0"> A notion related to CBSU is cross-sentence informational subsumption (CSIS, or subsumption), which reflects the fact that certain sentences repeat some of the information present in other sentences and may, therefore, be omitted during summarization. If the information content of sentence a (denoted as i(a)) is contained within sentence b, then a becomes informationally redundant and the content of b is said to subsume that of a: i(a) ⊆ i(b). In the example below, (2) subsumes (1) because the crucial information in (1) is also included in (2), which presents additional content: &quot;the court&quot;, &quot;last August&quot;, and &quot;sentenced him to life&quot;. (1) John Doe was found guilty of the murder.</Paragraph> <Paragraph position="1"> (2) The court found John Doe guilty of the murder of Jane Doe last August and sentenced him to life. The cluster shown in Figure 1 shows subsumption links across two articles 1 about recent terrorist activities in Algeria (ALG 18853 and ALG 18854). An arrow from sentence A to sentence B indicates that the information content of A is subsumed by the information content of B. Sentences 2, 4, and 5 from the first article repeat the information from sentence 2 in the second article, while sentence 9 from the former article is later repeated in sentences 3 and 4 of the latter article.</Paragraph> <Paragraph position="2"> 1 The full text of these articles is shown in the Appendix.</Paragraph> </Section>
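The subsumption relation lends itself to a simple graph representation: a directed link from a sentence to each sentence that subsumes it, with mutually subsuming sentences forming the equivalence classes discussed in the next subsection. The following is a minimal, illustrative sketch of such a representation, not the authors' code; the SubsumptionGraph class, its method names, and the article/sentence identifiers are hypothetical.

from collections import defaultdict

class SubsumptionGraph:
    """Directed subsumption links: a -> b means i(a) is contained in i(b)."""

    def __init__(self):
        self.links = defaultdict(set)   # sentence -> set of sentences that subsume it

    def add_link(self, a, b):
        """Record that sentence a is subsumed by sentence b."""
        self.links[a].add(b)

    def subsumes(self, b, a):
        """True if b subsumes a, i.e. i(a) is contained in i(b)."""
        return b in self.links[a]

    def same_equivalence_class(self, a, b):
        """Mutually subsuming sentences are interchangeable."""
        return self.subsumes(a, b) and self.subsumes(b, a)

    def redundant(self, sentence, selected):
        """A candidate adds little if some already selected sentence subsumes it."""
        return any(self.subsumes(s, sentence) for s in selected)

# Mirroring Figure 1: sentences 2, 4, and 5 of ALG18853 are subsumed by
# sentence 2 of ALG18854 (identifiers are hypothetical labels).
g = SubsumptionGraph()
for s in ("ALG18853/2", "ALG18853/4", "ALG18853/5"):
    g.add_link(s, "ALG18854/2")
print(g.redundant("ALG18853/4", {"ALG18854/2"}))  # True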
<Section position="3" start_page="21" end_page="21" type="sub_section"> <SectionTitle> 2.3 Equivalence classes of sentences </SectionTitle> <Paragraph position="0"> Sentences subsuming each other are said to belong to the same equivalence class. An equivalence class may contain more than two sentences within the same or different articles. In the following example, although sentences (3) and (4) are not exact paraphrases of each other, they can be substituted for each other without crucial loss of information and therefore belong to the same equivalence class, i.e., i(3) ⊆ i(4) and i(4) ⊆ i(3). In the user study section we will take a look at the way humans perceive CSIS and equivalence classes.</Paragraph> <Paragraph position="2"> (3) Eighteen decapitated bodies have been found in a mass grave in northern Algeria, press reports said Thursday.</Paragraph> <Paragraph position="3"> (4) Algerian newspapers have reported on Thursday that 18 decapitated bodies have been found by the authorities.</Paragraph> </Section>
<Section position="4" start_page="21" end_page="22" type="sub_section"> <SectionTitle> 2.4 Comparison with MMR </SectionTitle> <Paragraph position="0"> Maximal marginal relevance (or MMR) is a technique similar to CSIS and was introduced in [Carbonell and Goldstein, 1998]. In that paper, MMR is used to produce summaries of single documents that avoid redundancy. The authors mention that their preliminary results indicate that multiple documents on the same topic also contain redundancy, but they stop short of using MMR for multi-document summarization. Their metric is used as an enhancement to a query-based summary, whereas CSIS is designed for query-independent (a.k.a. generic) summaries.</Paragraph> <Paragraph position="1"> We now describe the corpus used for the evaluation of MEAD, and later in this section we present MEAD's algorithm.</Paragraph> </Section>
<Section position="5" start_page="22" end_page="22" type="sub_section"> <SectionTitle> 3.1 Description of the corpus </SectionTitle> <Paragraph position="0"> For our experiments, we prepared a small corpus consisting of a total of 558 sentences in 27 documents, organized in 6 clusters (Table 1), all extracted by CIDR. Four of the clusters are from Usenet newsgroups. The remaining two clusters are from the official TDT corpus 2. Among the factors for our selection of clusters are: coverage of as many news sources as possible, coverage of both TDT and non-TDT data, coverage of different types of news (e.g., terrorism, internal affairs, and environment), and diversity in cluster sizes (in our case, from 2 to 10 articles). The test corpus is used in the evaluation in such a way that each cluster is summarized at 9 different compression rates, thus giving nine times as many sample points as one would expect from the size of the corpus.</Paragraph> </Section>
<Section position="6" start_page="22" end_page="22" type="sub_section"> <SectionTitle> 3.2 Cluster centroids </SectionTitle> <Paragraph position="0"> Table 2 shows a sample centroid, produced by CIDR [Radev et al., 1999] from cluster A. The &quot;count&quot; column indicates the average number of occurrences of a word across the entire cluster. The IDF values were computed from the TDT corpus. A centroid, in this context, is a pseudo-document consisting of the words whose Count*IDF scores are above a pre-defined threshold in the documents that constitute the cluster. CIDR computes Count*IDF in an iterative fashion, updating its values as more articles are inserted in a given cluster. We hypothesize that sentences that contain the words from the centroid are more indicative of the topic of the cluster.</Paragraph> </Section> </Section>
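As a rough illustration of the centroid construction described above, the sketch below builds a pseudo-document from the words whose average count times IDF exceeds a threshold, and scores a sentence by summing the centroid values of the words it contains. This is not CIDR itself: the tokenization, the threshold value, the idf lookup, and the function names are assumptions.

from collections import Counter

def build_centroid(documents, idf, threshold=3.0):
    """documents: list of token lists (one per article); idf: word -> IDF value."""
    counts = Counter()
    for doc in documents:
        counts.update(doc)
    n_docs = len(documents)
    centroid = {}
    for word, count in counts.items():
        avg_count = count / n_docs              # average occurrences across the cluster
        score = avg_count * idf.get(word, 0.0)  # Count*IDF
        if score > threshold:
            centroid[word] = score              # keep only words above the threshold
    return centroid

def centroid_value(sentence_tokens, centroid):
    """One natural sentence score: the sum of the centroid values of its words."""
    return sum(centroid.get(w, 0.0) for w in sentence_tokens)

# Toy usage with made-up counts and IDF values.
docs = [["algeria", "bodies", "found"], ["algeria", "police", "found"]]
idf = {"algeria": 6.0, "bodies": 7.0, "found": 2.0, "police": 5.0}
print(build_centroid(docs, idf))  # {'algeria': 6.0, 'bodies': 3.5}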
<Section position="5" start_page="22" end_page="23" type="metho"> <Paragraph position="0"> 2 The selection of Cluster E is due to an idea by the participants in the Novelty Detection Workshop, led</Paragraph>
<Section position="1" start_page="22" end_page="23" type="sub_section"> <SectionTitle> 3.3 Centroid-based algorithm </SectionTitle> <Paragraph position="0"> MEAD decides which sentences to include in the extract by ranking them according to a set of parameters. The input to MEAD is a cluster of articles (e.g., extracted by CIDR) and a value for the compression rate r. For example, if the cluster contains a total of 50 sentences (n = 50) and the value of r is 20%, the output of MEAD will contain 10 sentences. Sentences are laid out in the same order as they appear in the original documents, with documents ordered chronologically. We benefit here from the time stamps associated with each document.</Paragraph> <Paragraph position="1"> SCORE(s_i) = w_c * C_i + w_p * P_i, where i (1 <= i <= n) is the sentence number within the cluster.</Paragraph> <Paragraph position="2"> INPUT: Cluster of d documents 3 with n sentences (compression rate = r). 3 Note that currently, MEAD requires that sentence boundaries be marked.</Paragraph> <Paragraph position="3"> The system performance S is one of the numbers 6 described in the previous subsection. For {13}, the value of S is 0.627 (which is lower than random). For {14}, S is 0.833, which is between R and J. In the example, only two of the six possible sentence selections, {14} and {24}, are between R and J. Three others, {13}, {23}, and {34}, are below R, while {12} is better than J.</Paragraph> <Paragraph position="4"> 4.2.4 Normalized system performance (D) To restrict system performance (mostly) between 0 and 1, we use a mapping between R and J such that when S = R, the normalized system performance D is equal to 0, and when S = J, D is equal to 1: D = (S - R) / (J - R). 7</Paragraph> <Paragraph position="6"> Figure 2 shows the mapping between system performance S on the left (a) and normalized system performance D on the right (b). A small part of the 0-1 segment (the part between R and J) is mapped to the entire 0-1 segment; therefore the difference between two systems performing at, e.g., 0.785 and 0.812 can be magnified after normalization. Example: the normalized system performance for the {14} system then becomes (0.833 - 0.732) / (0.841 - 0.732), or 0.927. Since the score is close to 1, the {14} system is almost as good as the interjudge agreement. The normalized system performance for the {24} system is similarly (0.837 - 0.732) / (0.841 - 0.732), or 0.963. Of the two systems, {24} outperforms {14}.</Paragraph> <Paragraph position="7"> 7 The formula is valid when J > R (that is, the judges agree among each other better than randomly).</Paragraph> </Section>
<Section position="2" start_page="23" end_page="23" type="sub_section"> <SectionTitle> 4.3 Using CSIS to evaluate multi-document summaries </SectionTitle> <Paragraph position="0"> To use CSIS in the evaluation, we introduce a new parameter, E, which tells us how much to penalize a system that includes redundant information. In the example from Table 7 (arrows indicate subsumption), a summarizer with r = 20% needs to pick 2 out of 12 sentences. Suppose that it picks 1/1 and 2/1 (in bold).</Paragraph> <Paragraph position="2"> The second of these sentences is then penalized, as it is subsumed by the first sentence. By varying E between 0 and 1, the evaluation may favor or ignore subsumption.</Paragraph> </Section> </Section>
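A short sketch can make the two evaluation ingredients above concrete: the normalization D = (S - R) / (J - R), which reproduces the worked example, and the redundancy parameter E. How exactly E discounts the utility of a subsumed sentence is not spelled out in this section, so multiplying that utility by (1 - E) below is an assumption, not the authors' exact formulation; the function names are also illustrative.

def normalized_performance(S, R, J):
    """Map raw system performance S to D in [0, 1]; valid only when J > R."""
    return (S - R) / (J - R)

def penalized_utility(utilities, selected, subsumed_by, E):
    """Sum the utilities of the selected sentences, discounting a sentence that
    is subsumed by another selected sentence by the factor (1 - E) (assumed)."""
    total = 0.0
    for s in selected:
        u = utilities[s]
        if any(t in selected for t in subsumed_by.get(s, ())):
            u *= (1.0 - E)   # redundant content contributes less
        total += u
    return total

# Worked example from the text: S = 0.833, R = 0.732, J = 0.841 gives D of about 0.927.
print(round(normalized_performance(0.833, 0.732, 0.841), 3))  # 0.927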
<Section position="6" start_page="23" end_page="27" type="metho"> <SectionTitle> 5 User studies and system evaluation </SectionTitle> <Paragraph position="0"> We ran two user experiments. First, six judges were each given six clusters and asked to ascribe an importance score from 0 to 10 to each sentence within a particular cluster. Next, five judges had to indicate, for each sentence, which other sentence(s), if any, it subsumes.</Paragraph>
<Section position="1" start_page="23" end_page="26" type="sub_section"> <SectionTitle> 5.1 CBSU: interjudge agreement </SectionTitle> <Paragraph position="0"> Using the techniques described in Section 0, we computed the cross-judge agreement (J) for the 6 clusters for various r (Figure 3). Overall, interjudge agreement was quite high. An interesting drop in interjudge agreement occurs for 20-30% summaries.</Paragraph> <Paragraph position="1"> The drop most likely results from the fact that 10% summaries are typically easier to produce because the few most important sentences in a cluster are easier to identify.</Paragraph> <Paragraph position="2"> We should note that both annotation tasks were quite time consuming and frustrating for the users, who took anywhere from 6 to 10 hours each to complete their part.</Paragraph> </Section>
<Section position="2" start_page="26" end_page="26" type="sub_section"> <SectionTitle> 5.2 CSIS: interjudge agreement </SectionTitle> <Paragraph position="0"> In the second experiment, we asked users to indicate all cases in which, within a cluster, a sentence is subsumed by another. The judges' data on the first seven sentences of cluster A are shown in Table 8.</Paragraph> <Paragraph position="1"> The &quot;+ score&quot; indicates the number of judges who agree on the most frequent subsumption. The &quot;- score&quot; indicates that the consensus was no subsumption. We found relatively low interjudge agreement on the cases in which at least one judge indicated evidence of subsumption. Overall, out of 558 sentences, there was full agreement (5 judges) on 292 sentences (Table 9). Unfortunately, in 291 of these 292 sentences the agreement was that there is no subsumption. When the bar of agreement was lowered to four judges, 23 out of 406 agreements were on sentences with subsumption. Overall, out of 80 sentences with subsumption, only 24 had an agreement of four or more judges. However, in 54 cases at least three judges agreed on the presence of a particular instance of subsumption.</Paragraph> <Paragraph position="2"> In conclusion, we found very high interjudge agreement in the first experiment and moderately low agreement in the second experiment. We concede that the time necessary to do a proper job on the second task is partly to blame.</Paragraph> </Section>
<Section position="3" start_page="26" end_page="27" type="sub_section"> <SectionTitle> 5.3 Evaluation of MEAD </SectionTitle> <Paragraph position="0"> Since the baseline of random sentence selection is already included in the evaluation formulae, we used the Lead-based method (selecting the positionally first (n*r/c) sentences from each cluster, where c = number of clusters) as the baseline to evaluate our system.</Paragraph>
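Read literally, the Lead baseline takes the positionally first n*r/c sentences of a cluster. The sketch below follows that literal reading; the rounding rule, the function signature, and the per-cluster bookkeeping are assumptions, not the authors' implementation.

from math import ceil

def lead_baseline(cluster_sentences, r, c):
    """cluster_sentences: sentences of one cluster in document order, with
    documents ordered chronologically. Returns the lead extract."""
    n = len(cluster_sentences)
    k = max(1, ceil(n * r / c))   # budget per the n*r/c description above
    return cluster_sentences[:k]

# Usage: a 50-sentence cluster at r = 20% with c = 6 clusters yields 2 sentences.
print(lead_baseline([f"s{i}" for i in range(1, 51)], r=0.20, c=6))  # ['s1', 's2']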
<Paragraph position="1"> In Table 10 we show the normalized performance (D) of MEAD for the six clusters at nine compression rates. MEAD performed better than Lead in 29 (in bold) out of 54 cases. Note that for the largest cluster, Cluster D, MEAD outperformed Lead at all compression rates.</Paragraph> <Paragraph position="3"> We then modified the MEAD algorithm to include lead information as well as centroids (see Section 0).</Paragraph> <Paragraph position="4"> In this case, MEAD+Lead performed better than the Lead baseline in 41 cases. We are in the process of running experiments with other SCORE formulas.</Paragraph> </Section>
<Section position="4" start_page="27" end_page="27" type="sub_section"> <SectionTitle> 5.4 Discussion </SectionTitle> <Paragraph position="0"> It may seem that utility-based evaluation requires too much effort and is prone to low interjudge agreement.</Paragraph> <Paragraph position="1"> We believe that our results show that interjudge agreement is quite high. As for the amount of effort required, we believe that the larger effort on the part of the judges is more or less compensated for by the ability to evaluate summaries off-line and at variable compression rates. Alternative evaluation methods do not offer this flexibility. We should concede that a utility-based approach is probably not feasible for query-based summaries, as these are typically produced only on-line.</Paragraph> <Paragraph position="2"> We discussed the possibility of a sentence contributing negatively to the utility of another sentence due to redundancy. We should also point out that sentences can reinforce one another positively. For example, if a sentence mentioning a new entity is included in a summary, one might also want to include a sentence that puts the entity in the context of the rest of the article or cluster.</Paragraph> </Section> </Section> </Paper>