<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0503"> <Title>Multi-document summarization using off the shelf compression software</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Description of the method </SectionTitle> <Paragraph position="0"> The aim of this study was to determine whether gzip is effective as a summarization tool when used in conjunction with an existing summarizer. We chose MEAD, a public-domain summarization system that can be downloaded from the Internet (Radev et al., 2002). The version of MEAD used in this experiment was 3.07.</Paragraph> <Paragraph position="1"> To produce a summary with a target length of n + 1 sentences, we perform the following steps: 1. First, get MEAD to create a summary of n sentences, where n is specified in advance. This summary will be called the base summary.</Paragraph> <Paragraph position="2"> 2. Compress the base summary using gzip. Record the length of the base summary before compression and the size in bytes of its compressed version. 3. Create all possible summaries of length n + 1 using the remaining sentences in the input cluster.</Paragraph> <Paragraph position="3"> 4. Compress all summaries using gzip.</Paragraph> <Paragraph position="4"> 5. Pick the summary that results in the greatest increase in F, where F is one of a number of metrics, as described in the rest of this section.</Paragraph> <Paragraph position="5"> Example: if a cluster had five sentences in total, and a user wanted to create a summary of one sentence from MEAD and one from gzip, then the program would start with the one sentence generated by MEAD and add each of the four remaining sentences in turn, making a total of five extracts. Four of these extracts would have two sentences and one would have only the sentence created by MEAD. 
After these extracts have been created, they are converted to summaries and the number of characters in each summary is calculated. Then the difference in length between each summary with the one extra sentence and the original MEAD-only summary is computed and stored. The next step is to gzip all of the summaries, compute the difference in size between each summary with the extra sentence and the original MEAD-only summary, and store this change in size. After all these steps have been executed, we have a list of all possible sentences, the number of characters they contain and the size increase they produce after being gzipped with the rest of the summary. Based on this information, we can choose the next sentence in the summary depending on which sentence increases the size of the gzipped summary the most or which sentence has the best size to length ratio.</Paragraph> <Paragraph position="6"> We originally considered six evaluation metrics to use in this study. When choosing the next sentence for an existing summary, all possible sentences were added to the summary one at a time. For each sentence, the increase in length of the summary and the increase in size of the gzipped summary were measured. From these two measurements we derived six policies.</Paragraph> <Paragraph position="7"> The top sizes policy picked the sentence which produced the greatest increase in the size of the summary when gzipped. The bot sizes policy picked the sentence which produced the smallest increase in the size of the summary when gzipped. The top lengths policy picked the sentence that increased the number of characters in the summary the most. The bot lengths policy picked the sentence that increased the number of characters in the summary the least. The top ratios policy picked the sentence that had the greatest (size increase)/(length increase) and the bot ratios policy picked the sentence that had the smallest (size increase)/(length increase). 
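As a rough illustration of the measurements and policies described above, the following Python sketch computes the two increases for each candidate sentence and then applies a policy. It is not the code used in the study: the function names, the underscore policy identifiers, joining sentences with a single space, and appending the candidate at the end of the base summary are all assumptions made for this example.

```python
import gzip
from dataclasses import dataclass


@dataclass
class Candidate:
    sentence: str
    delta_length: int  # increase in uncompressed length, in bytes
    delta_size: int    # increase in gzip-compressed size, in bytes

    @property
    def ratio(self) -> float:
        # (size increase)/(length increase); assumes a non-empty candidate
        return self.delta_size / self.delta_length


def gzip_size(text: str) -> int:
    """Size in bytes of the gzip-compressed UTF-8 encoding of text."""
    return len(gzip.compress(text.encode("utf-8")))


def measure(base: list[str], remaining: list[str]) -> list[Candidate]:
    """Add each remaining sentence to the base summary, one at a time."""
    base_text = " ".join(base)
    base_length = len(base_text.encode("utf-8"))
    base_size = gzip_size(base_text)
    candidates = []
    for sentence in remaining:
        # Assumption: the candidate is appended after the base sentences.
        text = " ".join(base + [sentence])
        candidates.append(
            Candidate(
                sentence,
                len(text.encode("utf-8")) - base_length,
                gzip_size(text) - base_size,
            )
        )
    return candidates


# Each policy pairs a chooser (max or min) with the measurement it ranks on.
POLICIES = {
    "top_sizes": (max, lambda c: c.delta_size),
    "bot_sizes": (min, lambda c: c.delta_size),
    "top_lengths": (max, lambda c: c.delta_length),
    "bot_lengths": (min, lambda c: c.delta_length),
    "top_ratios": (max, lambda c: c.ratio),
    "bot_ratios": (min, lambda c: c.ratio),
}


def pick(policy: str, candidates: list[Candidate]) -> Candidate:
    """Return the candidate selected by the named policy."""
    chooser, key = POLICIES[policy]
    return chooser(candidates, key=key)
```

In this sketch, a candidate that repeats the base summary verbatim produces only a small size increase, so bot sizes would prefer it, while top sizes would prefer a sentence whose wording is new. 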
The policies other than bot ratios, top lengths, and top sizes did not show promising preliminary results and so are not included in this paper. In addition, the top lengths policy does not really need gzip at all, and so it too is omitted from this paper. More information about the policies is given in the policies section.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 The clusters used </SectionTitle> <Paragraph position="0"> We performed our experiments on five clusters. A cluster is a group of articles all pertaining to one particular event or story. The same set of tests was carried out on each cluster independently of the others. The clusters are referred to here as the 125 cluster, 323 cluster, 46 cluster, 60 cluster and 1018 cluster. Their lengths in sentences were 232, 91, 344, 150, and 134, respectively.</Paragraph> <Paragraph position="1"> Clusters with such diverse lengths were purposely chosen to determine whether the quality of the summaries was in any way related to the length of the source material. The articles were taken from the Hong Kong News corpus provided by the Hong Kong SAR of the People's Republic of China (LDC catalog number LDC2000T46).</Paragraph> <Paragraph position="2"> This corpus contains 18,146 pairs of parallel documents in English and Chinese; in our case only the English ones were used. 
The clusters were created at the Johns Hopkins University Summer Workshop on Language Engineering 2002.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 An example </SectionTitle> <Paragraph position="0"> Figure 1 shows a 5-sentence summary produced by MEAD from Cluster 125 of the HK News Corpus.</Paragraph> <Paragraph position="1"> The uncompressed length of this summary is 797 bytes whereas its size after gzip compression is 451 bytes.</Paragraph> <Paragraph position="2"> (1) To ensure broadly the same registration standards to be applied to all drug treatment and rehabilitation centres, Mrs Lo said the proposed registration requirements to be introduced for non-medical drug treatment and rehabilitation centres would be similar to those provisions of Ca.165 which currently apply to medical drug treatment and rehabilitation centres.</Paragraph> <Paragraph position="3"> (2) Youths-at-Risk of Substance Abuse and Families of Abusers Given Priority in This Year's Beat Drugs Fund Projects (3) The Action Committee Against Narcotics (ACAN) Research Sub-committee has decided to commission two major research on treatment and rehabilitation for drug abusers in Hong Kong in 1999.</Paragraph> <Paragraph position="4"> (4) New Initiatives Despite Fall in Number of Reported Drug Abusers (5) Beat Drugs Fund Grants $16 million in Support of 29 Anti-Drug Projects Cluster 125 includes 10 documents with a total of 232 sentences. In our example, after five of them have already been included in the 5-sentence summary, there are still 227 candidates for the sixth sentence to include in a 6-sentence summary. 
As in the rest of the paper, we compare summaries of equal length produced by two different methods: either (a) all sentences are chosen by MEAD, or (b) some sentences are chosen by MEAD and the remaining sentences, up to the target length of the summary, are added by gzip.</Paragraph> <Paragraph position="5"> Figure 2 shows some statistics about these 227 sentences. Figure 3 contains the list of sentences included in the five-sentence base summary.</Paragraph> <Paragraph position="6"> Figure 4 shows the candidate sentences to be included by the different policies in their six-sentence extracts.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experimental setup </SectionTitle> <Paragraph position="0"> To test the benefit of gzip in the summarization process, extracts were created using a combination of MEAD and gzip. These extracts contained pointers to the actual sentences that would be included in the summary, but not the sentences themselves. A number of extracts were created with varying numbers of sentences per extract. [Figure caption fragment: ... number six in a six-sentence summary. LENGTHORIG is the length in bytes of the summary, consisting of the original five MEAD-generated sentences plus this candidate sentence, before compression. SIZEORIG+1AFTGZ is the length in bytes of the compressed summary. DELTALENGTH is the difference in uncompressed length (which is also the length of the candidate uncompressed sentence). DELTASIZE is the change in compressed size. RATIO is equal to ...] [Figure caption fragment: ... sentences in the base summary.]</Paragraph> <Paragraph position="1"> For these extracts, the number of sentences contributed by MEAD was incremented by ten starting at zero and the number of sentences contributed by gzip was incremented from one to ten, on top of the MEAD sentences. 
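The extract schedule described above can be sketched in Python as follows. This is an illustrative sketch rather than the original scripts: the function names and default arguments are hypothetical, and the rule in split is inferred from the worked examples given in this section (a 56-sentence extract has 50 MEAD sentences and 6 gzip sentences; a 110-sentence extract has 100 and 10).

```python
from typing import Iterator


def extract_schedule(
    max_mead: int = 100, step: int = 10, max_gzip: int = 10
) -> Iterator[tuple[int, int, int]]:
    """Yield (total, mead_count, gzip_count) for every extract size."""
    # MEAD contributes 0, 10, 20, ... sentences; gzip adds 1..max_gzip on top.
    for mead_count in range(0, max_mead + 1, step):
        for gzip_count in range(1, max_gzip + 1):
            yield mead_count + gzip_count, mead_count, gzip_count


def split(total: int) -> tuple[int, int]:
    """Split an extract size into (mead_count, gzip_count).

    Assumes the step of ten used above, so gzip contributes 1 to 10 sentences.
    """
    gzip_count = (total - 1) % 10 + 1
    return total - gzip_count, gzip_count
```

With the default arguments the schedule yields 110 extracts per cluster. Capping the MEAD contribution at 70 sentences would yield 80 extracts, which matches the count reported for Cluster 323. 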
So for any extract of size n, ((n - 1) mod 10) + 1 gives the number of sentences contributed by gzip. For example, an extract of fifty-six sentences contains fifty sentences from MEAD and six from gzip. In this way, a total of 110 extracts were created for each cluster except Cluster 323, for which only 80 extracts were created because that cluster has only 91 sentences in total. For clarification, the 110-sentence extract for each cluster contained 100 MEAD sentences and 10 sentences from the chosen gzip policy. The 10-sentence extract for each cluster contained 0 MEAD sentences and 10 sentences from the chosen gzip policy. To provide a benchmark for the gzip-modified extracts, extracts containing an identical number of sentences were created using only MEAD, so a 110-sentence MEAD extract has all of its sentences chosen by MEAD. Relative utility was computed for all types of gzip extracts, as well as for the MEAD-only extracts.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Evaluation methods </SectionTitle> <Paragraph position="0"> We use the Relative Utility (RU) method (Radev et al., 2000) to compare our various summaries. To calculate RU, human judges read through all sentences in a document cluster and then give each sentence a score from 1 (totally irrelevant) to 10 (central to the topic), based on their impression of its importance for a summary of the documents. Each judge's score is then normalized by his or her other scores. Finally, for each sentence, the judges' scores are summed and normalized again by the number of judges. 
Then a final score is given for a summary by summing the utility scores of the sentences in the summary and then factoring in the upper bound (highest utility scores given by the judges) and lower bound (utility scores from randomly chosen sentences). [Figure caption fragment: ... the six policies to be the sixth sentence in the summary.]</Paragraph> <Paragraph position="1"> We use this method because, as Radev et al. (2002) find, Precision, Recall, and Kappa measures as well as content-based evaluation methods are unreliable for short summaries (5%-30%), especially in multi-document summarization, where several sentences are likely to contribute the same information to a summary.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Results </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Performance of Bot Ratios </SectionTitle> <Paragraph position="0"> When this project was in its initial stages, the ratios policy was designed in the hope that it would select the highest quality sentences. However, it was the top ratios policy, not the bot ratios policy, that was expected to succeed.</Paragraph> <Paragraph position="1"> Top ratio sentences are ideally the sentences which provide the greatest increase in gzip size for the smallest increase in summary length. Logically, these are the sentences that would appear to enhance the summary the most for the smallest cost. Bot ratio sentences are essentially the sentences which provide the greatest increase in summary length for the smallest increase in size. In many cases, they are simply the longest sentences remaining to be used. 
The bot ratios policy was originally included in this study only to confirm our initial expectation that the sentence with the smallest (increase in size) / (increase in length) would not improve the summary a great deal.</Paragraph> <Paragraph position="2"> [Figure residue: candidate sentences, policy labels lost] ... proposes and the Action Committee Against Narcotics supports that a registration scheme should be introduced for non-medical drug treatment and rehabilitation centres, in order to: (3) Notable amongst the approved projects for youths-at-risk are the $2.5 million proposal to be organised by the Hong Kong Federation of Youth Groups featuring preventive education and guidance for 2 500 high-risk youths from primary and secondary schools, as well as from youth centres in Tsuen Wan and Kwai Tsing districts; and the $2.3 million project by the Hong Kong Christian Service targeting at 3 000 youths-at-risk, including school drop-outs and unemployed young people, with a view to minimising their exposure to social and moral danger which could lead to substance abuse.</Paragraph> <Paragraph position="3"> (3) Notable amongst the approved projects for youths-at-risk are the $2.5 million proposal to be organised by the Hong Kong Federation of Youth Groups featuring preventive education and guidance for 2 500 high-risk youths from primary and secondary schools, as well as from youth centres in Tsuen Wan and Kwai Tsing districts; and the $2.3 million project by the Hong Kong Christian Service targeting at 3 000 youths-at-risk, including school drop-outs and unemployed young people, with a view to minimising their exposure to social and moral danger which could lead to substance abuse. [Figure caption fragment: ... to be the sixth sentence in the summary. The number in parentheses shows where in the summary this sentence will be added. For example, the first policy, bot lengths, would insert the short sentence &quot;Anti-drug work poses challenge&quot; between sentences 2 and 3 of the base five-sentence summary.]</Paragraph> <Paragraph position="4"> However, we were surprised to find that our expectations for this policy were wrong. Upon examining the experimental results, it was found that the bot ratios policy, which essentially picks the longest sentence in most cases, actually outperformed the existing summarizer by a considerable margin. Although this policy does not prove anything about the use of gzip in summarization, the surprising nature of its performance is certainly worth noting. Figure 6 shows scores for summaries created using bot ratios and top sizes, and for summaries created using only MEAD.</Paragraph> <Paragraph position="5"> As is indicated in Figure 6, gzip's bot ratios policy outperformed MEAD by a significant margin in Cluster 323. There is an explanation for these scores which takes into account the fact that the top sizes policy had a lower score than MEAD for this cluster. In a cluster of documents, many of the short sentences are the most repetitive ones, usually simply stating the event that occurred or the subject of the document and not containing any extraneous information. Most often it is the longer sentences which provide the extra information that makes for rich summaries. Since the ratio being used in this evaluation is size/length, many of the shorter sentences may have been eliminated from being chosen for the reasons mentioned above. This leaves only the longer sentences to choose from. Since the length of most sentences is far greater than the size increase when gzipped, it makes sense that most remaining sentences would have very low ratio scores. 
In a larger cluster, many of the sentences subsume each other since there are so many similar sentences, but in a small cluster such as 323 there is a great deal less subsumption. If gzip is picking sentences based on the bot ratios policy, normally it would pick many sentences that were very similar, because the bot ratios policy relies on greater sentence length as its criterion for selection and the small change in gzip size provided by similar sentences would only lower the ratio for a potential sentence even more. However, since there is less repetition in a small cluster, the bot ratios policy ends up picking sentences which differ more from each other than in a larger cluster. These findings are quite surprising and do not agree with our expectations. The ratio policies were intended to balance the fact that larger sentences will obviously contain more information. The relative utility scores for bot ratios, however, indicated that choosing larger sentences resulted in better summaries, with the exception of the 125 cluster. This contradicts the view that the sentence with the greatest increase in gzip size is better suited for a summary. The possible reasons for this contradiction are discussed in the next section.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Clusters and their sizes </SectionTitle> <Paragraph position="0"> One of the reasons that the bot ratios policy outscored the top sizes policy in two out of five clusters may be that the sample size in the clusters in which bot ratios outperformed top sizes was not large enough. This is illustrated by examining Clusters 125 and 46. In these clusters, the top sizes policy and bot ratios policy were either virtually identical or top sizes outperformed bot ratios by a considerable margin. 
It is worth noting that Cluster 46 was by far the largest used in this study at 344 sentences and Cluster 125 was the second largest at 232 sentences.</Paragraph> <Paragraph position="1"> The third largest was Cluster 60 with 150 sentences, in which top sizes also beat bot ratios. The fact that the top sizes policy outscored bot ratios in these clusters indicates that, although in smaller clusters a larger length indicates a better candidate due to decreased repetition, in a large cluster the longer sentences are quite repetitive and picking a sentence based on gzipped size is far more effective for summarization.</Paragraph> <Paragraph position="2"> This principle is illustrated on a smaller scale when examining the 46 cluster. In the first fifty extracts, the gzip bot ratios policy outscores the top sizes policy forty times. However, in the last 60 extracts, bot ratios outscored top sizes a mere five times. This indicates that early on the longest sentence contains the most useful information, but as the size of the extract increases, the longer sentences start to become repetitive and therefore decrease the quality of the extract. One solution to this disparity between large and small clusters would be to alter how sentences are chosen based on cluster size or the size of the existing summary. If the cluster or summary were small, first all the sentences with the top lengths would be grouped, and of those the sentence with the highest gzip size would be chosen. If the cluster or summary were large, the sentence could be chosen on gzip size alone. Figure 8 is a table indicating scores for both policies and MEAD for the first and last ten sentences of each cluster. For all the clusters with the exception of 125, our hypothesis was correct. 
The top sizes method was better in the larger last extracts and the bot ratios policy prevailed early on in the small 1-10 sentence extracts.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Initial Size and RU Scores </SectionTitle> <Paragraph position="0"> Since the sentence that the gzip top sizes policy chooses is based on the amount of information that already exists in the summary, the quality of the sentences chosen should depend on the amount of existing information. Therefore, as the size of the extracts increases, the relative utility scores should also increase for the top sizes policy. However, there is also a general trend in which all relative utility scores increase as a function of extract size. So in order to determine whether the top sizes policy is working correctly, we can compare the difference between MEAD and top sizes for the first twenty and the last twenty sentences of each extract; the difference should be greater for the last twenty.</Paragraph> <Paragraph position="1"> In four out of five cases (Figure 8), the top sizes policy behaved as it should have, increasing performance with increasing size. In the one case of Cluster 60, where the performance over MEAD actually decreased as the size of the extract increased, it should be noted that MEAD improved more in this cluster than in any other cluster. So although the top sizes policy still improved with regard to extract size, it could not improve as quickly as MEAD in that one cluster.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Conclusion </SectionTitle> <Paragraph position="0"> Overall, there were many instances when gzip outperformed MEAD. These mainly occurred after the first ten sentences, because for the first ten sentences gzip had very little preliminary data to use in choosing the next sentence. 
Figure 11 lists how many times each policy beat MEAD after the first ten sentences of each cluster and the number of times that MEAD beat both gzip policies.</Paragraph> <Paragraph position="1"> The general trend was that both gzip policies outperformed MEAD in medium-length summaries of 20 to 60 sentences. Furthermore, the top sizes policy outperformed MEAD more often in large summaries, usually those with 100+ sentences.</Paragraph> <Paragraph position="2"> A note on performance. Although theoretically interesting, our method is too slow for practical use in fast-paced summarization systems. It takes time roughly proportional to the size, N, of the summary desired. The bottleneck in this process is, of course, the gzipping process.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5.1 Future Work </SectionTitle> <Paragraph position="0"> These results indicate that gzip can be used to enhance summaries or even produce large summaries from scratch. One metric lacking in our measurements is that of subsumption. If subsumption data were available for each of the clusters used, it would most likely favor gzip summaries as being more accurate, because the gzip algorithm is designed to remove the very repetitiveness which subsumption measures. Further work remains to be done on other clusters of various sizes and redundancy, as well as with other summarization metrics, such as content-based metrics (cosine, overlap, longest-common substring, etc.). Nevertheless, we have established the potential benefits of applying gzip to the task of multi-document summarization.</Paragraph> </Section> </Paper>