<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-0503">
  <Title>Multi-document summarization using off the shelf compression software</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
1.1 The connection between text compression and
multidocument summarization
</SectionTitle>
      <Paragraph position="0"> A standard way for producing summaries of text documents is sentence extraction. In sentence extraction, the summary of a document (or a cluster of related documents) is a subset of the sentences in the original text (Mani, 2001). A number of techniques for choosing the right sentences to extract have been proposed in the literature, ranging from word counts (Luhn, 1958), key phrases (Edmundson, 1969), naive Bayesian classification (Kupiec et al., 1995), lexical chains (Barzilay and Elhadad, 1997), topic signatures (Hovy and Lin, 1999) and cluster centroids (Radev et al., 2000).</Paragraph>
      <Paragraph position="1"> Most techniques for sentence extraction compute a score for each individual sentence, although some recent work has started to pay attention to interactions between sentences. On the other hand, and particularly in multidocument summarization, some sentences may be redundant in the presence of others and such redundancy should lead to a lower score for each sentence proportional to the degree of overlap with other sentences in the summary. The Maximal Marginal Relevance (MMR) method (Carbonell and Goldstein, 1998) does just that.</Paragraph>
      <Paragraph position="2"> In this paper, we are taking the idea of penalizing redundancy for multi-document summaries further. We want to explore existing techniques for identifying redundant information and using them for producing better summaries.</Paragraph>
      <Paragraph position="3"> As in many areas in NLP, one of the biggest challenges in multi-document summarization is deciding on a way of calculating the similarity between two sentences or two groups of sentences. In extractive multi-document summarization, the goal is, on the one hand, to select the sentences which best represent the main point of the documents and, on the other, to pick sentences which do not overlap much with those sentences which have already been selected. To accomplish the task of sentence comparison, researchers have relied on stemming and counting n-gram similarity between two sentences. So, for example, if we have the following two sentences: &amp;quot;The dogs go to the parks&amp;quot; and &amp;quot;The dog is going to the park,&amp;quot; they would be nearly identical after stemming: &amp;quot;the dog [be] go to the park,&amp;quot; and any word overlap measure would be quite high (unigram cosine of .943).</Paragraph>
      <Paragraph position="4"> In some ways, gzip can be thought of as a radical stemmer which also takes into account n-gram similarity. If the two sentences were in a file that was gzipped, the size of the file would be much smaller than if the second sentence were &amp;quot;A cat wanders at night.&amp;quot; (unigram cosine of 0). By comparing the size of the compressed files, we can pick that sentence which is most similar to what has already been selected for the summary (high compression ratio) or the most different (low compression ratio), depending on what type of summary we would prefer.</Paragraph>
      <Paragraph position="5"> On a more information theoretic basis, as Benedetto et al. observe (Benedetto et al., 2002a), comparing the size of gzipped files allows us to roughly measure the distance (increase in entropy) between a new sentence and the already selected sentences. Benedetto et al. (Benedetto et al., 2002a) find that on their task of language classification, gzip's measure of information distance can effectively be used as a proxy for semantic distance. And so, we set out to see if we could usefully apply gzip to the task of multi-document summarization.</Paragraph>
      <Paragraph position="6"> Gzip is a compression utility which is publicly available and widely used (www.gzip.org). Benedetto et al.</Paragraph>
      <Paragraph position="7"> (Benedetto et al., 2002a) summarize the algorithm behind gzip and discuss its relationship to entropy and optimal coding. Gzip relies on the algorithm developed by Ziv and Lempel (Ziv and Lempel, 1977). Following this algorithm, gzip reads along a string and looks for repeated substrings, if it finds a substring which it has already read, it replaces the second occurrence with two numbers, the length of the substring and the distance from that loca-tion back to the original string. If the substring length is greater than the distance, then the unzipper will know that the sequence repeats.</Paragraph>
      <Paragraph position="8"> In our framework, we use an off-the-shelf extractive summarizer to produce a base summary. We then create a number of summaries containing precisely one more sentence than the base summary. If a7a8a9a7 is the total number of sentences in the input cluster, and a10 is the number of sentences already included in the base, there are a7a8a11a7a13a12a14a10 possible summaries of length a10a16a15a18a17 sentences. One of them has to be chosen over the others. In this work, we compress each of the a7a8a9a7a19a12a20a10 candidate summaries and observe the relative increase in the size of the compressed file compared to the compressed base summary. The basic idea is that sentences containing the most new information will result in relatively longer compressed summaries (after normalizing for the uncompressed length of the newly added sentence). We will discuss some variations of this algorithm in the next section.</Paragraph>
      <Paragraph position="9"> There are two issues which must be kept in mind in applying gzip to problems beyond data compression. First, because of the sequential nature of the algorithm, compression towards the beginning of the file will not be as great as that later in the file. Second, there is a 32k limit on the length of the window that gzip considers. So, if &amp;quot;abc&amp;quot; appears at the beginning of a string, and then also appears 33k later (but nowhere in between), gzip will not be able to compress the second appearance. This means that our process is &amp;quot;blind&amp;quot; to sentences in the summary which happen 32k earlier. This could potentially be a drawback to our approach, but in practice, given realistic text lengths, we have not found a negative effect.</Paragraph>
      <Paragraph position="10"> The impetus for our approach is (Benedetto et al., 2002a; Benedetto et al., 2002b) who report on their use of gzip for language classification, authorship attribution, and topic classification. In their approach, they begin with a set of known documents. For each document, they measure the ratio of the uncompressed document to the compressed document. Then they append an unknown document to each known document cluster, and compress these new documents. Their algorithm then chooses whichever document had the greatest compression in relation to its original. As (Goodman, 2002) observes, using compression techniques for these tasks is not an entirely new approach, nor is it very fast. Nevertheless, we wanted to determine the efficacy of applying Benedetto et al.'s methods to the task of multi-document summarization.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>