<?xml version="1.0" standalone="yes"?>
<Paper uid="W03-1205">
  <Title>An evolutionary approach for improving the quality of automatic summaries</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Corpus investigation
</SectionTitle>
    <Paragraph position="0"> Before we implemented our method, we wanted to learn if the continuity principle holds in human produced summaries. In order to perform this analysis we investigated a corpus of 146 human produced abstracts from the Journal of Artificial Intelligence Research (JAIR). 1 Most of the processing was done automatically using a simple script which tests if the principle is satisfied by pairs of consecutive utterances (i.e. if the pair has at least one head noun phrase in common). Those pairs which violate the principle were manually analysed.</Paragraph>
    <Paragraph position="1"> In our corpus almost 75% of the pairs of  consecutive utterances (614 out of 835) satisfy the principle. In terms of summaries, it was noticed that 44 out of 146 do not have any such pairs which violate the principle.</Paragraph>
    <Paragraph position="2"> After analysing the violations, we can explain them in one of the following ways: - In 126 out of 221 cases (57%) the link between utterances is realised by devices such as rhetorical relations.</Paragraph>
    <Paragraph position="3"> - In 76 cases (34%) the continuity principle was realised, but was not identified by the script because of words were replaced by semantic equivalents. In only 17 of these cases pronouns were used.</Paragraph>
    <Paragraph position="4"> - Ramifications in the discourse structure violate the principle in 19 cases (9%). These ramifications are usually explicitly marked by phrases such as firstly, secondly.</Paragraph>
    <Paragraph position="5"> After investigating our corpus we can definitely say that the continuity principle is present in human produced abstracts, and therefore by trying to enforce it in automatic summaries, we might produce better summaries. However, by using such approach we cannot be sure that the produced summaries are coherent, being known that it is possible to produce cohesive texts, but which are incoherent. In Section 4 we present a method which uses the continuity principle to score the sentences. This method is then evaluated in Section 5.</Paragraph>
    <Paragraph position="6"> We also have to emphasise that we do not claim that humans consciously apply the continuity principle when they produce summaries or any other texts. The presence of the violations identified in our corpus is an indication for this.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 The method
</SectionTitle>
    <Paragraph position="0"> Karamanis and Manurung (2002) used the continuity principle in text generation to choose the most coherent text from several produced by their generation system. In their case, the candidate texts were sequences of facts, their best ordering was determined by an evolutionary algorithm which tried to minimise the number of violations of the continuity principle they contained.</Paragraph>
    <Paragraph position="1"> We take a similar approach in our attempt to produce coherent summaries, trying to minimise the number of violations of the principle they contain.</Paragraph>
    <Paragraph position="2"> However, our situation is more difficult because a summarisation program needs firstly to identify the important information in the document and then present it in a coherent way, whereas in text generation the information to be presented is already known. &amp;quot;Understand and generate&amp;quot; methods would be appropriate, but they can only be applied to restricted domains. Instead, we employ a method which scores a sentence not only using its content, but also considering the context in which the sentence would appear in a summary. Two different algorithms are proposed. Both algorithms use the same content-based scoring method (see Section 4.1), but they use different approaches to extract sentences. As a result, the way the context-based scoring method defined in Section 4.2 is applied differs. The first algorithm is a greedy algorithm which does not always produce the best summary, but it is simple and fast. The second algorithm employs an evolutionary technique to determine the best set of sentences to be extracted.</Paragraph>
    <Paragraph position="3"> We should point out that another difference between our method and the ones used in text generation is that we do not intend to change the order of the extracted sentences. Such an addition would be interesting, but preliminary experiments did not lead to any promising results.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Content-based scoring method
</SectionTitle>
      <Paragraph position="0"> We rely on several existing scoring methods to determine the importance of a sentence on the basis of its content. In this section we briefly describe how this score is computed. The heuristics employed to compute the score are: Keyword method: uses the TF-IDF scores of words to compute the importance of sentences. The score of a sentence is the sum of words' scores from that sentence (Zechner, 1996) Indicator phrase method: Paice (1981) noticed that in scientific papers it is possible to identify phrases such as in this paper, we present, in conclusion, which are usually meta-discourse markers. A list of such phrases has been built and all the sentences which contain an indicating phrase have their scores boosted or penalised depending on the phrase.</Paragraph>
      <Paragraph position="1"> Location method: In scientific papers important sentences tend to appear at the beginning and end of the document. For this reason sentences in the first and the last 13 paragraphs have their scores boosted. This value was determined through experiments.</Paragraph>
      <Paragraph position="2"> Title and headers method: Words in the title and headers are usually important, so sentences containing these words have their scores boosted.</Paragraph>
      <Paragraph position="3"> Special formatting rules: Quite often certain important or unimportant information is marked in texts in a special way. In scientific paper it is common to find equations, but they rarely appear in the abstracts. For this reason sentences that contain equations are excluded.</Paragraph>
      <Paragraph position="4"> The score of a sentence is a weighted function of these parameters, the weights being established through experiments. As already remarked by other researchers, one of the most important heuristics proved to be the indicating phrase method.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Context-based scoring method
</SectionTitle>
      <Paragraph position="0"> Depending on the context in which a sentence appears in a summary, its score can be boosted or penalised. If the sentence which is considered satisfies the continuity principle with either the sentence that precedes or follows it in the summary to be produced, its score is boosted.2 If the continuity principle is violated the score is penalised. After experimenting with different values we decided to boost the sentence's score with the TF-IDF scores of the common NPs' heads and penalise with the highest TF-IDF score in the document.</Paragraph>
      <Paragraph position="1"> While analysing our corpus we noticed that large number of violations of the continuity principle are due to utterances in different segments. Usually this is explicitly marked by a phrase. We extracted a list of such phrases from our corpus and decided not to penalise those sentences which violate the continuity principle, but contain one of these phrases.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 The greedy algorithm
</SectionTitle>
      <Paragraph position="0"> The first of the two sentence selection algorithms is a greedy algorithm which always extracts the highest scored sentence from those not extracted yet. The sentences' scores are computed in the way described 2The way the sentences which precedes and follows it is determined depends very much on the algorithm used (see Sections 4.3 and 4.4 for details). If the sentence is the first or the last in a summary (i.e. there is no preceding or following sentence) the score is not changed.</Paragraph>
      <Paragraph position="1"> Given an extract a0a2a1a4a3a6a5a2a7a8a7a10a9 , a1a11a3a6a5a12a7a13a7a13a14 ,..., a1a4a3a6a5a12a7a13a7a13a15a17a16 and S the sentence which is considered for  extraction 1. Find a1a19a18a21a20a23a22a25a24 and a1a4a26a27a22a6a28a30a29 from the extract which are the closest sentences before and after S in the document, respectively.</Paragraph>
      <Paragraph position="2"> 2. Adjust the score S considering the context a1a31a18a21a20a23a22a25a24 , a1 , a1a11a26a2a22a6a28a30a29 .  in Section 4.2. Given that the original order of sentences is maintained in the summary, whenever a sentence is considered for extraction, the algorithm presented in Figure 1 is used. We should emphasise that at this stage the sentence is not extracted, but its score is computed as if it is included in the extract. After this process is completed for all the sentences which are not present in the extract, the one with the highest score is extracted. The process is repeated until the required length of the summary is reached. As it can be noticed, the algorithm cannot be applied to the first sentence. For this reason the first extracted sentence is always the one with the highest content-based score.</Paragraph>
      <Paragraph position="3"> It should be noted that it is possible to extract a sentence a32a34a33 which satisfies the continuity principle with its preceding sentence a32a36a35 , but in a later iteration to extract another sentence, which is between these two, and which satisfies the continuity principle with a32a37a35 , but not with a32a34a33 . Unfortunately, given the nature of the algorithm, it is impossible to go back and replace a32a34a33 with another sentence, and therefore sometimes the algorithm does not find the best set of sentences. In order to alleviate this problem, in the next section we present an algorithm which selects sentences using an evolutionary algorithm.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.4 The evolutionary algorithm
</SectionTitle>
      <Paragraph position="0"> The greedy algorithm presented in the previous section selects sentences in an interactive manner, the inclusion of a sentence in the summary depending on the sentences which were included before. As a result it is possible that the best summary is not produced. In order to alleviate this problem an algorithm which uses evolutionary techniques to select the set of sentences is proposed. Evolutionary algorithms are advanced searching algorithms which use techniques inspired by the nature to find the solution of a problem. A specific type of evolutionary algorithms are genetic  which contains the sentences 3, 5, 8, 10, 14, 18, 66, 79 from the document algorithms (Holland, 1975) which encode the problem as a series of genes, called chromosome. The most common way to encode genes is the binary encoding, where each gene can take the values 0 or 1. If we have decided to use such an encoding the value 0 would have meant not to include the sentence in the summary, whereas 1 to include it. For our problem the length of a chromosome would have been equal to the number of sentences in the texts. For long texts, such as the ones we use, this would have meant very long chromosomes, and as a result slow convergence, without any certainty that the best solution is found (Holland, 1975).</Paragraph>
      <Paragraph position="1"> Instead of using binary encoding, we decided that our genes take integer values, each value representing the position of a sentence from the original document to be included in the summary.</Paragraph>
      <Paragraph position="2"> The length of the chromosome is the desired length of the summary. Caution needs to be taken whenever a new chromosome is produced so the values of the genes are distinct (i.e. the summary contains distinct sentences). If a duplication is found in a chromosome, then the gene's value which contains the duplication is incremented by one. In this way the chromosome will contain two consecutive sentences, and therefore it could be more coherent.</Paragraph>
      <Paragraph position="3"> A chromosome is presented in Figure 2.</Paragraph>
      <Paragraph position="4"> Genetic algorithms use a fitness function to assess how good a chromosome is. In our case the fitness function is the sum of the scores of the sentences indicated in the chromosome. The sentences' scores are not considered &amp;quot;in isolation&amp;quot;, they are adjusted in the way described in Section 4.2. For this algorithm, determining the preceding and the following sentence is trivial, all the information being encoded in the chromosome.</Paragraph>
      <Paragraph position="5"> Genetic algorithms use genetic operators to evolve a population of chromosomes (Holland, 1975). In our case, we used weighed roulette wheel selection to select chromosomes. Once several chromosomes are selected they are evolved using crossover and mutation. We used the classical single point crossover operator and two mutation operators. The first one replaces the value of a gene with a randomly generated integer value. The purpose of this operator is to try to include random sentences in the summary and in this way to help the evolutionary process. The second mutation operator replaces the values of a gene with the value of the preceding gene incremented by one. This operator introduces consecutive sentences in the summary, which could improve coherence.</Paragraph>
      <Paragraph position="6"> The genetic algorithm starts with a population of randomly generated chromosomes which is then evolved using the operators. Each of the operators has a certain probability of being applied. The best chromosome (i.e. the one with the highest fitness score) produced during all generations is the solution to our problem. In our case we iterated a population of 500 chromosomes for 100 generations. Given that the search space (i.e. the set of sentences from the document) is very large we noticed that at least 50 generations are necessary until the best solution is achieved. The algorithm is evaluated in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Evaluation and discussion
</SectionTitle>
    <Paragraph position="0"> We evaluated our methods on 10 scientific papers from the Journal of Artificial Intelligence Research, totalising almost 90,000 words. The number of texts used for evaluation might seem small, but given that from each text we produced eight different summaries which had to be read and assessed by humans, the evaluation process was very time consuming.</Paragraph>
    <Paragraph position="1"> Throughout the paper we have mentioned the term quality of a summary several times without defining it. In this paper the quality of a summary is measured in terms of coherence, cohesion and informativeness. The coherence and cohesion were quantified through direct evaluation using a methodology similar to the one proposed in (Minel et al., 1997). The cohesion of a summary is indicated by the number of dangling anaphoric expressions,3 whereas the coherence by the number of ruptures in the discourse.</Paragraph>
    <Paragraph position="2"> For informativeness we computed the similarity between the automatic summary and the document as proposed in (Donaway et al., 2000). Given that the methods discussed in this paper try to enforce local coherence they directly influence only the number of discourse ruptures, the changes of the other two measures are a secondary effect.</Paragraph>
    <Paragraph position="3"> In our evaluation, we compared the two new algorithms with a baseline method and the content-based method. The baseline, referred to as TF-IDF, extracts the sentences with the highest TF-IDF scores. The comparison with the baseline does not tell us if by adding the context information described in Section 4.2 the quality of a summary improves. In order to learn this, we compared the new algorithms with the one presented in Section 4.1. They all use the same content-based scoring method, so if differences were noticed, they were due to the context information added and the way sentences are extracted.</Paragraph>
    <Paragraph position="4"> The results of the evaluation are presented in Tables 1, 2 and 3. In these tables TF-IDF represents the baseline, Basic method is the method described in section 4.1, whereas Greedy and Evolutionary are the two algorithms which use the continuity principle. In Table 1, the row Maximum indicates the maximum number of ruptures which could be found in that summary. This number is given by the total number of sentences in the summary.</Paragraph>
    <Paragraph position="5"> Given that for the direct evaluation the summaries had to be analysed manually, in a first step, we produced 3% summaries. After noticing only slight improvement when using our methods, we decided to increase their lengths to 5%, to learn if the methods perform better when they produce longer summaries. The values for the 5% summaries are represented in the tables in brackets.</Paragraph>
    <Paragraph position="6"> 3A dangling anaphor is a referential expression which is deprived of its referent as a result of extracting only the sentence with the anaphoric expression.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Number of ruptures in the discourse
</SectionTitle>
      <Paragraph position="0"> A factor which reduces the legibility is the number of discourse ruptures (DR). Using an approach similar to (Minel et al., 1997) we consider that a discourse rupture occurs when a sentence seems completely isolated from the rest of the text. Usually this happens due to presence of isolated discourse markers such as firstly, secondly, however, on the other hand, etc. Table 1 shows the number of DR in these summaries.</Paragraph>
      <Paragraph position="1"> A result which was expected is the large number of DR in the summaries produced by our baseline.</Paragraph>
      <Paragraph position="2"> Such a result is normal given that the method does not use any kind of discourse information. The baseline is outperformed by the rest of the methods in almost all the cases, the overall number of DR for each method being significantly lower than the DR of the baseline.</Paragraph>
      <Paragraph position="3"> Table 1 shows that for 3% summaries, the context information has little influence on the number of the discourse ruptures present in a summary.</Paragraph>
      <Paragraph position="4"> This suggests that the information provided by the indicating phrases (which are meta-discourse markers) has greater influence on the coherence of the summary than the continuity principle.</Paragraph>
      <Paragraph position="5"> The situation changes when longer summaries are considered. As can be observed in Table 1, the continuity principle reduces the number of DR; this number for the Evolutionary algorithm being almost half the number for Basic method. Actually, by examining the table, we can see that the evolutionary algorithm performs better than the basic method in all of the cases. The same cannot be said about the greedy algorithm. It performs more or less the same as the basic algorithm, the overall improvement being negligible. This clearly indicates that in our case a simple greedy algorithm is not enough to choose the set of sentences to extract, and more advanced techniques need to be used instead.</Paragraph>
      <Paragraph position="6"> The methods proposed in this paper perform better when longer summaries are produced. Such a result is not obtained only because the summary contains more sentences, and is therefore more likely to contain sentences which are related to each other. If this was the case, we would not have such a large number of DR in summaries generated by the baseline. We believe that the improvement is due to the discourse information used by the methods.</Paragraph>
      <Paragraph position="7"> If the values of DR for each text are scrutinised, we can notice very mixed values. For some of the texts the continuity principle helps a lot, but for others it has little influence. This suggests that for some of the texts the continuity principle is too weak to influence the quality of a summary, and a combination of the continuity principle with the other principles from centering theory, as already used for text generation in (Kibble and Power, 2000), could lead to better summaries.</Paragraph>
      <Paragraph position="8"> The methods proposed in this paper rely on several parameters to boost or penalise the scores of a sentence on the basis of context. A way to improve the results of these methods could be by selecting better values for these parameters.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Dangling anaphors
</SectionTitle>
      <Paragraph position="0"> Even though the problem of anaphora is not directly addressed by our methods, a subsidiary effect of the improvement of the local cohesion should be a decrease in the number of dangling references.</Paragraph>
      <Paragraph position="1"> Table 2 contains the number of dangling references in the summaries produced by different  methods. This number reduces in the summaries produced by the evolutionary algorithm. As in the case of discourse ruptures, the greedy algorithm does not perform significantly better than the basic method. All the methods outperform the baseline. We noticed that the most frequent dangling references were due to phrases referring to tables, figures, definitions and theorems (e.g. As we showed in Table 3 a0a1a0a1a0 ). They can be referred to in any point in the text, and therefore, the local coherence cannot guarantee inclusion of the referred entities. Moreover, in many cases the referred entity is not necessarily textual (e.g. tables and figure), and therefore should not be included in a summary.</Paragraph>
      <Paragraph position="2"> In light of these, we believe that the problem of such dangling references should be addressed by the content-based method, which normally should filter sentences containing them.</Paragraph>
      <Paragraph position="3"> Dangling referential pronouns are virtually nonexistent, which means that in most of the cases the reader can understand, at least partially, the meaning of the referential expression.</Paragraph>
      <Paragraph position="4"> As observed for DR, the values for individual texts are mixed.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Text informativeness
</SectionTitle>
      <Paragraph position="0"> In order to assess whether information is lost when the context-based method is used to enhance the sentence selection, we used a content-based evaluation metric (Donaway et al., 2000). This metric computes the similarity between the summary and the whole document, a good summary being one which has a value close to 1.4 Table 3 shows that the evolutionary algorithm</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4In this paper we used cosine distance between the
</SectionTitle>
      <Paragraph position="0"> document's vector and the automatic summary's vector. Before building the vectors the texts were lemmatised.</Paragraph>
      <Paragraph position="1"> does not lead to major loss of information, for several text this method obtains the highest score.</Paragraph>
      <Paragraph position="2"> In contrast, the greedy method seems to exclude useful information, for several texts, performing worse than the basic method and the baseline.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Related work
</SectionTitle>
    <Paragraph position="0"> In text summarisation several researchers have addressed the problem of producing coherent summaries. In general, rules are applied to revise summaries produced by a summarisation system (Mani et al., 1999; Otterbacher et al., 2002).</Paragraph>
    <Paragraph position="1"> These rules are produced by humans who read the automatic summaries and identify coherence problems. Marcu (2000) produced coherent summaries using Rhetorical Structure Theory (RST). A combination of RST and lexical chains is employed in (Alonso i Alemany and Fuentes Fort, 2003) for the same purpose. Comparison to the work by Marcu and Alonso i Alemany is difficult to make because they worked with different types of texts.</Paragraph>
    <Paragraph position="2"> As already mentioned, information from centering theory was used in text generation to select the most coherent text from several candidates (Kibble and Power, 2000; Karamanis and Manurung, 2002).</Paragraph>
  </Section>
class="xml-element"></Paper>