<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1049">
  <Title>Paragraph-, word-, and coherence-based approaches to sentence ranking: A comparison of algorithm and human performance</Title>
  <Section position="4" start_page="2" end_page="4" type="metho">
    <SectionTitle>
3 Coherence-based summarization revisited
</SectionTitle>
    <Paragraph position="0"> This section will discuss in more detail the data structures we used to represent discourse structure, as well as the algorithms used to calculate sentence importance, based on discourse structures.</Paragraph>
    <Section position="1" start_page="2" end_page="3" type="sub_section">
      <SectionTitle>
3.1 Representing coherence structures
</SectionTitle>
      <Paragraph position="0"> Discourse segments can be defined as non-overlapping spans of prosodic units (Hirschberg &amp; Nakatani (1996)), intentional units (Grosz &amp; Sidner (1986)), phrasal units (Lascarides &amp; Asher (1993)), or sentences (Hobbs (1985)). We adopted a sentence unit-based definition of discourse segments for the coherence-based approach that assumes non-tree graphs. For the coherence-based approach that assumes trees, we used Marcu (2000)'s more fine-grained definition of discourse segments because we used the discourse trees from Carlson et al. (2002)'s database of coherenceannotated texts.</Paragraph>
      <Paragraph position="1">  We assume a set of coherence relations that is similar to that of Hobbs (1985). Below are examples of each coherence relation.</Paragraph>
      <Paragraph position="2">  elaboration, temporal sequence, and attribution are asymmetrical or directed relations, whereas similarity, contrast, and temporal sequence are symmetrical or undirected relations (Mann &amp; Thompson, 1988; Marcu, 2000). In the non-tree-based approach, the directions of asymmetrical or directed relations are as follows: cause effect for cause-effect; cause absent effect for violated expectation; condition consequence for condition; elaborating elaborated for elaboration, and source attributed for attribution. In the tree-based approach, the asymmetrical or directed relations are between a more important discourse segment, or a Nucleus, and a less important discourse segment, or a Satellite (Marcu (2000)). The Nucleus is the equivalent of the arc destination, and the Satellite is the equivalent of the arc origin in the non-tree-based approach. The symmetrical or undirected relations are between two discourse elements of equal importance, or two Nuclei. Below we will explain how the difference between Satellites and Nuclei is considered in tree-based sentence rankings.</Paragraph>
      <Paragraph position="3"> 3.1.3 Data structures for representing discourse coherence As mentioned above, we used two alternative representations for discourse structure, tree- and non-tree based. In order to illustrate both data structures, consider (9) as an example:  (9) Example text 0. Susan wanted to buy some tomatoes.</Paragraph>
      <Paragraph position="4"> 1. She also tried to find some basil.</Paragraph>
      <Paragraph position="5"> 2. The basil would probably be quite expensive  at this time of the year.</Paragraph>
      <Paragraph position="6"> Figure 2 shows one possible tree representation of the coherence structure of (9)  . Sim represents a similarity relation, and elab an elaboration relation. Furthermore, nodes with a &amp;quot;Nuc&amp;quot; subscript are Nuclei, and nodes with a &amp;quot;Sat&amp;quot; subscript are Satellites.</Paragraph>
      <Paragraph position="7">  Figure 3 shows a non-tree representation of the coherence structure of (9). Here, the heads of the arrows represent the directionality of a relation.</Paragraph>
    </Section>
    <Section position="2" start_page="3" end_page="4" type="sub_section">
      <SectionTitle>
3.2 Coherence-based sentence ranking
</SectionTitle>
      <Paragraph position="0"> This section explains the algorithms for the treeand the non-tree-based sentence ranking approach.</Paragraph>
      <Paragraph position="1">  We used Marcu (2000)'s algorithm to determine sentence rankings based on tree discourse structures. In this algorithm, sentence salience is determined based on the tree level of a discourse segment in the coherence tree. Figure 4 shows Marcu (2000)'s algorithm, where r(s,D,d) is the rank of a sentence s in a discourse tree D with depth d. Every node in a discourse tree D has a promotion set promotion(D), which is the union of all Nucleus children of that node. Associated with every node in a discourse tree D is also a set of parenthetical nodes parentheticals(D) (for example, in &amp;quot;Mars - half the size of Earth - is red&amp;quot;, &amp;quot;half the size of earth&amp;quot; would be a parenthetical node in a discourse tree). Both promotion(D) and parentheticals(D) can be empty sets. Furthermore, each node has a left subtree,  based sentence rank (Marcu (2000)).</Paragraph>
      <Paragraph position="2"> The discourse segments in Carlson et al.</Paragraph>
      <Paragraph position="3"> (2002)'s database are often sub-sentential.</Paragraph>
      <Paragraph position="4"> Therefore, we had to calculate sentence rankings from the rankings of the discourse segments that form the sentence under consideration. We did this by calculating the average ranking, the minimal ranking, and the maximal ranking of all discourse segments in a sentence. Our results showed that choosing the minimal ranking performed best, followed by the average ranking, followed by the maximal ranking (cf. Section 4.4).  We used two different methods to determine sentence rankings for the non-tree coherence graphs  . Both methods implement the intuition that sentences are more important if other sentences relate to them (Sparck-Jones (1993)). The first method consists of simply determining the in-degree of each node in the graph. A node represents a sentence, and the in-degree of a node represents the number of sentences that relate to that sentence.</Paragraph>
      <Paragraph position="5"> The second method uses Page et al. (1998)'s PageRank algorithm, which is used, for example, in the Google(TM) search engine. Unlike just determining the in-degree of a node, PageRank takes into account the importance of sentences that relate to a sentence. PageRank thus is a recursive algorithm that implements the idea that the more important sentences relate to a sentence, the more important that sentence becomes. Figure 5 shows how PageRank is calculated. PR n is the PageRank of the current sentence, PR n-1 is the PageRank of the sentence that relates to sentence n, o</Paragraph>
      <Paragraph position="7"> out-degree of sentence n-1, and a is a damping parameter that is set to a value between 0 and 1.</Paragraph>
      <Paragraph position="8"> We report results for a set to 0.85 because this is a value often used in applications of PageRank (e.g.</Paragraph>
      <Paragraph position="9"> Ding et al. (2002); Page et al. (1998)). We also  Neither of these methods could be implemented for coherence trees since Marcu (2000)'s tree-based algorithm assumes binary branching trees. Thus, the in-degree for all non-terminal nodes is always 2. calculated PageRanks for a set to values between 0.05 and 0.95, in increments of 0.05; changing a did not affect performance.</Paragraph>
      <Paragraph position="11"/>
    </Section>
  </Section>
  <Section position="5" start_page="4" end_page="4" type="metho">
    <SectionTitle>
4 Experiments
</SectionTitle>
    <Paragraph position="0"> In order to test algorithm performance, we compared algorithm sentence rankings to human sentence rankings. This section describes the experiments we conducted. In Experiment 1, the texts were presented with paragraph breaks; in Experiment 2, the texts were presented without paragraph breaks. This was done to control for the effect of paragraph information on human sentence rankings.</Paragraph>
    <Section position="1" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.1 Materials for the coherence-based approaches
</SectionTitle>
      <Paragraph position="0"> In order to test the tree-based approach, we took coherence trees for 15 texts from a database of 385 texts from the Wall Street Journal that were annotated for coherence (Carlson et al. (2002)).</Paragraph>
      <Paragraph position="1"> The database was independently annotated by six annotators. Inter-annotator agreement was determined for six pairs of two annotators each, resulting in kappa values (Carletta (1996)) ranging from 0.62 to 0.82 for the whole database (Carlson et al. (2003)). No kappa values for just the 15 texts we used were available.</Paragraph>
      <Paragraph position="2"> For the non-tree based approach, we used coherence graphs from a database of 135 texts from the Wall Street Journal and the AP Newswire, annotated for coherence. Each text was independently annotated by two annotators. For the 15 texts we used, kappa was 0.78, for the whole database, kappa was 0.84.</Paragraph>
    </Section>
    <Section position="2" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.2 Experiment 1: With paragraph information
</SectionTitle>
      <Paragraph position="0"> Fifteen participants from the MIT community were paid for their participation. All were native speakers of English and were naive as to the purpose of the study (e.g. none of the subjects was familiar with theories of coherence in natural language).</Paragraph>
      <Paragraph position="1"> Participants were asked to read 15 texts from the Wall Street Journal, and, for each sentence in each text, to provide a ranking of how important that sentence is with respect to the content of the text, on an integer scale from 1 to 7 (1 = not important; 7 = very important). The texts were selected so  that there was a coherence tree annotation available in Carlson et al. (2002)'s database. Text lengths for the 15 texts we selected ranged from 130 to 901 words (5 to 47 sentences); average text length was 442 words (20 sentences), median was 368 words (16 sentences). Additionally, texts were selected so that they were about as diverse topics as possible.</Paragraph>
      <Paragraph position="2"> The experiment was conducted in front of personal computers. Texts were presented in a web browser as one webpage per text; for some texts, participants had to scroll to see the whole text. Each sentence was presented on a new line. Paragraph breaks were indicated by empty lines; this was pointed out to the participants during the instructions for the experiment.</Paragraph>
    </Section>
    <Section position="3" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.3 Experiment 2: Without paragraph information
</SectionTitle>
      <Paragraph position="0"> The method was the same as in Experiment 1, except that texts in Experiment 2 did not include paragraph information. Each sentence was presented on a new line. None of the 15 participants in Experiment 2 had participated in Experiment 1.</Paragraph>
    </Section>
    <Section position="4" start_page="4" end_page="4" type="sub_section">
      <SectionTitle>
4.4 Results of the experiments
</SectionTitle>
      <Paragraph position="0"> Human sentence rankings did not differ significantly between Experiment 1 and Experiment 2 for any of the 15 texts (all Fs &lt; 1). This suggests that paragraph information does not have a big effect on human sentence rankings, at least not for the 15 texts that we examined. Figure 6 shows the results from both experiments for one text.</Paragraph>
      <Paragraph position="1"> We compared human sentence rankings to different algorithmic approaches. The paragraph-based rankings do not provide scaled importance rankings but only &amp;quot;important&amp;quot; vs. &amp;quot;not important&amp;quot;. Therefore, in order to compare human rankings to the paragraph-based baseline approach, we calculated point biserial correlations (cf. Bortz (1999)). We obtained significant correlations between paragraph-based rankings and human rankings only for one of the 15 texts.</Paragraph>
      <Paragraph position="2"> All other algorithms provided scaled importance rankings. Many evaluations of scalable sentence ranking algorithms are based on precision/recall/Fscores (e.g. Carlson et al. (2001); Ono et al. (1994)). However, Jing et al. (1998) argue that such measures are inadequate because they only distinguish between hits and misses or false alarms, but do not account for a degree of agreement. For example, imagine a situation where the human ranking for a given sentence is &amp;quot;7&amp;quot; (&amp;quot;very important&amp;quot;) on an integer scale ranging from 1 to 7, and Algorithm A gives the same sentence a ranking of &amp;quot;7&amp;quot; on the same scale, Algorithm B gives a ranking of &amp;quot;6&amp;quot;, and Algorithm C gives a ranking of &amp;quot;2&amp;quot;. Intuitively, Algorithm B, although it does not reach perfect performance, still performs better than Algorithm C.</Paragraph>
      <Paragraph position="3"> Precision/recall/F-scores do not account for that difference and would rate Algorithm A as &amp;quot;hit&amp;quot; but Algorithm B as well as Algorithm C as &amp;quot;miss&amp;quot;. In order to collect performance measures that are more adequate to the evaluation of scaled importance rankings, we computed Spearman's rank correlation coefficients. The rank correlation coefficients were corrected for tied ranks because in our rankings it was possible for more than one sentence to have the same importance rank, i.e. to have tied ranks (Horn (1942); Bortz (1999)).</Paragraph>
      <Paragraph position="4"> In addition to evaluating word-based and coherence-based algorithms, we evaluated one commercially available summarizer, the MSWord summarizer, against human sentence rankings.</Paragraph>
      <Paragraph position="5"> Our reason for including an evaluation of the MSWord summarizer was to have a more useful baseline for scalable sentence rankings than the paragraph-based approach provides.</Paragraph>
      <Paragraph position="6">  ) of each algorithm and human sentence ranking for the 15 texts. MarcuAvg refers to the version of Marcu (2000)'s algorithm where we calculated sentence rankings as the average of the rankings of all discourse segments that constitute that sentence; for MarcuMin, sentence rankings were the minimum of the rankings of all discourse segments in that sentence; for MarcuMax we selected the maximum of the rankings of all discourse segments in that sentence.</Paragraph>
      <Paragraph position="7">  performed numerically worse than most other algorithms, except MarcuMin. Figure 7 also shows that PageRank performed numerically better than all other algorithms. Performance was significantly better than most other algorithms</Paragraph>
      <Paragraph position="9"> 0.121). The difference between PageRank and tf.idf, WithParagraph was marginally significant (F(1,28) = 3.113, p = 0.089).</Paragraph>
      <Paragraph position="10"> As mentioned above, human sentence rankings did not differ significantly between Experiment 1 and Experiment 2 for any of the 15 texts (all Fs &lt; 1). Therefore, in order to lend more power to our statistical tests, we collapsed the data for each text for the WithParagraph and the NoParagraph condition, and treated them as one experiment.</Paragraph>
      <Paragraph position="11"> Figure 8 shows that when the data from Experiments 1 and 2 are collapsed, PageRank performed significantly better than all other algorithms except in-degree (two-tailed t-test results: MSWord: F(1, 58) = 48.717, p = 0.0001;  and human sentence rankings with collapsed data.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML