<?xml version="1.0" standalone="yes"?>
<Paper uid="W00-1438">
  <Title>An Efficient Text Summarizer Using Lexical Chains</Title>
  <Section position="3" start_page="268" end_page="269" type="metho">
    <SectionTitle>
2 A Linear Time Algorithm for Computing Lexical Chains
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="268" end_page="268" type="sub_section">
      <SectionTitle>
2.1 Overview
</SectionTitle>
      <Paragraph position="0"> Our research on lexical chains as an intermediate Synonym 1 1 0 &amp;quot;O &amp;quot; representation for automatic text summarization fol- Hypernym I 1 0 0 lows the research of Barzilay and Elhadad (1997). Hyponym 1 1 0 0 We use their results as a basis for the utility of Sibling 1 0 0 0 the methodology. The most substantial difference is that Barzilay and Elhadad create all possible chains explicitly and then choose the best possible chain, whereas we compute them implicitly.</Paragraph>
    </Section>
    <Section position="2" start_page="268" end_page="268" type="sub_section">
      <SectionTitle>
2.2 Modifications to WordNet
</SectionTitle>
      <Paragraph position="0"> the word itself. These scores are dynamic and can ....... 2~2 As mentioned above, WordNet is a lexical database that contains substantial semantic information. In order to facilitate efficient access, the WordNet noun database was re-indexed by line number as opposed to file position and the file was saved in a binary indexed format. The database access tools were then rewritten to take advantage of this new structure.</Paragraph>
      <Paragraph position="1"> The result of this work is that accesses to the Word-Net noun database can be accomplished an order of magnitude faster than with the original implementation. No additional changes to the WordNet databases were made. The re-indexing also provided a zero-based continuous numbering scheme that is important to our linear time algorithm. This importance will be noted below.</Paragraph>
      <Paragraph position="2"> Modifications ~to. Word.Net .............. be set ~ased ,on:segmentation information, dista.nce,</Paragraph>
    </Section>
    <Section position="3" start_page="268" end_page="268" type="sub_section">
      <SectionTitle>
2.3 Our Algorithm
</SectionTitle>
      <Paragraph position="0"> Step 1 For each word instance that is a noun For every sense of that word Compute all scored &amp;quot;meta-chains&amp;quot; Step 2 For each word instance Figure out which &amp;quot;meta-chain&amp;quot; it contributes most to Keep the word instance in that chain and remove it from all other Chains updating the scores of each &amp;quot;meta-chain&amp;quot;  Our basic lexical chain algorithm is described briefly in Figure 1. The algorithm takes a part of speech tagged corpus and extracts the nouns. Using WordNet to collect sense information for each of these noun instances, the algorithm then computes scored &amp;quot;nmta-chains&amp;quot; based on the collected information. A &amp;quot;meta-chain&amp;quot; is a representation of every possible lexical chain that can be computed starting with a word of a given sense. These meta-chains are scored in the following manner. As each word instance is added, its contribution, which is dependent on the scoring metrics used, is added to the &amp;quot;meta-chain&amp;quot; score. The contribution is then stored within and type of relation.</Paragraph>
      <Paragraph position="1"> Currently, segmentation is accomplished prior to using our algorithm by executing Hearst's text tiler (Hearst, 1994). The sentence numbers of each segment boundary are stored for use by our algorithm. These sentence numbers are used in conjunction with relation type as keys into a table of potential scores. Table 1 denotes sample metrics tuned to simulate the system devised by Barzilay and Elhadad (1997).</Paragraph>
      <Paragraph position="2"> At this point, the collection of &amp;quot;meta-chains&amp;quot; contalns all possible interpretations of the source document. The problem is that in our final representation, each word instance can exist in only one chain. To figure out which chain is the correct one, each word is examined.using the score contribution stored in Step 1 to determine which chain the given word instance contributes to most. By deleting the word instance from all the other chains, a representation where each word instance exists in precisely one chain remains. Consequently, the sum of the scores of all the chains is maximal. This method is analogous to finding a maximal spanning tree in a graph of noun senses. These noun senses are all of the senses of each noun instance in the document.</Paragraph>
      <Paragraph position="3"> From this representation, the highest scored chains correspond to the important concepts in the original document. These important concepts can be used to generate a summary from the source text. Barzilay and Elhadad use the notion of strong chains (i.e., chains whose scores are in excess of two standard deviations above the mean of all scores) to determine which chains to include in a summary.</Paragraph>
      <Paragraph position="4"> Our system can use this method, as well as several other methods including percentage compression and number of sentences.</Paragraph>
      <Paragraph position="5"> For a more detailed description of our algorithm please consult our previous work (Silber and McCoy, 2000).</Paragraph>
    </Section>
    <Section position="4" start_page="268" end_page="269" type="sub_section">
      <SectionTitle>
2.4 Runtime Analysis
</SectionTitle>
      <Paragraph position="0"> In this analysis, we will not consider the computational complexity of part of speech tagging, as that is not the focus of this research. Also, because the size  and structure of WordNet does not change from execution to execution of.aJae.algorit, hm, we shall take these aspects of WordNet to be constant. We will examine each phase of our algorithm to show that the extraction of these lexical chains can indeed be done in linear time. For this analysis, we define constants from WordNet 1.6 as denoted in Table 2.</Paragraph>
      <Paragraph position="1"> Extracting information from WordNet entails looking up each noun and extracting all synset, Hyponym/Hypernym, and sibling information. The runtime of these lookups over the entire document is: n * (log(Ca) + Cl * C2 + Cl * C5) When building the graph of all possible chains, we simply insert the word into all chains where a relation exists, which is clearly bounded by a constant (C6). The only consideration is the computation of the chain score. Since we store paragraph numbers represented within the chain as well as segment boundaries, we can quickly determine whether the relations are intra-paragraph, intra-segment, or adjacent segment. We then look up the appropriate score contribution from the table of metrics. Therefore, computing the score contribution of a given word is constant. The runtime of building the graph of all possible chains is: n*C6.5 Finding the best chain is equally efficient. For each word, each chain to which it belongs is examined. Then, the word is marked as deleted from all but the single chain whose score the word contributes to most. In the case of a tie, the lower sense nmnber from WordNet is used, since this denotes a more general concept. The runtime for this step is: n*C6.4 This analysis gives an overall worst case runtime of: n * 1548216 + log(94474 ) + 227370 and an average case runtime of: n * 326 + log(94474) + 275 While the constants are quite large, the algorithm is clearly O(n) in the number of nouns in the original document.</Paragraph>
      <Paragraph position="2"> At &amp;quot;first glance, &amp;quot;the'constants ~involved seem prohibitively large. Upon further analysis, however, we see that most synsets have very few parent child relations. Thus the worst case values maynot reflect the actual performance of our application. In addition, the synsets with many parent child relations tend to represent extremely general concepts. These synsets will most likely not appear very often as a direct synset for words appearing in a document.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="269" end_page="270" type="metho">
    <SectionTitle>
2.5 User Interface
</SectionTitle>
    <Paragraph position="0"> Our system currently can be used as a command line utility. The arguments allow the user to specify scoring metrics, summary length, and whether or not to search for collocations. Additionally, a web CGI interface has been added as a front end which allows a user to specify not just text documents, but html documents as well, and summarize them from the Internet. Finally, our system has been attached to a search engine. The search engine uses data from existing search engines on the Internet to download and summarize each page from the results. These summaries are then compiled and returned to the user on a single page. The final result is that a search results page is returned with automatically generated summaries.</Paragraph>
    <Section position="1" start_page="269" end_page="270" type="sub_section">
      <SectionTitle>
2.6 Comparison with Previous Work
</SectionTitle>
      <Paragraph position="0"> As mentioned above, this research is based on the work of Barzilay and Elhadad (1997) on lexical chains. Several differences exist between our method and theirs. First and foremost, the linear run-time of our algorithm allows documents to be summarized much faster. Our algorithm can summarize a 40,000 word document in eleven seconds on a Sun SPARC Ultra10 Creator. By comparison, our first version of the algorithm which computed lexical chains by building every possible interpretation like Barzilay and Elhadad took sLx minutes to extract chains from 5,000 word documents.</Paragraph>
      <Paragraph position="1"> The linear nature of our algorithm also has several other advantages. Since our algorithm is also linear in space requirements, we can consider all possible chains. Barzilay and Elhadad had to prune interpretations (enid thus chains) which did not seem promising. Our algorithm does not require pruning of chains.</Paragraph>
      <Paragraph position="2"> Our algorithm also allows us to analyze the iinportance of segmentation. Barzilay and Elhadad used segmentation to reduce the complexity of the problem of extracting chains. They basically built chains within a segment and combined these chains later when chains across segment boundaries shared a word in the same sense in common. While we include segmentation information in our algorithm, it  is merely because it might prove useful in disambiguating chains. The fact that we can use it or not allows our algorithm to test the importance of segmentation to proper-word ~ense disambiguation. It is important to note that on short documents, like those analyzed by Barzilay and Elhadad, segmentation appears to have little effect. There is some linguistic justification for this fact. Segmentation is generally computed using word frequencies, and our lexical chains algorithm generally captures the same type of information. On longer documents, our research has shown segmentation to have a much greater effect.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="270" end_page="270" type="metho">
    <SectionTitle>
3 Current Research and Future
Directions
</SectionTitle>
    <Paragraph position="0"> Some issues which are not currently addressed by this research are proper name disambiguation and anaphora resolution. Further, while we attempt to locate two-word collocations using WordNet, a more robust collocation extraction technique is warranted.</Paragraph>
    <Paragraph position="1"> One of the goals of this research is to eventually create a system which generates natural language summaries. Currently, the system uses sentence selection as its method of generation. It is our contention that regardless of how well an algorithm for extracting sentences may be, it cannot possibly create quality summaries. It seems obvious that sentence selection will not create fluent, coherent text. Further, our research shows that completeness is a problem. Because information extraction is only at the sentence boundary, information which may be very important may be left out if a highly compressed summary is required.</Paragraph>
    <Paragraph position="2"> Our current research is examining methods of using all of the important sentences determined by our lexical chains algorithm as a basis for a generation system. Our intent is to use the lexical chains algorithm to determine what to summarize, and then a more classical generation system to present the information as coherent text. The goal is to combine and condense all significant information pertaining to a given concept which can then be used in generation. null</Paragraph>
  </Section>
class="xml-element"></Paper>