<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2189">
  <Title>Ranking Text Units According to Textual Saliency, Connectivity and Topic Aptness</Title>
  <Section position="3" start_page="0" end_page="1157" type="metho">
    <SectionTitle>
2 Background
</SectionTitle>
    <Paragraph position="0"> Lexical cohesion has been widely used in text analysis for the comparative assessment of saliency and connectivity of text fragments.</Paragraph>
    <Paragraph position="1"> Following Hoey (1991), a simple way of computing lexical cohesion in a text is to segment the text into units (e.g sentences) and to count non-stop words 1 which co-occur in each pair of distinct text units, as shown in Table 2 for the text in Table 1. Text units which contain a greater number of shared non-stop words are more likely to provide a better abridgement of the original text for two reasons: * the more often a word with high informational content occurs in a text, the more topical and germane to the text the word is likely to be, and * the greater the number of times two text units share a word, the more connected they are likely to be.</Paragraph>
    <Paragraph position="2"> Text saliency and connectivity for each text unit is therefore established by summing the number of shared words associated with the text unit. According to Hoey, the number of links (e.g. shared words) across two text units must be above a certain threshold for the two text units to achieve a lexical cohesion rank. For example, if only individual scores greater than 2 1Non-stop words can be intuitively thought of as words which have high informational content. They usually exclude words with a very high fequency of occurrence, especially closed class words such as determiners, prepositions and conjunctions (Fox, 1992).</Paragraph>
  </Section>
  <Section position="4" start_page="1157" end_page="1157" type="metho">
    <Paragraph position="0"> #2# NEW YORK (Reuter) - Apple is actively looking for a friendly merger partner, according to several executives close to the company, the New York Times said on Thursday.</Paragraph>
    <Paragraph position="1"> #3# One executive who does business with Apple said Apple employees told him the company was again in talks with Sun Microsystems, the paper said.</Paragraph>
  </Section>
  <Section position="5" start_page="1157" end_page="1158" type="metho">
    <Paragraph position="0"> #4# On Wednesday, Saudi Arabia's Prince Alwaleed Bin Talal Bin Abdulaziz Al Saud said he owned more than five percent of the computer maker's stock, recently buying shares on the open market for a total of $115 million.</Paragraph>
    <Paragraph position="1"> #5# Oracle Corp Chairman Larry Ellison confirmed on March 27 he had formed an independent investor group to gauge interest in taking over Apple.</Paragraph>
    <Paragraph position="2"> #6# The company was not immediately available to comment.</Paragraph>
    <Paragraph position="3"> Table h Sample text with numbered text units  pairs.</Paragraph>
    <Paragraph position="4"> are taken into account, the final scores and consequent ranking order computable from Table 2 are: . first: text unit #2# (final score: 7); * second: text unit #3# (final score: 4), and * third: text unit #1# (final score: 3). A text abridgement can be obtained by selecting text units in ranking order according to the text percentage specified by the user. For example, a 35% abridgement of the text in Table 2 would result in the selection of text units #2# and #3#.</Paragraph>
    <Paragraph position="5"> As Hoey points out, additional techniques can be used to refine the assessment of lexical cohesion. A typical example is the use of thesaurus functions such as synonymy and hyponymy to extend the notion of word sharing across text units, as exemplified in Hirst and St-Onge (1997) and Barzilay and Elhadad (1997) with reference to WordNet (Miller et al., 1990). Such an extension may improve on the assessment of textual saliency and connectivity thus providing better generic summaries, as argued in Barzilay and Elhadad (1997).</Paragraph>
    <Paragraph position="6"> There are basically two problems with the uses of lexical cohesion for summarization reviewed above. First, the basic algorithm requires that (i) all unique pairwise permutations of distinct text units be processed, and (ii) all cross-sentence word combinations be evaluated for each such text unit pair. The complexity of this algorithm will therefore be O(n 2 * m 2) for n text units in a text and m words in a text unit of average length in the text at hand. This estimate may get worse as conditions such as synonymy and hyponymy are checked for each word pair to extend the notion of lexical cohesion, e.g. using WordNet as in Barzilay and E1hadad (1997). Consequently, the approach may not be suitable for on-line use with longer input texts. Secondly, the use of thesauri envisaged in both Hirst and St-Onge (1997) and Barzilay and Elhadad (1997) does not address the question of topical aptness. Thesaural relations such as synonymy and hyponymy are meant to capture word similarity in order to assess lexical cohesion among text units, and not to provide a thematic characterization of text units. 2 Consequently, it will not be possible to index and retrieve text units in term of topic aptness according to users' needs. In the remaining part of the paper, we will show how these concerns of efficiency and thematic characterization can be addressed with specific reference to a system performing generic and query-based indicative 2Notice incidentally that such thematic characterization could not be achieved using thesauri such as Word-Net since since WordNet does not provide an arrangement of synonym sets into classes of discourse topics (e.g. finance, sport, health).</Paragraph>
    <Paragraph position="7">  summaries.</Paragraph>
  </Section>
  <Section position="6" start_page="1158" end_page="1159" type="metho">
    <SectionTitle>
3 An Efficient Method for Computing Lexical Cohesion
</SectionTitle>
      <Paragraph position="0"> The method we are about to describe comprises three phases: * a preparatory phase where the input text undergoes a number of normalizations so as to facilitate the process of assessing lexical cohesion; * an indexing phase where the sharing of elements indicative of lexical cohesion is assessed for each text unit, and * a ranking phase where the assessment of lexical cohesion carried out in the indexing phase is used to rank text units.</Paragraph>
    <Section position="2" start_page="1158" end_page="1158" type="sub_section">
      <SectionTitle>
3.1 Preparatory Phase
</SectionTitle>
      <Paragraph position="0"> During the preparatory phase, the text undergoes a number of normalizations which have the purpose of facilitating the process of computing lexical cohesion, including:  * removal of formatting commands * text segmentation, i.e. breaking the input text into text units * part-of-speech tagging * recognition of proper names * recognition of multi-word expressions * removal of stop words * word tokenization, e.g. lemmatization.</Paragraph>
    </Section>
    <Section position="3" start_page="1158" end_page="1159" type="sub_section">
      <SectionTitle>
3.2 Indexing Phase
</SectionTitle>
      <Paragraph position="0"> In providing a solution for the efficiency problem, our aim is to compute lexical cohesion for all text units in a text without having to process all cross-sentence word combinations for all unique and distinct pair-wise text unit permutations. To achieve this objective, we index each text unit with reference to each word occurring in it and reverse-index each such word with reference to all other text units in which the word occurs, as shown in Table 3 for text unit #2#. The sharing of words can then be measured by counting all occurrences of identical text units linked to the words associated with the &amp;quot;head&amp;quot; text unit (#2# in Table 3), as shown in Table 4. By repeating the two opera- null I &lt; company {#3#,#6#} &gt; #2# &lt; executive {#3#} &gt; &lt; look {#1#} &gt; &lt; partner {#i#} &gt;  which text unit #2# has with all other text units tions described above for each text unit in the text shown in Table 1, we will obtain a table of lexical cohesion links equivalent to that shown on Table 2.</Paragraph>
      <Paragraph position="1"> According to this method, we are still processing pair-wise permutations of text units to collect lexical cohesion links as shown in Table 4. However, there are two important differences with the original algorithm. First, noncohesive text units are not taken into account (e.g. the pair #2#-#4# in the example under analysis); therefore, on average the number of text unit permutations will be significantly smaller than that processed in the original algorithm. With reference to the text in Table 1, for example, we would be processing 7 text unit permutations less which is over 41% of the number of text unit permutations which need computing according to the original algorithm, as shown in Table 2. Secondly, although pair-wise text unit combinations are still processed, we avoid doing so for all cross-sentence word permutations. Consequently, the complexity of the algorithm is O(n 2 * m) for n text units in a text and m words in a text unit of average length in the text as compared to O(n 2 , m 2) for the original algorithm. 3 ZA further improvement yet would be to avoid counting lexical cohesion links per text unit as in Table 4, and just sum all text unit occurrences associated with reversed-indexed words in structures such as those in Table 3, e.g. the lexical cohesion score for text unit #2# would simply be 9. This would remove the need of processing pair-wise text unit permutations for the assessment of lexical cohesion links, thus bringing the complexity clown to O(n * m). Such further step, however, would preempt the possibility of excluding lexical cohesion scores for text unit pairs which are below a given threshold.</Paragraph>
      <Paragraph position="2">  * Let TRSH be the lexical cohesion threshold TU be the current text unit LC Tu be the current lexical cohesion score of TU (i.e. LC Tv is the count of tokenized words TU shares with some other text unit) - CLevel. be the level of the current lexical cohesion score calculated as the difference between LC Tv and TRSH - Score be the lexical cohesion score previously assigned TU (if any) - Level be the level for the lexical cohesion score previously assigned to TU (if any) - if LC TU -~ 0, then do nothing - else~ if the scoring structure (Level, TU, Score) exists, then * if Level &gt; CLevel, then do nothing . else, if Level = CLevel, then the new scoring structure is (Level, TU, Score + LC Tu ) * else, if CLevel &gt; 0, then * if Level &gt; 0, then the new scoring structure is (1, TU, Score + LC TU) * else, if Level &lt; O, then the new scoring structure is (1, TU, LC TU) . else the new scoring structure is (CLevel, TU, LC ~'u) - else * if CLevel &gt; 0, then create the scoring structure (1, TU, LC Tu) * else create the scoring structure ( C Level, TU, LC T~\] )</Paragraph>
    </Section>
    <Section position="4" start_page="1159" end_page="1159" type="sub_section">
      <SectionTitle>
3.3 Ranking Phase
</SectionTitle>
      <Paragraph position="0"> Each text unit is ranked with reference to the total number of lexical cohesion scores collected, such as those shown in Table 4. The objective of such a ranking process is to assess the import of each score and combine all scores into a rank for each text unit. In performing this assessment, provisions are made for a threshold which specifies the minimal number of links required for text units to be lexically cohesive, following Hoey's approach (see SS1). The procedure outlined in Table 5 describes the scoring methodology adopted. Ranking a text unit according to this procedure involves adding the lexical cohesion scores associated with the text unit which are either  * Costant values - TRSH = 2 - TU = $2# * Scoring text unit #2$ - Lexical cohesion with text unit #6# * LC TU = 1 . CLevel -- -1 (i.e. LC Tu- TRSH) * no previous scoring structure . current scoring structure: (-1,#2#, 1) - Lexical cohesion with text unit #S# * LC TU ~. 1 * CLevel = -1 . previous scoring structure: i-l, #2#, 1) . current scoring structure: (-1, #2#, 2) - Lexical cohesion with text unit #3# * LC Tu = 4 * CLevel = 2 . previous scoring structure: i-I, #2#, 2) . current scoring structure: (0, #25, 4) - Lexical cohesion with text unit #1# * LC TU = 3 . CLevel = 1 . previous scoring structure: (1, #2#, 4) * final scoring structure: (1, #2#, 7)  sion.</Paragraph>
      <Paragraph position="1"> * above the threshold, or * below the threshold and of the same magnitude. null If the threshold is 0, then there is a single level and the final score is the sum of all scores. Suppose for example, we are ranking text units #2# with reference to the scores in Table 4 with a lexical cohesion threshold of 2. In this case we apply the ranking procedure in Table 5 to each score in Table 4, as shown in Table 6. Following this procedure for all text units in Table 1, we will obtain the ranking in Table 7.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="1159" end_page="1161" type="metho">
    <SectionTitle>
4 Assessing Topic Aptness
</SectionTitle>
    <Paragraph position="0"> When used with a dictionary database providing information about the thematic domain of words (e.g. business, politics, sport), the same method can be slightly modified to compute lexical cohesion with reference to discourse topics rather than words. Such an application makes  viding subject domain information.</Paragraph>
    <Paragraph position="1"> it possible to detect the major topics of a document automatically and to assess how well each text unit represents these topics.</Paragraph>
    <Paragraph position="2"> In our implementation, we used the &amp;quot;subject domain codes&amp;quot; provided in the machine readable version of CIDE (Cambridge International Dictionary of English (Procter, 1995)). Table 8 provides an illustrative example of the information used. Both the indexing and ranking phases are carried out with reference to subject domain codes rather than words.</Paragraph>
    <Paragraph position="3"> As shown in Table 9 for text unit #1#, the indexing procedure provides a record of the sub-ject domain codes occurring in each text unit; each such subject code is reverse-indexed with reference to all other text units in which the subject code occurs. In addition, a record of which word originates which cohesion link is kept for each text unit index. The main function of keeping track of this information is to avoid counting lexical cohesion links generated by overlapping domain codes which relate to the same word -- for words associated with more than one code. Such provision is required in order to avoid, or at least reduce the chances of, counting codes which are out of context, that is codes which relate to senses of the word other than the intended sense. For example, the word partner occurring in the first two text units of the text in Table 1 is associated with four dif- null codes with pointers to the other text units in which they occur.</Paragraph>
    <Paragraph position="4">  induced by subject domain codes for text unit #I#.</Paragraph>
    <Paragraph position="5"> ferent subject codes pertaining to the domains of Dance (DA), Finance (F), Marriage (M) and Team Games (TG), as shown in Table 8. However, only the Finance reading is appropriate in the given context. If we count the cohesion links generated by partner we would therefore count three incorrect cohesion links. By excluding all four cohesion links, the inclusion of contextually inappropriate cohesion links is avoided. Needless to say, we will also throw away the correct cohesion link (F in this case). However, this loss can be redressed if we also compute lexical cohesion links generated from shared words across text units as discussed in SS2, and combine the results with the lexical cohesion ranks obtained with subject domain codes.</Paragraph>
    <Paragraph position="6"> The lexical cohesion links for text unit #1# will therefore be scored as shown in Table 10, where associations between link scores and relevant codes as well as the words generating them are maintained. As can be observed, only the appropriate code expansion F (Finance) for the words partner and company is taken into account. This is simply because F is the only code shared by the two words (see Table 8).</Paragraph>
    <Paragraph position="7"> As mentioned earlier, lexical cohesion links induced by subject domain scores can be used to rank text units using the procedure shown in Table 5. Other uses include providing a topic profile of the text and an indication of how well each text unit represents a given topic. For example, the code BZ (Business &amp; Commerce) is associated with the words:  After calculating the lexical cohesion links for all text units following the method illustrated in Tables 9-10 for text unit #1#, the links scored for the code BZ will be as shown in Table 11. By repeating this operation for all codes for which there are lexical cohesion scores -- F, FA, IV and CN for the text under analysis -- we could then count all text unit pairs which each code relates, as shown in Table 12. The relations between subject domain codes and text unit pairs in Table 12 can subsequently be turned into percentage ratios to provide a topic/theme profile of the text as shown in Table 13.</Paragraph>
    <Paragraph position="8"> By keeping track of the links among text units, relevant codes and their originating words, it is also possible to retrieve text units on the basis of specific subject domain codes or specific words. When retrieving on specific  according to the distribution of subject domain codes across text units shown in Table 12.</Paragraph>
    <Paragraph position="9"> words, there is also the option of expanding the word into subject domain codes and using these to retrieve text units. The retrieved text units can then be ordered according to the ranking order previously computed.</Paragraph>
  </Section>
  <Section position="8" start_page="1161" end_page="1162" type="metho">
    <SectionTitle>
5 Applications, Extensions and
Evaluation
</SectionTitle>
    <Paragraph position="0"> An implementation of this approach to lexical cohesion has been used as the driving engine of a summarization system developed at SHARP Laboratories of Europe. The system is designed to handle requests for both generic and query-based indicative summaries. The level-based differentiation of text units obtained through the ranking procedure discussed in SS3.3, is used to select the most salient and better connected portion of text units in a text corresponding to the summary ratio requested by the user. In addition, the user can display a topic profile of the input text, as shown in Table 13 and choose whichever code(s) s/he is interested in, specify a summary ratio and retrieve the wanted portion of the text which best represents the topic(s) selected. Query-based summaries can also be issued by entering keywords; in this case there is the option of expanding key-words into codes and use these to issue a summary query.</Paragraph>
    <Paragraph position="1"> The method described can also be used to develop a conceptal indexing component for information retrieval, following Dobrov et al. (1997). Because an attempt is made to prune contextually inappropriate sense expansions of words, the present method may help reducing the ambiguity problem.</Paragraph>
    <Paragraph position="2"> Possible improvements of this approach can be implemented taking into account additional ways of assessing lexical cohesion such as: * the presence of synonyms or hyponyms across text units (Hoey, 1991; Hirst and St-Onge, 1997; Barzilay and Elhadad 1997);  * the presence of lexical cohesion established with reference to lexical databases offering a semantic classification of words other than synonyms, hyponyms and subject domain codes; * the presence of near-synonymous words across text units established by using a method for estimating the degree of semantic similarity between word pairs such as the one proposed by Resnik (1995); * the presence of anaphoric links across text units (Hoey, 1991; Boguraev &amp; Kennedy, 1997), and * the presence of formatting commands as indicators of the relevance of particular types of text fragments.</Paragraph>
    <Paragraph position="3"> To evaluate the utility of the approach to lexical cohesion developed for summarization, a testsuite was created using 41 Reuter's news stories and related summaries (available at http ://www. yahoo, com/headlines/news/), by annotating each story with best summary lines. In one evaluation experiment, summary ratio was set at 20% and generic summaries were obtained for the 41 texts. On average, 60~0 of each summary contained best summary lines. The ranking method used in this evaluation was based on combined lexical cohesion scores based on lemmas and their associated subject domain codes given in CIDE. Summary results obtained with the Autosummarize facility in Microsoft Word 97 were used as baseline for comparison. On average, only 30% of each summary in Word 97 contained best summary lines. In future work, we hope to corroborate these results and to extend their validity with reference to query-based indicative summaries using the evaluation framework set within the context of SUMMAC (Automatic Text Summarization Conference, see http ://www. tipster, org/).</Paragraph>
  </Section>
</Paper>