<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1123">
  <Title>Linear Segmentation and Segment Significance</Title>
  <Section position="1" start_page="0" end_page="199" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> We present a new method for discovering a segmental discourse structure of a document while categorizing each segment's function and importance. Segments are determined by a zero-sum weighting scheme, used on occurrences of noun phrases and pronominal forms retrieved from the document. Segment roles are then calculated from the distribution of the terms in the segment. Finally, we present results of evaluation in terms of precision and recall which surpass earlier approaches.</Paragraph>
    <Paragraph position="1"> Introduction Identification of discourse structure can be extremely useful to natural language processing applications such as automatic text summarization or information retrieval (IR). For example, a summarization agent might choose to summarize each discourse segment separately. Also, segmentation of a document into blocks of topically similar text can assist a search engine in choosing to retrieve or highlight a segment in which a query term occurs. In this paper, we present a topical segmentation program that achieves a 10% increase in both precision and recall over comparable previous work.</Paragraph>
    <Paragraph position="2"> In addition to segmenting, the system also labels the function of discovered discourse segments as to their relevance towards the whole. It identifies 1) segments that contribute some detail towards the main topic of the input, 2) segments that summarize the key points, and 3) segments that contain less important information. We evaluated our segment classification as part of a summarization system that utilizes highly pertinent segments to extract key sentences.</Paragraph>
    <Paragraph position="3"> This material is based upon work supported by the National Science Foundation under Grant No. (NSF #IRI-9618797) and by the Columbia University Center for Research on Information Access.</Paragraph>
    <Paragraph position="4"> We investigated the applicability of this system on general domain news articles. Generally, we found that longer articles, usually beyond a three-page limit, tended to have their own prior segmentation markings consisting of headers or bullets, so these were excluded. We thus concentrated our work on a corpus of shorter articles, averaging roughly 800-1500 words in length: 15 from the Wall Street Journal in the  5 from the on-line The Economist from 1997. We constructed an evaluation standard from human segmentation judgments to test our output.</Paragraph>
    <Paragraph position="5"> 1 SEGMENTER: Linear Segmentation  For the purposes of discourse structure identification, we follow a formulation of the problem similar to Hearst (1994), in which zero or more segment boundaries are found at various paragraph separations, which identify one or more topical text segments. Our segmentation is linear, rather than hierarchical (Marcu 1997 and Yaari 1997), i.e. the input article is divided into a linear sequence of adjacent segments.</Paragraph>
    <Paragraph position="6">  Our segmentation methodology has three distinct phases (Figure 1), which are executed sequentially. We will describe each of these phases in detail.</Paragraph>
    <Paragraph position="7"> [Figure 1: the three sequential phases of the segmentation methodology; the stage labels recoverable from the figure are Weigh and Score.]</Paragraph>
    <Paragraph position="9"/>
    <Section position="1" start_page="197" end_page="197" type="sub_section">
      <SectionTitle>
1.1 Extracting Useful Tokens
</SectionTitle>
      <Paragraph position="0"> The task of determining segmentation breaks depends fundamentally on extracting useful topic information from the text. We extract three categories of information, which reflect the topical content of a text, to be referred to as terms for the remainder of the paper: 1. proper noun phrases; 2. common noun phrases; 3. personal and possessive pronouns.</Paragraph>
      <Paragraph position="1"> In order to find these three types of terms, we first tag the text with part of speech (POS) information. Two methods were investigated for assigning POS tags to the text: 1) running a specialized tagging program or 2) using a simple POS table lookup. We chose to use the latter to assign tags for time efficiency reasons (since the segmentation task is often only a preprocessing stage), but optimized the POS table to favor high recall of the three term types whenever possible. The resulting system was faster than the initial prototype that used the former approach by more than an order of magnitude, with a slight decline in precision that was not statistically significant. However, if a large system requires accurate tags after segmentation and the cost of tagging is not an issue, then tagging should be used instead of lookup.</Paragraph>
      <Paragraph position="2"> We based our POS table lookup on NYU's COMLEX (Grishman et al. 1994). After simplifying COMLEX's categories to reflect only the information important to our three term types, we flattened all multi-category words (e.g. "jump" as V or N) to a single category by a strategy motivated to give high term recall (e.g. "jump" maps to N, because NP is a term type). Once POS tags have been assigned, we can retrieve occurrences of noun phrases by searching the document for this simple regular expression: (Adj | Noun)* Noun. This expression captures a simple noun phrase without any complements. More complex noun phrases such as "proprietor of Stag's Leap Wine Cellars in Napa Valley" are captured as three different phrases: "proprietor", "Stag's Leap Wine Cellars" and "Napa Valley". We deliberately made the regular expression less powerful to capture as many noun phrases as possible, since the emphasis is on high NP recall.</Paragraph>
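      <Paragraph> A minimal Python sketch of the noun phrase pattern above, assuming the table lookup has already produced (word, tag) pairs. The tag names Adj and Noun follow the expression in the text; the function name extract_noun_phrases, the Other tag, and the tagging of the possessive "Stag's" as a noun are assumptions made for illustration.
<![CDATA[
```python
def extract_noun_phrases(tagged):
    """Return maximal (Adj | Noun)* Noun spans from (word, tag) pairs."""
    phrases = []
    run = []  # current run of adjacent Adj/Noun tokens
    # A sentinel token with a non-matching tag flushes the final run.
    for word, tag in list(tagged) + [("", "Other")]:
        if tag in ("Adj", "Noun"):
            run.append((word, tag))
            continue
        # The pattern must end in a Noun, so trim trailing adjectives.
        while run and run[-1][1] != "Noun":
            run.pop()
        if run:
            phrases.append(" ".join(w for w, _ in run))
        run = []
    return phrases


tagged = [
    ("proprietor", "Noun"), ("of", "Other"),
    ("Stag's", "Noun"), ("Leap", "Noun"), ("Wine", "Noun"), ("Cellars", "Noun"),
    ("in", "Other"),
    ("Napa", "Noun"), ("Valley", "Noun"),
]
print(extract_noun_phrases(tagged))
```
]]>
      On the example phrase from the text, the sketch yields the same three phrases: "proprietor", "Stag's Leap Wine Cellars" and "Napa Valley".</Paragraph>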
      <Paragraph position="3"> After retrieving the terms, a post-processing phase combines related tokens together. For possessive pronouns, we merge each possessive with its appropriate personal pronoun ("my" or "mine" with "I", etc.). For noun phrases, we canonicalize noun phrases according to their heads. For example, if the noun phrases "red wine" and "wine" are found in a text, we subsume the occurrences of "red wine" into the occurrences of "wine", under the condition that there are no other "wine" headed phrases, such as "white wine". Finally, we perform thresholding to filter irrelevant words, following the guidelines set out by Justeson and Katz (1995). We use a frequency threshold of two occurrences to determine topicality, and discard any pronouns or noun phrases that occur only once.</Paragraph>
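      <Paragraph> The head canonicalization and frequency thresholding described above might be sketched as follows. This is one reading of the text, not the original implementation: the head is assumed to be the last word of a phrase, and the function name canonicalize_and_filter and its min_count parameter are hypothetical.
<![CDATA[
```python
from collections import Counter


def canonicalize_and_filter(phrases, min_count=2):
    """Merge a lone modified phrase into its bare head, then drop rare terms."""
    counts = Counter(phrases)
    # Group the distinct multi-word phrases by head (assumed: the last word).
    modified_by_head = {}
    for phrase in counts:
        words = phrase.split()
        if len(words) > 1:
            modified_by_head.setdefault(words[-1], set()).add(phrase)

    merged = Counter()
    for phrase, n in counts.items():
        head = phrase.split()[-1]
        variants = modified_by_head.get(head, set())
        # Subsume "red wine" into "wine" only if the bare head occurs and
        # "red wine" is the sole modified phrase with that head.
        if phrase in variants and head in counts and len(variants) == 1:
            merged[head] += n
        else:
            merged[phrase] += n
    # Frequency threshold: keep terms seen at least min_count times.
    return {term: n for term, n in merged.items() if n >= min_count}


print(canonicalize_and_filter(["red wine", "wine", "cellar"]))
```
]]>
      Here "red wine" is folded into "wine", which then reaches the threshold of two occurrences, while the single occurrence of "cellar" is discarded.</Paragraph>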
    </Section>
    <Section position="2" start_page="197" end_page="199" type="sub_section">
      <SectionTitle>
1.2 Weighting Term Occurrences
</SectionTitle>
      <Paragraph position="0"> Once extracted, terms are then evaluated to arrive at segmentation.</Paragraph>
      <Paragraph position="1"> Given a single term (noun phrase or pronominal form) and the distribution of its occurrences, we link related occurrences together. We use proximity as our metric for relatedness. If two occurrences of a term occur within n sentences, we link them together as a single unit, and repeat until no larger units can be built. This idea is a simpler interpretation of the notion of lexical chains. Morris and Hirst (1991) first proposed this notion to chain semantically related words together via a thesaurus, while we chose only repetition of the same stem word.</Paragraph>
      <Paragraph position="2"> However, for these three categories of terms we noticed that the linking distance differs depending on the type of term in question, with proper nouns having the maximum allowable distance and the pronominal forms having the least. Proper nouns generally refer to the same entity, almost regardless of the number of intervening sentences. Common nouns often have a much shorter scope of reference, since a single token can be used to repeatedly refer to different instances of its class. Personal pronouns scope even more closely, as is expected of an anaphoric or referring expression where the referent can be, by definition, different over an active discourse. Any term occurrences that were not linked were then dropped from further consideration. Thus, link length or linking distance refers to the number of sentences allowed to intervene between two occurrences of a term.</Paragraph>
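      <Paragraph> The linking step can be sketched in Python as below, assuming each occurrence of a term is recorded by its sentence number. The function name build_links is hypothetical, and the link length of 4 in the example is illustrative rather than a trained value.
<![CDATA[
```python
def build_links(sentence_ids, link_length):
    """Group occurrences into links, returned as (first, last) sentence pairs.

    Two neighbouring occurrences are linked when at most `link_length`
    sentences intervene between them; occurrences left alone are dropped.
    """
    links = []
    current = []
    for sid in sorted(sentence_ids):
        # Sentences strictly between this occurrence and the previous one.
        if current and sid - current[-1] - 1 > link_length:
            if len(current) > 1:
                links.append((current[0], current[-1]))
            current = []
        current.append(sid)
    if len(current) > 1:
        links.append((current[0], current[-1]))
    return links


print(build_links([3, 6, 20, 22, 23, 40], link_length=4))
```
]]>
      With these inputs the occurrences in sentences 3 and 6 form one link, those in sentences 20 to 23 form a second, and the isolated occurrence in sentence 40 is dropped. A per-type link length (longest for proper nouns, shortest for pronominal forms) would be passed in as link_length.</Paragraph>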
      <Paragraph position="3">  After links are established, weighting is assigned. Since paragraph level boundaries are not considered in the previous step, we now label each paragraph with its positional relationship to each term's link(s). We describe these four categories for paragraph labeling and illustrate them in the figure below.</Paragraph>
      <Paragraph position="4"> Front: a paragraph in which a link begins.</Paragraph>
      <Paragraph position="5"> During: a paragraph in which a link occurs, but is not a front paragraph.</Paragraph>
      <Paragraph position="6"> Rear: a paragraph in which a link just stopped occurring the paragraph before.</Paragraph>
      <Paragraph position="7"> No link: any remaining paragraphs.</Paragraph>
      <Paragraph position="8"> [Figure 2a: A term "wine", and its occurrences and type. Rows: paras 1 2 3 4 5 7 8; sents 12345678901234567890123456789012345; wine: 1xx1 1x21; type: n f d r n f d.] We also tried to semantically cluster terms by using WordNet 1.5 (Miller et al. 1990) with edge counting to determine relatedness, as suggested by Hearst (1997). However, results showed only minor improvement in precision and over a tenfold increase in execution time.</Paragraph>
      <Paragraph position="9"> Figure 2a shows the algorithm as developed thus far in the paper, operating on the term "wine". The term appears a total of six times, as shown by the numbers in the central row. These occurrences have been grouped together into two term links, as joined by the "x"s. The bottom "type" line labels each paragraph with one of the four paragraph relations. We see that it is possible for a term to have multiple front or rear paragraphs, as illustrated, since a term's occurrences might be separated between disparate links.</Paragraph>
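      <Paragraph> The four paragraph labels might be assigned as in the following sketch. The text does not say which label wins when a paragraph qualifies for more than one, so the precedence front, then during, then rear is an assumption, as are the function name label_paragraphs and the paragraph boundaries and link positions in the example, which were chosen only to yield the same label sequence as the type row of the wine example.
<![CDATA[
```python
def label_paragraphs(paragraphs, links):
    """Label each paragraph f (front), d (during), r (rear) or n (no link).

    paragraphs: list of (first_sentence, last_sentence) pairs, in order.
    links: list of (start_sentence, end_sentence) pairs for one term.
    """
    def para_of(sentence):
        for i, (first, last) in enumerate(paragraphs):
            if first <= sentence <= last:
                return i
        raise ValueError(f"sentence {sentence} is outside every paragraph")

    labels = ["n"] * len(paragraphs)
    spans = [(para_of(start), para_of(end)) for start, end in links]
    # Assumed precedence: apply the weakest label first so that
    # during overrides rear, and front overrides both.
    for _, last_p in spans:
        if last_p + 1 < len(paragraphs):
            labels[last_p + 1] = "r"
    for first_p, last_p in spans:
        for p in range(first_p + 1, last_p + 1):
            labels[p] = "d"
    for first_p, _ in spans:
        labels[first_p] = "f"
    return labels


# Seven hypothetical paragraphs of five sentences each.
paragraphs = [(1, 5), (6, 10), (11, 15), (16, 20), (21, 25), (26, 30), (31, 35)]
print(label_paragraphs(paragraphs, [(7, 12), (27, 33)]))
```
]]>
      The second link runs into the final paragraph, so it produces no rear paragraph.</Paragraph>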
      <Paragraph position="10"> Then, for each of the four categories of paragraph labeling mentioned before, and for each of the three term types, we assign a different segmentation score, listed in Table 1, whose values were derived by training, to be discussed in section 1.2.4.</Paragraph>
      <Paragraph position="12"> [Table 1: ... of weighting and linking scheme used in SEGMENTER; starred scores to be calculated later.] For noun phrases, we assume that the introduction of the term is a point at which a new topic may start; this is Youmans's (1991) Vocabulary Management Profile. Similarly, when a term is no longer being used, as in rear paragraphs, the topic may be closed. This observation may not be as direct as "vocabulary introduction", and thus presumably not as strong a marker of topic change as the former. Moreover, paragraphs in which the link persists throughout indicate that a topic continues; thus we see a negative score assigned to during paragraphs. When we apply the same paragraph labeling to pronoun forms, the same rationale applies with some modifications. Since the majority of pronoun referents occur before the pronoun (i.e. anaphoric as opposed to cataphoric), we do not weigh the front boundary heavily, but instead place the emphasis on the rear.</Paragraph>
      <Paragraph position="13"> When we iterate the weighting process described above over each term, and total the scores assigned, we arrive at a numerical score indicating which paragraphs are more likely to be a topical boundary. The higher the numerical score, the higher the likelihood that the paragraph is the beginning of a new topical segment. The question then is what the threshold should be. [Figure 2b: ... assignment to paragraphs. Rows: paras 1 2 3 4 5 7 8; sents 12345678901234567890123456789012345; wine: 1xx1 1x21; type: n f d r n f d; score: * 10 -3 8 * 10 -3; sum to balance in zero-sum weighting: +12; zero: -6 10 -3 8 -6 10 -3.]</Paragraph>
      <Paragraph position="14"> To solve this problem, we zero-sum the weights for each individual term. To do this, we first sum all scores assigned to the front, rear and during paragraphs, and then evenly distribute the negative of this sum to the remaining no link paragraphs. This ensures that the weights assigned for each term sum to zero, and thus that the weighting of the entire article also sums to zero. In cases where no link paragraphs do not exist for a term, we cannot perform zero-summing and take the scores assigned as is, but this happens in a small minority of cases. This process of weighting followed by zero-summing is shown by extending the "wine" example, in Figure 2b, as indicated by the score and zero lines.</Paragraph>
      <Paragraph position="15"> With respect to individual paragraphs, the summed score results in a positive or negative total. A positive score indicates a boundary, i.e. the beginning of a new topical segment, whereas a negative score indicates the continuation of a segment. This use of zero sum weighting makes the problem of finding a threshold trivial, since the data is normalized around the value zero.</Paragraph>
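      <Paragraph> A minimal sketch of zero-sum weighting for one term, followed by totaling over terms. The front, during and rear scores of 10, -3 and 8 are the ones that appear in the score row of the wine example; the shorter label sequences, the second term, and the function names zero_sum_scores and paragraph_totals are made up for illustration.
<![CDATA[
```python
def zero_sum_scores(labels, weights):
    """Score one term's paragraphs, then balance the total on the no-link ones."""
    # Labels without a weight (the no-link label "n") get None for now.
    scores = [weights.get(label) for label in labels]
    no_link = [i for i, s in enumerate(scores) if s is None]
    if not no_link:
        # No paragraph can absorb the balance: keep the scores as assigned.
        return [float(s) for s in scores]
    balance = -sum(s for s in scores if s is not None) / len(no_link)
    return [balance if s is None else float(s) for s in scores]


def paragraph_totals(per_term_scores):
    """Sum the per-term scores paragraph by paragraph."""
    return [sum(column) for column in zip(*per_term_scores)]


weights = {"f": 10, "d": -3, "r": 8}  # front, during, rear
first_term = zero_sum_scores(list("nfdrn"), weights)
second_term = zero_sum_scores(list("fdrnn"), weights)  # hypothetical second term
totals = paragraph_totals([first_term, second_term])
print(first_term)
print(totals)
```
]]>
      Each term's scores sum to zero, so the article totals do as well; a paragraph with a positive total is then read as a candidate boundary and one with a negative total as a continuation.</Paragraph>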
      <Paragraph position="16"> 1.2.4 Finding Local Maxima Examination of the output indicated that for long and medium length documents, zero-sum weighting would yield good results. However, for the documents we investigated, namely documents of short length (800-1500 words), we have observed that multiple consecutive paragraphs, all with a positive summed score, actually only have a single, true boundary. In these cases, we take the maximal valued paragraph for each of these clusters of positive valued paragraphs as the only segment boundary. Again, this only makes sense for paragraphs of short length, where the distribution of words would smear the segmentation values across paragraphs. In longer length documents, we do not expect this phenomenon to occur, and thus this process can be skipped. After finding local maxima, we arrive at the finalized segment boundaries.</Paragraph>
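      <Paragraph> The local maxima step can be sketched as follows: within each run of consecutive paragraphs with a positive summed score, only the highest-scoring paragraph is kept as a boundary. The function name pick_boundaries and the example scores are hypothetical, and ties are resolved in favor of the earliest paragraph, which the text does not specify.
<![CDATA[
```python
def pick_boundaries(scores):
    """Return the index of the top-scoring paragraph in each positive run."""
    boundaries = []
    run = []  # indices of the current run of positive-scored paragraphs
    # A trailing sentinel score of zero closes a run that reaches the end.
    for i, score in enumerate(list(scores) + [0.0]):
        if score > 0:
            run.append(i)
        elif run:
            boundaries.append(max(run, key=lambda j: scores[j]))
            run = []
    return boundaries


print(pick_boundaries([-6, 4, 9, 2, -5, 3, -1]))
```
]]>
      In this example paragraphs 1 to 3 (counting from zero) form one positive cluster, of which only paragraph 2 is kept, and paragraph 5 is kept as a cluster of its own.</Paragraph>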
    </Section>
    <Section position="3" start_page="199" end_page="199" type="sub_section">
      <SectionTitle>
1.3 Algorithm Training
</SectionTitle>
      <Paragraph position="0"> To come up with the weights used in the segmentation algorithm and to establish the position criteria used later in the segment relevance calculations, we split our corpus of articles into four sets and performed 4-fold cross validation training, intentionally keeping the five Economist articles together in one set to check for domain specificity.</Paragraph>
      <Paragraph position="1"> Our training phase consisted of running the algorithm with a range of different parameter settings to determine the optimal settings. We tried a total of 5 x 5 x 3 x 3 = 225 group settings for the four variables (front, rear, during weights and linking length settings) for each of the three (common nouns, proper nouns and pronoun forms) term types. The results of each run were compared against a standard of user segmentation judgments, further discussed in Section 3.</Paragraph>
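      <Paragraph> The size of the parameter grid for a single term type can be checked with a short sketch. The candidate values below are hypothetical, as is the assumption that the front and rear weights take five values each while the during weight and link length take three each; the text gives only the product.
<![CDATA[
```python
from itertools import product

# Hypothetical candidate values for one term type.
front_weights = [6, 8, 10, 12, 14]
rear_weights = [4, 6, 8, 10, 12]
during_weights = [-5, -3, -1]
link_lengths = [2, 4, 8]

# Every combination of the four variables is one group setting.
settings = list(product(front_weights, rear_weights, during_weights, link_lengths))
print(len(settings))  # 225
```
]]>
      Each tuple in settings would be run through the segmenter and scored against the human segmentation judgments.</Paragraph>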
      <Paragraph position="2"> The results showed that a sizable group of settings (approximately 10%) produced very close to optimal results. This group of settings was identical across all four cross validation training runs, so we believe the algorithm is fairly robust, but we cannot safely conclude this without constructing a more extensive training/testing corpus.</Paragraph>
    </Section>
  </Section>
</Paper>