<?xml version="1.0" standalone="yes"?>
<Paper uid="P93-1041">
  <Title>TEXT SEGMENTATION BASED ON SIMILARITY BETWEEN WORDS</Title>
  <Section position="4" start_page="0" end_page="287" type="metho">
    <SectionTitle>
SEGMENTS AND COHERENCE
</SectionTitle>
    <Paragraph position="0"> Several methods to capture segment boundaries have been proposed in studies of text structure. For example, cue phrases play an important role in signaling segment changes (Grosz and Sidner, 1986). However, such clues are not directly based on coherence, which binds clauses or sentences into a segment.</Paragraph>
    <Paragraph position="1"> Youmans (1991) proposed VMP (vocabulary management profile) as an indicator of segment boundaries. VMP is a record of the number of new vocabulary terms introduced in an interval of text. However, VMP does not work well on a high-density text. The reason is that coherence of a segment should be determined not only by reiteration of words but also by lexical cohesion.</Paragraph>
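    <Paragraph> The idea of counting newly introduced vocabulary can be illustrated with a minimal sketch. It assumes a fixed-length interval slid over the word list one word at a time and case-insensitive matching; Youmans's exact interval definition is not given here, so the function name, the interval handling, and the sample sentence are all hypothetical.</Paragraph>
    <Paragraph><![CDATA[
```python
def vocabulary_management_profile(words, interval):
    """For each interval start, count the words in the interval that are first occurrences."""
    seen = set()
    is_new = []
    for word in words:
        key = word.lower()  # assumption: case-insensitive vocabulary
        is_new.append(key not in seen)
        seen.add(key)
    # Slide a fixed-length interval over the text and count first occurrences inside it.
    return [sum(is_new[start:start + interval])
            for start in range(len(words) - interval + 1)]


# Made-up sample text, for illustration only.
text = "the cat sat on the mat the cat saw a dog".split()
profile = vocabulary_management_profile(text, interval=4)
print(profile)
```
]]></Paragraph>
    <Paragraph> In this sketch a low count marks a stretch that mostly repeats earlier words, which is the kind of reiteration the profile records.</Paragraph>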
    <Paragraph position="2"> Morris and Hirst (1991) used Roget's thesaurus to determine whether or not two words have lexical cohesion. Their method can capture almost all the types of lexical cohesion, e.g. systematic and non-systematic semantic relations.</Paragraph>
    <Paragraph position="3"> However, it does not deal with the strength of cohesiveness, which suggests the degree of contribution to the coherence of the segment.</Paragraph>
    <Section position="1" start_page="0" end_page="287" type="sub_section">
      <SectionTitle>
Computing Lexical Cohesion
</SectionTitle>
      <Paragraph position="0"> Kozima and Furugori (1993) defined lexical cohesiveness as semantic similarity between words, and proposed a method for measuring it. Similarity between words is computed by spreading activation on a semantic network which is systematically constructed from an English dictionary (LDOCE).</Paragraph>
      <Paragraph position="1"> The similarity $\sigma(w, w') \in [0, 1]$ between words $w$ and $w'$ is computed in the following way: (1) produce an activated pattern by activating the node $w$; (2) observe the activity of the node $w'$ in the activated pattern. The following examples suggest the features of the similarity $\sigma$:</Paragraph>
      <Paragraph position="3"> The similarity $\sigma$ depends on the significance $s(w) \in [0, 1]$, i.e. the normalized information of the word $w$ in West's corpus (1953). For example: s(red) = 0.500955, s(and) = 0.254294.</Paragraph>
      <Paragraph position="5"> Figure 1: An activated pattern (produced from {red, alcoholic, drink}).</Paragraph>
      <Paragraph position="6"> The following examples show the relationship between the word significance and the similarity:</Paragraph>
      <Paragraph position="8"> The lexical cohesion profile (LCP) treats the text T as a word list without any punctuation or paragraph boundaries.</Paragraph>
      <Paragraph position="9"> Cohesiveness of a Word List. Lexical cohesiveness $c(S_i)$ of the word list $S_i$ is defined as follows: $c(S_i) = \sum_{w \in S_i} s(w) \cdot a(P(S_i), w)$, where $a(P(S_i), w)$ is the activity value of the node $w$ in the activated pattern $P(S_i)$. $P(S_i)$ is produced by activating each node $w \in S_i$ with strength $s(w)^2 / \sum s(w)$. Figure 1 shows a sample pattern of {red, alcoholic, drink}. (Note that it has highly activated nodes like bottle and wine.) The definition of $c(S_i)$ above expresses that $c(S_i)$ represents the semantic homogeneity of $S_i$, since $P(S_i)$ represents the average meaning of $w \in S_i$. For example: c("Molly saw a cat. It was her family pet. She wished to keep a lion.") = 0.403239 (cohesive), c("There is no one but me. Put on your clothes. I can not walk more.") = 0.235462 (not cohesive).</Paragraph>
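      <Paragraph> The cohesiveness computation can be sketched as follows, taking c(S_i) to be the sum over the word list of s(w) times the activity of w, with each node activated at strength s(w)^2 divided by the sum of s(w). That reading of the definition is an assumption. The spreading-activation step over the LDOCE-based network is not reproduced: the activate argument is a hypothetical stand-in, and the significance values below are toy numbers, not values from West's corpus.</Paragraph>
      <Paragraph><![CDATA[
```python
def cohesiveness(words, significance, activate):
    """Significance-weighted sum of node activities in the pattern produced from `words`."""
    total = sum(significance[w] for w in words)
    # Assumed activation strength of each node: s(w)^2 / sum of s(w) over the word list.
    strengths = {}
    for w in words:
        strengths[w] = strengths.get(w, 0.0) + significance[w] ** 2 / total
    pattern = activate(strengths)  # activated pattern: node -> activity value
    return sum(significance[w] * pattern.get(w, 0.0) for w in words)


def toy_activate(strengths):
    """Hypothetical stand-in for spreading activation: returns the input unchanged."""
    return dict(strengths)


# Toy significance values, for illustration only.
s = {"red": 0.5, "wine": 0.5}
c = cohesiveness(["red", "wine"], s, toy_activate)
print(c)
```
]]></Paragraph>
      <Paragraph> Passing the activation step in as a function keeps the sketch independent of any particular semantic network; a real implementation would replace toy_activate with spreading activation over the dictionary-derived network.</Paragraph>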
      <Paragraph position="10">  to semantically vary and makes c(Si) low. As shown in Figure 2, the segment boundaries can be detected by the valleys (minimum points) of LCP.</Paragraph>
      <Paragraph position="11"> The LCP, shown in Figure 3, has large hills and valleys, and also meaningless noise. The graph is so complicated that one cannot easily determine which valley should be considered a segment boundary.</Paragraph>
      <Paragraph position="12"> The shape of the window, which defines the weight of the words in it for pattern production, makes the LCP smooth. Experiments on several window shapes (e.g. triangle window) show that the Hanning window is best for clarifying the macroscopic features of the LCP.</Paragraph>
      <Paragraph position="13"> The width of the window also affects the macroscopic features of the LCP, especially the separability of segments. Experiments on several window widths ($\Delta$ = 5 to 60) reveal that the Hanning window of $\Delta = 25$ gives the best correlation between the LCP and segments.</Paragraph>
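      <Paragraph> A minimal sketch of Hanning-shaped word weights follows. It assumes the window is centered on a word and that the reported width of 25 is a half-width, so the window spans 51 words; both points are assumptions, as is the choice to keep the edge weights slightly above zero. How the weights enter pattern production is not specified here, so the sketch only computes them.</Paragraph>
      <Paragraph><![CDATA[
```python
import math


def hanning_weights(half_width):
    """Weights for offsets -half_width..half_width: 1.0 at the center, tapering to the edges."""
    return [0.5 * (1.0 + math.cos(math.pi * d / (half_width + 1)))
            for d in range(-half_width, half_width + 1)]


weights = hanning_weights(25)  # assumed half-width of 25 words
print(len(weights), weights[25], round(weights[0], 5))
```
]]></Paragraph>
      <Paragraph> Words near the center of the window contribute most and words near its edges contribute little, which is what smooths the profile compared with a rectangular window.</Paragraph>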
    </Section>
  </Section>
  <Section position="5" start_page="287" end_page="287" type="metho">
    <SectionTitle>
VERIFICATION OF LCP
</SectionTitle>
    <Paragraph position="0"> This section inspects the correlation between the LCP and segment boundaries perceived by human judges. The curve in Figure 4 shows the LCP of the simplified version of O. Henry's "Springtime à la Carte" (Thornley, 1960). The solid bars represent the histogram of segment boundaries reported by 16 subjects who read the text without paragraph structure.</Paragraph>
    <Paragraph position="1"> It is clear that the valleys of the LCP correspond mostly to the dominant segment boundaries. For example, the clear valley at i = 110 exactly corresponds to the dominant segment boundary (and also to the paragraph boundary shown as a dotted line).</Paragraph>
    <Paragraph position="2"> Note that the LCP can detect segment changes in a text regardless of its paragraph structure. For example, i = 156 is a paragraph boundary, but neither a valley of the LCP nor a segment boundary; i = 236 is both a segment boundary and approximately a valley of the LCP, but not a paragraph boundary.</Paragraph>
    <Paragraph position="3"> However, some valleys of the LCP do not exactly correspond to segment boundaries. For example, the valley near i = 450 disagrees with the segment boundary at i = 465. The reason is that lexical cohesion cannot cover all aspects of the coherence of a segment; an incoherent piece of text can be lexically cohesive.</Paragraph>
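    <Paragraph> The comparison between valleys and reported boundaries can be made concrete with a small sketch. It pairs each valley with the nearest reported boundary and accepts the pair only within a tolerance. The positions are the ones quoted above (the valley "near" 450 is taken as exactly 450), while the function name and the tolerance of 5 words are hypothetical choices, not part of the reported evaluation.</Paragraph>
    <Paragraph><![CDATA[
```python
def match_boundaries(valleys, boundaries, tolerance):
    """Pair each valley with the nearest reported boundary if it lies within the tolerance."""
    matches = {}
    for v in valleys:
        nearest = min(boundaries, key=lambda b: abs(b - v))
        matches[v] = nearest if abs(nearest - v) <= tolerance else None
    return matches


# Positions quoted in the text; the tolerance is a hypothetical choice.
result = match_boundaries(valleys=[110, 450], boundaries=[110, 236, 465], tolerance=5)
print(result)
```
]]></Paragraph>
    <Paragraph> With these inputs the valley at 110 is paired with the boundary at 110, while the valley at 450 stays unmatched because the nearest boundary is 15 words away.</Paragraph>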
  </Section>
</Paper>