<?xml version="1.0" standalone="yes"?>
<Paper uid="C94-2187">
  <Title>A DISCOURSE GRAMMATICO-STATISTICAL APPROACH TO PARTITIONING</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
A DISCOURSE GRAMMATICO-STATISTICAL APPROACH TO
PARTITIONING
</SectionTitle>
    <Paragraph position="0"/>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> The paper presents a new approach to text segmentation, which concerns dividing a text into coherent discourse units. The approach builds on the theory of discourse segment (Nomoto and Nitta, 1993), incorporating ideas from the research on information retrieval (Salton, 1988). A discourse segment has to do with a structure of Japanese discourse; it could be thought of as a linguistic unit demarcated by wa, a Japanese topic particle, which may extend over several sentences. The segmentation works with discourse segments and makes use of a coherence measure based on tf.idf, a standard information retrieval measurement (Salton, 1988; Hearst, 1993). Experiments have been done with a Japanese newspaper corpus. It has been found that the present approach is quite successful in recovering articles from the unstructured corpus.</Paragraph>
    <Paragraph position="1"> Introduction In this paper, we describe a method for discovering coherent texts from an unstructured corpus. The method is both linguistically and statistically motivated. It derives a linguistic motivation from the view that discourse consists of what we call discourse segments, minimal coherent units of discourse (Nomoto and Nitta, 1993), while statistically it is guided by ideas from information retrieval (Salton, 1988). Previous quantitative approaches to text segmentation (Hearst, 1993; Kozima, 1993; Youmans, 1991) have paid little attention to a statistically important structure that a discourse might have and defined it away as a lump of words or sentences. Part of our concern here is with explicating possible effects of a discourse segment on the quantitative structuring of discourse.</Paragraph>
    <Paragraph position="2"> In what follows, we will describe some important features of the discourse segment and see how it can be incorporated into a statistical analysis of discourse. Also some comparison is made with other approaches such as (Youmans, 1991), followed by discussion of the results of the present method.</Paragraph>
    <Paragraph position="3"> Theory of Discourse Segment The theory of discourse segment (Nomoto and Nitta, 1993)* carries with it a set of empirical hypotheses about the structure of Japanese discourse. Among them is the claim that Japanese discourse is constructed from a series of linguistic units called discourse segments. The discourse segment is thought of as a topic-comment structure, where a topic corresponds to the subject matter and a comment to a discussion about it. In particular, Japanese has a special way of marking the topic: by suffixing it with a postpositional particle wa. Thus in Japanese, a topic-comment structure takes the form: * -wa * * ... * (the wa-marked element being the topic, the rest the comment), where &amp;quot;*&amp;quot; represents a word. The comment part could become quite long, extending over quite a few sentences (Mikami, 1960). Now Japanese provides for a variety of ways to mark off a topic-comment structure; the wa-marking is one such and a typographical device such as a line- or a page-break is another. For the present discussion, we take a discourse segment to be a block of sentences bounded by a text break and/or a wa-marked element. (*Author address: 2520 Hatoyama, Saitama 350-03, Japan; tel. +81-492-96-6111; fax. +81-492-96-6006.)</Paragraph>
    <Paragraph position="4"> [ T1 S1 S2 ... Sn ] [ T2 Sn+1 Sn+2 ... Sn+m ]</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Discourse Segment Discourse Segment
</SectionTitle>
      <Paragraph position="0"> where &amp;quot;T&amp;quot; denotes a boundary marker, &amp;quot;S&amp;quot; a sentence, and &amp;quot;\[&amp;quot; a segment gap. For the semantics of a discourse segment, Nomoto and Nitta (1993) observes an interesting tendency that zero (elliptical) anapl,ora occurring within the segment do not refer across the segment boundary; that is, their references tend to be resolved internally 1 .</Paragraph>
      <Paragraph position="1"> Now we take a very simple view about the global structure of discourse: syntactically, discourse is just 1 Here and throughout we use 01 for a subject (NOMinative) zero; 02 for an object (ACCusative) zero; TOP for a topic case; DAT for a dative (indirect object) case; PASS for a passive morpheme. Taro(i) -wa [TOP] 01(i) rojin(j) [old man] -ni [DAT] seki [seat] -wo [ACC] yuzutte -ageta [give help] node [because], 01(i) 02(j) orei -wo iwareta [thank say PASS]. &amp;quot;Because Taro gave the old man a favor of giving a seat, he thanked Taro.&amp;quot; Note that all the instances of 01 and 02 have internal antecedents: Taro and rojin.</Paragraph>
      <Paragraph position="2"> a chronological juxtaposition of contiguous, disjoint blocks of sentences, each of which corresponds to a discourse segment; semantically, discourse is a set of anaphoric islands set side by side. Thus a discourse should look like Figure 1, where G denotes a discourse segment.</Paragraph>
    </Section>
  </Section>
  <Section position="3" start_page="0" end_page="1146" type="metho">
    <SectionTitle>
D
</SectionTitle>
    <Paragraph position="0"> Furthermore, we do not intend the discourse structure to be anything close to the ones that rhetorical theories of discourse (Hovy, 1990; Mann and Thompson, 1987; Hobbs, 1979) claim it to be, or intentional structure (Grosz and Sidner, 1986); indeed we do not assume any functional relation, i.e., causation, elaboration, extension, etc., among the segments that constitute a discourse structure. The present theory is not so much about the rhetoric or the function of discourse as about the way anaphora are interpreted.</Paragraph>
    <Paragraph position="1"> It is quite possible that a set of discourse segments is not aggregated into a single discourse but falls into diverse discourse groupings (Nomoto and Nitta, 1993).</Paragraph>
    <Paragraph position="2"> This happens when discourses are interleaved with or embedded in one another. An interleaving or an embedding of discourse is often invoked by changes in narrative mode such as direct/indirect speech, quoting, or interruption, which will cue the reader/hearer to suspend a current discourse flow, start another, or resume the interrupted discourse.</Paragraph>
    <Paragraph position="3"> A Quantitative Structuring of</Paragraph>
    <Section position="1" start_page="0" end_page="1146" type="sub_section">
      <SectionTitle>
Discourse
Vector Space Model
</SectionTitle>
      <Paragraph position="0"> Formally, a discourse segment is represented as a term vector of the form: Gi = (gi1, gi2, gi3, ..., git) where gij represents a nominal occurrence in Gi. In information retrieval terms, what happens here is that a discourse segment Gi is indexed with a set of terms gi1 through git; namely, we characterize Gi with a set of indices gi1, ..., git. A term vector can either be binary, where each term in the vector takes 0 or 1, i.e., absence or presence, or weighted, where a term is assigned a certain importance value. In general, the weighted indexing is preferable to the binary indexing, as the latter policy is known to have problems with precision (Salton, 1988) 2. The weighting policy that we will adopt is known as tf.idf. It is an indicator of term importance wij defined by: wij = tfij * log(N/dfj). 2 Precision measures the proportion of correct items retrieved against the total number of retrieved items.</Paragraph>
      <Paragraph position="1"> where tf (term frequency) is the number of occurrences of a term Tj in a document Di; df (document frequency) is the number of documents in a collection of N documents in which Tj occurs; and the importance wij is given as the product of tf and the inverse df factor, or idf, log N/dfj. With the tf.idf policy, high-frequency terms that are scattered evenly over the entire document collection are considered to be less important than those that are frequent but whose occurrences are concentrated in particular documents 3. Thus the tf.idf indexing favors rare words, which distinguish the documents more effectively than common words.</Paragraph>
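The weighting just defined can be written down directly. The sketch below is not the authors' code; `tfidf_weights` is a hypothetical helper that treats each document as a plain list of noun tokens and computes wij = tfij * log(N/dfj):

```python
import math
from collections import Counter

def tfidf_weights(documents):
    """tf.idf weights: w_ij = tf_ij * log(N / df_j).

    documents -- a list of documents, each a list of (noun) tokens.
    Returns one {term: weight} dict per document.
    """
    n_docs = len(documents)
    # df_j: number of documents in which term j occurs at least once
    df = Counter(term for doc in documents for term in set(doc))
    weights = []
    for doc in documents:
        tf = Counter(doc)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights
```

Note that a term occurring in every document gets weight 0 (log 1), which is exactly the "evenly scattered terms are unimportant" behaviour described above.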
      <Paragraph position="2"> With the indexing method in place, it is now possible to define the coherence between two term vectors. For term vectors X = (x1, x2, ..., xt) and Y = (y1, y2, ..., yt), let the coherence be defined by:</Paragraph>
      <Paragraph position="4"> where w(xi) represents a tf.idf weight assigned to the term xi. The measure is known as the Dice coefficient 4.</Paragraph>
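The coherence formula itself did not survive extraction. One standard weighted form of the Dice coefficient, consistent with the w(xi) notation above, would be the following; this reconstruction is an assumption, and the paper's exact normalization may differ:

```latex
c(X, Y) \;=\; \frac{2 \sum_{i=1}^{t} w(x_i)\, w(y_i)}{\sum_{i=1}^{t} w(x_i)^{2} \;+\; \sum_{i=1}^{t} w(y_i)^{2}}
```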
      <Paragraph position="5"> Experiments Earlier quantitative approaches to text partitioning (Youmans, 1991; Kozima, 1993; Hearst, 1993) work with an arbitrary block of words or sentences to determine a structure of discourse. In contrast, we work with a block of discourse segments. It is straightforward to apply the tf.idf to the analysis of discourse; one may just treat a block of discourse segments as a document unit by itself and then define the term frequency (tf), the document frequency (df), and the size of the document collection (N) accordingly. Coherence would then be determined by the number of terms segment blocks share and the tf.idf weights the terms carry.</Paragraph>
      <Paragraph position="6"> Thus one pair of blocks will become more cohesive than another if the pair share more of the terms that are locally frequent.</Paragraph>
      <Paragraph position="7"> The partitioning proceeds in two steps. We start with the following: 1. Collect all the nominal occurrences found in a corpus 5. 2. Divide the collection into disjoint discourse segments. 3. Compare all pairs of adjacent blocks of discourse segments.</Paragraph>
      <Paragraph position="8"> 3 Precision depends on the size of the document collection; as the collection gets smaller in size, index terms become less extensive and more discriminatory. The idf factor could be dispensed with in such cases.</Paragraph>
      <Paragraph position="10"> 4. Assign a coherence/similarity value to each pair.</Paragraph>
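The four steps above can be sketched end to end. `coherence_curve` below is a hypothetical reconstruction, not the authors' implementation: it treats each single segment as a "document" for the df counts, weights each n-segment block with tf.idf, and scores every gap with a weighted Dice coefficient (the paper's exact window extents and df unit may differ):

```python
import math
from collections import Counter

def coherence_curve(segments, n):
    """Coherence at each segment gap, using paired n-segment windows.

    segments -- list of segments, each a list of noun tokens.
    Returns {j: coherence between segments[j-n+1..j] and segments[j+1..j+n]}.
    """
    N = len(segments)
    # document frequency, treating each segment as one "document"
    df = Counter(t for seg in segments for t in set(seg))

    def weights(block):
        # treat the whole block as a single document unit (step: tf over block)
        tf = Counter(t for seg in block for t in seg)
        return {t: tf[t] * math.log(N / df[t]) for t in tf}

    def dice(x, y):
        # weighted Dice coefficient over two {term: weight} vectors
        num = 2 * sum(x[t] * y[t] for t in set(x) & set(y))
        den = sum(v * v for v in x.values()) + sum(v * v for v in y.values())
        return num / den if den else 0.0

    curve = {}
    for j in range(n - 1, N - n):  # skip positions with underfilled windows
        left = weights(segments[j - n + 1 : j + 1])
        right = weights(segments[j + 1 : j + n + 1])
        curve[j] = dice(left, right)
    return curve
```

Valleys in the returned curve are then candidate segment-group boundaries, as described in the next paragraph.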
      <Paragraph position="11"> Next, we examine the coherence curve for hills and valleys and partition it manually. Valleys (viz., low coherence values) are likely to signal a potential break in a discourse flow, whereas hills (viz., high coherence values) would signal local coherency. Figure 2 shows how a coherence curve might appear.</Paragraph>
      <Paragraph position="12"> Coherence is measured at every segment with a paired comparison window moving along the sequence of segments. Or more precisely, given a segment dj and a block size n, what we do is to compare a block spanning dj-n+1 through dj and one spanning dj+1 through dj+n-1. The measurement is plotted at the jth position on the x-axis. If either of the comparison windows is underfilled, no measurement will be made.</Paragraph>
      <Paragraph position="15"> The graph that the procedure gives out is smoothed (with an appropriate width) to bring out its global trends. The length of a single segment, i.e., noun counts, varies from text to text, genre to genre, ranging from a few words (a junior high science book) to somewhere around 60 words (a newspaper editorial).</Paragraph>
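The smoothing step is unspecified beyond "an appropriate width"; a plain centred moving average is one minimal reading of it. `smooth` is a hypothetical helper, not the paper's procedure:

```python
def smooth(values, width):
    """Centred moving-average smoothing of a coherence curve.

    values -- list of coherence measurements; width -- odd window width.
    Windows are truncated at both ends of the curve.
    """
    half = width // 2
    out = []
    for i in range(len(values)):
        window = values[max(0, i - half) : i + half + 1]
        out.append(sum(window) / len(window))
    return out
```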
      <Paragraph position="16"> We performed experiments on a four-week collection of editorials from Nihon Keizai Shimbun, a Japanese economics newspaper, which contains the total of 1,111 sentences with some 10,000 nouns, and 556 discourse segments. The corpus was divided into segmental sets of nouns semi-automatically 6. Coherence was measured for each adjacent pair of segments, using [footnote 6: those that are suffixed to case particles such as to (CONJUNCTIVE), de (LOCATIVE/INSTRUMENTAL), he (DIRECTIONAL), kara (SOURCE), ni (DATIVE), etc., or to a particular form of verbal inflection (renyou-kei, i.e., infinitive); thus wa is treated as non-topical unless it occurs as a postposition to the bare noun.]</Paragraph>
      <Paragraph position="17"> the Dice coefficient. It was found that the block size of 10 segments yields good results. Figure 4 shows a coherence graph covering about a week's amount of the corpus. The graph is smoothed with the width of 5.</Paragraph>
      <Paragraph position="18"> We see that article boundaries (vertical lines) coincide fairly well with major minima on the graph: with only one miss at 65, which falls on a paragraph boundary.</Paragraph>
      <Paragraph position="19"> Experiments with various block sizes suggest that the choice of block size relates in some way to the structure of discourse; an increasing block size would extract a more global or general structure of discourse.</Paragraph>
      <Paragraph position="20"> Youmans (1991) has suggested an information measurement based on word frequency. It is intended to measure the ebb and flow of 'new information' in discourse. The idea is simply to count the number of new words introduced over a moving interval and produce what he calls a vocabulary management profile (VMP), or measurements at intervals. Now given a discourse</Paragraph>
      <Paragraph position="22"> ours. Figure 5 shows the results of a VMP analysis for the same nominal collection as above. The interval is set to 300 words, or the average length of a paired window in the previous analysis. The y-axis corresponds to the number of new words (TYPE) and the x-axis to an interval position (TOKEN). As it turns out, the VMP fails to detect any significant pattern in the corpus.</Paragraph>
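The VMP idea can be sketched as counting first occurrences inside a moving interval. `vmp` below is a hypothetical reconstruction for comparison purposes only; Youmans's exact interval alignment may differ:

```python
def vmp(tokens, interval):
    """Vocabulary management profile (after Youmans 1991, sketched).

    For each interval-sized window over the token sequence, count the
    tokens that are making their very first appearance in the text.
    """
    first = {}
    for pos, tok in enumerate(tokens):
        first.setdefault(tok, pos)  # position of each type's first occurrence
    profile = []
    for start in range(len(tokens) - interval + 1):
        new = sum(1 for p in range(start, start + interval)
                  if first[tokens[p]] == p)
        profile.append(new)
    return profile
```

A text with heavy repetition yields a profile that quickly flattens to near zero, which is exactly the weakness discussed in the next paragraph.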
      <Paragraph position="23"> One of the problems with the analysis has to do with its generality (Kozima, 1993); a text with many repetitions and/or limited vocabulary would yield a flattened VMP, which, of course, does not tell us much about its inner structurings. Indeed, this could be the case with Figure 5. We suspect that the VMP scheme fares better with a short-term coherency than with a long-term or global coherency.</Paragraph>
    </Section>
    <Section position="2" start_page="1146" end_page="1146" type="sub_section">
      <SectionTitle>
Evaluation
</SectionTitle>
      <Paragraph position="0"> Figure 6 demonstrates the results of the Dice analysis on the Nikkei collection of editorial articles. What we see here is a close correspondence between the Dice curve and the global discourse structure. Evaluation here simply consists of finding major minima on the graph and locating them on the list of those discourse segments which comprise the corpus. The procedure is performed by hand.</Paragraph>
      <Paragraph position="1"> Correspondence evaluation has been problematical, since it requires human judgments on how discourse is structured, whose reliability is yet to be demonstrated. It was decided here to use copious boundary indicators such as an article or paragraph break for evaluating matches between the Dice analysis and the discourse. For us, discourse structure reduces to just an orthographic structure 7.</Paragraph>
      <Paragraph position="2"> In the figure, article boundaries are marked by dashes. 7 out of 27 local minima are found to be incorrect, which puts the error rate at around 25%. We obtained similar results for the Jaccard and cosine coefficients. A possible improvement would include adjusting the document frequency (df) factor for index terms; the average df factor we had for the Nikkei corpus is around 1.6, which is so low as to be negligible 8. 7 Yet, there is some evidence that an orthographic structure is linguistically significant (Fujisawa et al., 1993; Nunberg, 1990). 8 However, the average df factor would increase in proportion Another interesting possibility is to use an alternative weighting policy, the weighted inverse document frequency (Tokunaga and Iwayama, 1994). A widf value of a term is its frequency within the document divided by its frequency throughout the entire document collection. The widf policy is reported to have a marked advantage over the idf for the text categorization task.</Paragraph>
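As described, a widf value divides a term's within-document frequency by its frequency across the whole collection. `widf_weights` is a minimal sketch under that reading, not the implementation of Tokunaga and Iwayama (1994):

```python
from collections import Counter

def widf_weights(documents):
    """Weighted inverse document frequency, as described in the text:
    within-document frequency / collection-wide frequency of the term."""
    total = Counter(t for doc in documents for t in doc)
    return [{t: tf / total[t] for t, tf in Counter(doc).items()}
            for doc in documents]
```

A term confined to one document scores 1.0 there, regardless of how often it repeats; a term spread across the collection scores well below 1 everywhere.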
    </Section>
    <Section position="3" start_page="1146" end_page="1146" type="sub_section">
      <SectionTitle>
Recall and Precision
</SectionTitle>
      <Paragraph position="0"> As with the document analysis, the effectiveness of text segmentation appears to be dictated by recall/precision parameters, where: recall = number of correct boundaries retrieved / total number of correct boundaries; precision = number of correct boundaries retrieved / total number of boundaries retrieved. A boundary here is meant to be a minimum on the coherence graph. Precision is strongly affected by the size of block or interval 9; a large-block segmentation yields fewer boundaries than a small-block segmentation (Table 1). Experiments were made on the Nikkei cor-</Paragraph>
      <Paragraph position="1"> pus to examine the effects of the block size parameter on text segmentation. The corpus was divided into equally sized blocks, ranging from 5 to 35 words in length. The window size was kept to 10 blocks. Shown in Figure 7 are the results given in terms of recall and precision. Also, a partitioning task with discourse segments, whose length varies widely, is measured for recall and precision, and the result is represented as G. Averaging recall and precision values for each size gives an ordering: 35&lt;25&lt;20&lt;5&lt;30&lt;15&lt;10&lt;G. [footnote 8, continued: to the growth of corpus size. It is likely, therefore, that with a]</Paragraph>
      <Paragraph position="2"> G ranks highest, whose average value, 0.66, is higher than any other. '10' comes second (0.61) 10. (It is an interesting coincidence that the average length of discourse segments is 13.7 words.) The results demonstrate in quantitative terms the significance of discourse segments.</Paragraph>
      <Paragraph position="3"> It is worth pointing out that the method here is rather selective about a level of granularity it detects, namely, that of a news article. It is possible, however, to have a much smaller granularity; as shown in Table 1, decreasing the block size would give a segmentation of a smaller granularity. Still, we chose not to work on fine-grained segmentations because they lack a reliable evaluation metric 11.</Paragraph>
      <Paragraph position="4"> Conclusion In this paper, we have described a method for partitioning an unstructured corpus into coherent textual units. We have adopted the view that discourse consists of contiguous, non-overlapping discourse segments. We have referred to a vector space model for a statistical representation of discourse segments. Coherence between segments is determined by the Dice coefficient with the tf.idf term weighting.</Paragraph>
      <Paragraph position="5"> We have demonstrated in quantitative terms that the method here is quite successful in discovering articles from the corpus. An interesting question yet to be answered is how the corpus size affects the document [footnote 8, continued: larger corpus, we might get better results.]</Paragraph>
      <Paragraph position="6"> 9 Blocks here are intended to mean minimal textual units into which a discourse is divided and for which coherence is measured. 10 In general, recall is inversely proportionate to precision; a high precision implies a low recall and vice versa.</Paragraph>
      <Paragraph position="7"> frequency and coherence measurements. Another problem has to do with relating the present discussion to rhetorical analyses of discourse. 11 Passonneau and Litman (1993) report a psychological study on the human reliability of discourse segmentation.</Paragraph>
    </Section>
  </Section>
</Paper>