<?xml version="1.0" standalone="yes"?>
<Paper uid="A00-2004">
  <Title>Advances in domain independent linear text segmentation</Title>
  <Section position="4" start_page="0" end_page="26" type="metho">
    <SectionTitle>
3 Algorithm
</SectionTitle>
    <Paragraph position="0"> Our segmentation algorithm takes a list of tokenized sentences as input. A tokenizer (Grefenstette and Tapanainen, 1994) and a sentence boundary disambiguation algorithm (Palmer and Hearst, 1994; Reynar and Ratnaparkhi, 1997) or EAGLE (Reynar et al., 1997) may be used to convert a plain text document into the acceptable input format.</Paragraph>
    <Section position="1" start_page="0" end_page="26" type="sub_section">
      <SectionTitle>
3.1 Similarity measure
</SectionTitle>
      <Paragraph position="0"> Punctuation and uninformative words are removed from each sentence using a simple regular expression pattern mateher and a stopword list. A stemming algorithm (Porter, 1980) is then applied to the remaining tokens to obtain the word stems. A dictionary of word stem frequencies is constructed for each sentence. This is represented as a vector of frequency counts.</Paragraph>
      <Paragraph position="1"> Let fi,j denote the frequency of word j in sentence i. The similarity between a pair of sentences x,y  is computed using the cosine measure as shown in equation 1. This is applied to all sentence pairs to generate a similarity matrix.</Paragraph>
      <Paragraph position="3"> Figure 1 shows an example of a similarity matrix ~ .</Paragraph>
      <Paragraph position="4"> High similarity values are represented by bright pixels. The bottom-left and top-right pixel show the self-similarity for the first and last sentence, respectively. Notice the matrix is symmetric and contains bright square regions along the diagonal. These regions represent cohesive text segments.</Paragraph>
      <Paragraph position="5"> Each value in the similarity matrix is replaced by its rank in the local region. The rank is the number of neighbouring elements with a lower similarity value. Figure 2 shows an example of image ranking using a 3 x 3 rank mask with output range {0, 8}.</Paragraph>
      <Paragraph position="6"> For segmentation, we used a 11 x 11 rank mask. The output is expressed as a ratio r (equation 2) to circumvent normalisation problems (consider the cases when the rank mask is not contained in the image).</Paragraph>
    </Section>
    <Section position="2" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
3.2 Ranking
</SectionTitle>
      <Paragraph position="0"> For short text segments, the absolute value of sire(x, y) is unreliable. An additional occurrence of a common word (reflected in the numerator) causes a disproportionate increase in sim(x,y) unless the denominator (related to segment length) is large.</Paragraph>
      <Paragraph position="1"> Thus, in the context of text segmentation where a segment has typically &lt; 100 informative tokens, one can only use the metric to estimate the order of similarity between sentences, e.g. a is more similar to b than c.</Paragraph>
      <Paragraph position="2"> Furthermore, language usage varies throughout a document. For instance, the introduction section of a document is less cohesive than a section which is about a particular topic. Consequently, it is inappropriate to directly compare the similarity values from different regions of the similarity matrix.</Paragraph>
      <Paragraph position="3"> In non-parametric statistical analysis, one compares the rank of data sets when the qualitative behaviour is similar but the absolute quantities are unreliable. We present a ranking scheme which is an adaptation of that described in (O'Neil and Denos,  the image features.</Paragraph>
      <Paragraph position="4"> # of elements with a lower value</Paragraph>
      <Paragraph position="6"> To demonstrate the effect of image ranking, the process was applied to the matrix shown in figure 1 to produce figure 32 . Notice the contrast has been improved significantly. Figure 4 illustrates the more subtle effects of our ranking scheme, r(x) is the rank (1 x 11 mask) of f(x) which is a sine wave with decaying mean, amplitude and frequency (equation</Paragraph>
    </Section>
    <Section position="3" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
3.3 Clustering
</SectionTitle>
      <Paragraph position="0"> The final process determines the location of the topic boundaries. The method is based on Reynar's maximisation algorithm (Reynar, 1998; Helfman, 1996; Church, 1993; Church and Helfman, 1993). A text segment is defined by two sentences i,j (inclusive).</Paragraph>
      <Paragraph position="1"> This is represented as a square region along the diagonal of the rank matrix. Let si,j denote the sum of the rank values in a segment and aij = (j - i + 1) 2 be the inside area. B = {bl,...,bm} is a list of m (:oherent text segments, sk and ak refers to the sum of rank and area of segment k in B. D is the inside density of B (see equation 4).</Paragraph>
      <Paragraph position="3"> To initialise the process, the entire document is placed in B as one coherent text segment. Each step of the process splits one of the segments in B. The split point is a potential boundary which maximises D. Figure 5 shows a working example.</Paragraph>
      <Paragraph position="4"> The number of segments to generate, m, is determined automatically. D (n) is the inside density of n segments and 5D (n) -- D (n) -D (n-l) is the gradient.</Paragraph>
      <Paragraph position="5"> For a document with b potential boundaries, b steps of divisive clustering generates (D (1), ..., D (b+l)} and {SD(2),...,SD (b+l)} (see figure 6 and 7). An unusually large reduction in 5D suggests the optiinal clustering has been obtained 3 (see n = 10 in</Paragraph>
    </Section>
    <Section position="4" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
3.4 Speed optimisation
</SectionTitle>
      <Paragraph position="0"> The running time of each step is dominated by the computation of sk. Given si,j is constant, our algorithm pre-computes all the values to improve speed performance. The procedure computes the values along diagonals, starting from the main diagonal and  works towards the corner. The method has a com-Ln2 Let refer to the rank value plexity of order 12 * ri,j in the rank matrix R and S to the sum of rank matrix. Given R of size n x n, S is computed in three steps (see equation 5). Figure 8 shows the result of applying this procedure to the rank matrix in figure</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="26" end_page="29" type="metho">
    <SectionTitle>
4 Evaluation
</SectionTitle>
    <Paragraph position="0"> The definition of a topic segment ranges from complete stories (Allan et al., 1998) to summaries (Ponte and Croft, 1997). Given the quality of an algorithm is task dependent, the following experiments focus on the relative performance. Our evaluation strategy is a variant of that described in (Reynar, 1998, 71-73) and the TDT segmentation task (Allan et al., 1998). We assume a good algorithm is one that finds the most prominent topic boundaries.</Paragraph>
    <Section position="1" start_page="26" end_page="26" type="sub_section">
      <SectionTitle>
4.1 Experiment procedure
</SectionTitle>
      <Paragraph position="0"> An artificial test corpus of 700 samples is used to assess the accuracy and speed performance of segmentation algorithms. A sample is a concatenation of ten text segments. A segment is the first n sentences of a randomly selected document from the Brown corpus 4. A sample is characterised by the range of n. The corpus was generated by an automatic procedure 5. Table 1 presents the corpus statistics. null I Range of n - - 11 I 3- 11 400 I 310: I 610: I 9100 I \[ # samples  1. Si, i -~- ri,i</Paragraph>
      <Paragraph position="2"/>
      <Paragraph position="4"> p(misslref, hyp, diff, k)p(diffiref, k)+ (6) p(falref, hyp, same, k)p(samelref , k) Speed performance is measured by the average number of CPU seconds required to process a test sample 6. Segmentation accuracy is measured by th(,. error metric (equation 6, fa --+ false alarms) 1)roposed in (Beeferman et al., 1999). Low error probability indicates high accuracy. Other performanc(; measures include the popular precision and recall metric (PR) (Hearst, 1994), fuzzy PR (Reynar, 1998) and edit distance (Ponte and Croft, 1997). The l)roblems associated with these metrics are discussed in (Beeferman et al., 1999).</Paragraph>
    </Section>
    <Section position="2" start_page="26" end_page="29" type="sub_section">
      <SectionTitle>
4.2 Experiment 1 - Baseline
</SectionTitle>
      <Paragraph position="0"> Five degenerate algorithms define the baseline for the experiments. B,, does not propose any boundaries. Ba reports all potential boundaries as real boundaries. B,. partitions the sample into regular segments. B(r,?) randomly selects any number of  boundaries as real boundaries. B(r,b ) randomly selects b boundaries as real boundaries.</Paragraph>
      <Paragraph position="1"> The accuracy of the last two algorithms are compuWA analytically. We consider the status of m potential bomldaries as a bit string (1 --+ topic boundary). The terms p(miss) and p(fa) in equation 6 corresponds to p(samelk ) and p(difflk ) = 1-p(same\[k). Equatioll 7, 8 and 9 gives the general form of p(samelk ), B(.,.?) and B(r,b), respectively 7.</Paragraph>
      <Paragraph position="2"> Table 2 presents the experimental results. The values in row two and three, four and five are not actually the same. However, their differences are insignificant according to the Kolmogorov-Smirnov,</Paragraph>
    </Section>
    <Section position="3" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
4.3 Experiment 2 - TextTiling
</SectionTitle>
      <Paragraph position="0"> We compare three versions of the TextTiling algorithm (Hearst, 1994). H94(c,d) is Hearst's C implementation with default parameters. H94(c,r) uses the recommended parameters k = 6, w = 20.</Paragraph>
      <Paragraph position="1"> H94(js) is my implementation of the algorithm.</Paragraph>
      <Paragraph position="2"> Experimental result (table 3) shows H94(c,a) and H94(~,r) are more accurate than H94(js). We suspect this is due to the use of a different stopword list and stemming algorithm.</Paragraph>
    </Section>
    <Section position="4" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
4.4 Experiment 3 - DotPlot
</SectionTitle>
      <Paragraph position="0"> Five versions of Reynar's optimisation algorithm (Reynar, 1998) were evaluated. R98 and R98(min) are exact implementations of his maximisation and minimisation algorithm. R98(~,~o~) is my version of the maximisation algorithm which uses the cosine coefficient instead of dot density for measuring similarity. It incorporates the optimisations described  TextTiling.</Paragraph>
      <Paragraph position="1"> in section 3.4. R98(m,dot) is the modularised version of R98 for experimenting with different similarity measures.</Paragraph>
      <Paragraph position="2"> R98(m,s,) uses a variant of Kozima's semantic similarity measure (Kozima, 1993) to compute block similarity. Word similarity is a function of word co-occurrence statistics in the given document. Words that belong to the same sentence are considered to be related. Given the co-occurrence frequencies f(wi, wj), the transition probability matrix t is computed by equation 10. Equation 11 defines our spread activation scheme, s denotes the word similarity matrix, x is the number of activation steps and norm(y) converts a matrix y into a transition matrix, x = 5 was used in the experiment.</Paragraph>
      <Paragraph position="3"> y(w ,wj) (10) t ,j = p(wj Iw ) = Ej s=norm(~t')i=l (11) Experimental result (table 4) shows the cosine co-efficient and our spread activation method improved segmentation accuracy. The speed optimisations significantly reduced the execution time.</Paragraph>
    </Section>
    <Section position="5" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
4.5 Experiment 4 - Segmenter
</SectionTitle>
      <Paragraph position="0"> We compare three versions of Segmenter (Kan et al., 1998). K98(B) is the original Perl implementation of  the algoritlun (version 1.6). K98(j) is my implementation of the algorithm, K98(j,a) is a version of K98(j) which uses a document specific chain breaking strategy. The distribution of link distances are used to identify unusually long links. The threshold is a function # + c x vf5 of the mean # and variance u. We found c = 1 works well in practice.</Paragraph>
      <Paragraph position="1"> Table 5 summarises the experimental results.</Paragraph>
      <Paragraph position="2"> K98(p) performed significantly better than K98g,,).</Paragraph>
      <Paragraph position="3"> This is due to the use of a different part-of-speech tagger and shallow parser. The difference in speed is largely due to the programming languages and term clustering strategies. Our chain breaking strategy improved accuracy (compare K98(j) with K98(j,~)).</Paragraph>
    </Section>
    <Section position="6" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
4.6 Experiment 5 - Our algorithm, C99
</SectionTitle>
      <Paragraph position="0"> Two versions of our algorithm were developed, C99 and C99(b). The former is an exact implementation of the algorithm described in this paper. The latter is given the expected number of topic segments for fair comparison with R98. Both algorithms used a 11 x 11 ranking mask.</Paragraph>
      <Paragraph position="1"> The first experiment focuses on the impact of our automatic termination strategy on C99(~) (table 6).</Paragraph>
      <Paragraph position="2"> C99(b) is marginally more accurate than C99. This indicates our automatic termination strategy is effective but not optimal. The minor reduction in speed performance is acceptable.</Paragraph>
      <Paragraph position="3">  our algorithm, C99.</Paragraph>
      <Paragraph position="4"> The second experiment investigates the effect of different ranking mask size on the performance of C99 (table 7). Execution time increases with mask size. A 1 x 1 ranking mask reduces all the elements in the rank matrix to zero. Interestingly, the increase in ranking mask size beyond 3 x 3 has insignificant effect on segmentation accuracy. This suggests the use of extrema for clustering has a greater impact on accuracy than linearising the similarity scores (figure</Paragraph>
    </Section>
    <Section position="7" start_page="29" end_page="29" type="sub_section">
      <SectionTitle>
4.7 Summary
</SectionTitle>
      <Paragraph position="0"> Experimental result (table 8) shows our algorithm C99 is more accurate than existing algorithms.</Paragraph>
      <Paragraph position="1"> A two-fold increase in accuracy and seven-fold increase in speed was achieved (compare C99(b) with R98). If one disregards segmentation accuracy, H94 has the best algorithmic performance (linear). C99, K98 and R98 are all polynomial time algorithms.</Paragraph>
      <Paragraph position="2"> The significance of our results has been confirmed</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>