<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1100"> <Title>Text Segmentation Using Reiteration and Collocation</Title>
<Section position="4" start_page="614" end_page="615" type="metho"> <SectionTitle> 3 Proposed Segmentation Algorithm </SectionTitle>
<Paragraph position="0"> The proposed segmentation algorithm compares adjacent windows of sentences and determines their lexical similarity. A window size of three sentences was found to produce the best results.</Paragraph>
<Paragraph position="1"> Multiple sentences were compared because calculating lexical similarity between individual words is too fine-grained (Rotondo, 1984) and between individual sentences is unreliable (Salton and Buckley, 1991). Lexical similarity is calculated for each window comparison based on the proportion of related words, and is given as a normalised score. Word repetition is identified between identical words and between words derived from the same stem.</Paragraph>
<Paragraph position="2"> Collocations are located by looking up word pairs in the collocation lexicon. Relation weights are calculated between pairs of words according to their location in RT. The lexical similarity score indicates the amount of lexical cohesion exhibited by two windows. Scores plotted on a graph show a series of peaks (high scores) and troughs (low scores). Low scores indicate a weak level of cohesion; hence, a trough signals a potential subject change, and texts can be segmented at these points.</Paragraph> </Section>
<Section position="5" start_page="615" end_page="615" type="metho"> <SectionTitle> 4 Experiment 1: Locating Subject Change </SectionTitle>
<Paragraph position="0"> An investigation was conducted to determine whether the segmentation algorithm could reliably locate subject changes in text.</Paragraph>
<Paragraph position="1"> Method: Seven topical articles of between 250 and 450 words in length were extracted from the World Wide Web.
A total of 42 test texts were generated by concatenating pairs of these articles. Hence, each generated text consisted of two articles, and the transition from the first article to the second represented a known subject change point.</Paragraph>
<Paragraph position="2"> Previous work has used the breaks between concatenated texts to evaluate the performance of text segmentation algorithms (Reynar, 1994; Stairmand, 1997). For each text, the troughs placed by the segmentation algorithm were compared to the location of the known subject change point in that text. An error margin of one sentence either side of this point, determined by empirical analysis, was allowed.</Paragraph>
<Paragraph position="3"> Results: Table 1 gives the results for the comparison of the troughs placed by the segmentation algorithm to the known subject change points.</Paragraph>
<Paragraph position="4"> Table 1: Troughs placed per text (average | std. dev.) and subject change points located (out of 42 possible) using different linguistic features.</Paragraph>
<Paragraph position="5"> Discussion: The segmentation algorithm using the linguistic features word repetition and collocation in combination achieved the best result. A total of 41 out of a possible 42 known subject change points were identified from the fewest troughs placed per text (7.1). For the text where the known subject change point went undetected, three troughs were placed, at sentences 6, 11 and 18. The subject change point occurred at sentence 13, just two sentences after the predicted subject change at sentence 11.</Paragraph>
<Paragraph position="6"> In this investigation, word repetition alone achieved better results than either collocation or relation weights used individually. Combining word repetition with another linguistic feature improved on its individual result, with fewer troughs placed per text.</Paragraph> </Section> </Paper>
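The windowed comparison described in Section 3 can be sketched in code. This is a minimal illustration under stated assumptions, not the authors' implementation: the crude suffix-stripping stemmer, the normalisation by set overlap, and all function names are assumptions, and the paper's collocation lexicon and RT relation weights are omitted entirely.

```python
def stem(word):
    """Very rough stemmer (assumption): lowercase and strip a few suffixes."""
    w = word.lower()
    for suffix in ("ing", "ed", "es", "s"):
        if w.endswith(suffix) and len(w) > len(suffix) + 2:
            return w[: -len(suffix)]
    return w

def similarity(window_a, window_b):
    """Normalised proportion of shared stems between two sentence windows
    (set overlap stands in for the paper's lexical similarity score)."""
    stems_a = {stem(w) for s in window_a for w in s.split()}
    stems_b = {stem(w) for s in window_b for w in s.split()}
    if not stems_a or not stems_b:
        return 0.0
    return len(stems_a & stems_b) / len(stems_a | stems_b)

def find_troughs(sentences, window=3):
    """Score each boundary between adjacent windows of `window` sentences;
    local minima in the score series mark candidate subject changes."""
    scores = []
    for i in range(window, len(sentences) - window + 1):
        left = sentences[i - window:i]
        right = sentences[i:i + window]
        scores.append((i, similarity(left, right)))
    troughs = [
        scores[j][0]
        for j in range(1, len(scores) - 1)
        if scores[j][1] < scores[j - 1][1] and scores[j][1] < scores[j + 1][1]
    ]
    return scores, troughs
```

On a text built by concatenating two articles, as in Experiment 1, the boundary between the articles should appear as a trough in the score series, since the two windows straddling it share few stems.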