Text Segmentation Using Reiteration and Collocation

Abstract

A method is presented for segmenting text into subtopic areas. The proportion of related word pairs between adjacent windows of text is calculated to determine their lexical similarity. The lexical cohesion relations of reiteration and collocation are used to identify related words, and these relations are located automatically using a combination of three linguistic features: word repetition, collocation and relation weights. The method is shown to detect known subject changes in text and to correspond well with the segmentations placed by test subjects.

Introduction

Many examples of heterogeneous data can be found in daily life. The Wall Street Journal archives, for example, consist of a series of articles about different subject areas. Segmenting such data into distinct topics is useful for information retrieval, so that only those segments relevant to a user's query are retrieved. Text segmentation could also serve as a pre-processing step in automatic summarisation: each segment could be summarised individually and the summaries combined to provide an abstract for the document.

Previous work on text segmentation has used term matching to identify clusters of related text. Salton and Buckley (1992) and, later, Hearst (1994) extracted related text portions by matching high-frequency terms. Yaari (1997) segmented text into a hierarchical structure, identifying sub-segments of larger segments. Ponte and Croft (1997) used word co-occurrences to expand the number of terms available for matching. Reynar (1994) compared all words across a text rather than only the more usual nearest neighbours. A problem with word repetition is that inappropriate matches can be made because of the lack of contextual information (Salton et al., 1994). Another approach to text segmentation is the detection of semantically related words.

Hearst (1993) incorporated semantic information derived from WordNet, but in later work reported that this information actually degraded the word repetition results (Hearst, 1994). Related words have also been located using spreading activation over a semantic network (Kozima, 1993), although only one text was segmented. Another approach extracted semantic information from Roget's Thesaurus (RT): lexical cohesion relations (Halliday and Hasan, 1976) between words were identified in RT and used to construct lexical chains of related words in five texts (Morris and Hirst, 1991). The lexical chains were reported to correlate closely with the intentional structure (Grosz and Sidner, 1986) of the texts, the start and end of chains coinciding with the intention ranges. However, RT does not capture all types of lexical cohesion relation; in previous work, collocation (a lexical cohesion relation) was found to be under-represented in the thesaurus. Furthermore, the chain-construction process was not automated and relied on subjective decision-making.

Following Morris and Hirst's work, a segmentation algorithm was developed based on identifying lexical cohesion relations across a text.
The proposed algorithm is fully automated and calculates a quantitative measure of the association between words. It uses linguistic features additional to those captured in the thesaurus to identify the other types of lexical cohesion relation that can exist in text.
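To make the window comparison concrete, the following is a minimal sketch of the general idea, not the authors' implementation: the functions related, window_similarity and segment, the relation_weights dictionary, and the window size and boundary threshold are all illustrative assumptions, with a simple dictionary of word-pair scores standing in for the thesaurus-derived collocation and relation-weight features.

    def related(word_a, word_b, relation_weights):
        """Judge whether two words are lexically related.

        Word repetition models reiteration; relation_weights is a
        hypothetical dictionary of word-pair association scores standing
        in for the collocation and relation-weight features.
        """
        if word_a == word_b:  # reiteration via simple word repetition
            return True
        return relation_weights.get(frozenset((word_a, word_b)), 0.0) > 0.0

    def window_similarity(left, right, relation_weights):
        """Proportion of related word pairs between two adjacent windows."""
        pairs = [(a, b) for a in left for b in right]
        if not pairs:
            return 0.0
        hits = sum(related(a, b, relation_weights) for a, b in pairs)
        return hits / len(pairs)

    def segment(tokens, relation_weights, window=20, threshold=0.1):
        """Propose a boundary wherever the similarity of adjacent windows
        falls below the threshold (window and threshold are illustrative
        values, not taken from the paper)."""
        boundaries = []
        for i in range(window, len(tokens) - window + 1, window):
            left = tokens[i - window:i]
            right = tokens[i:i + window]
            if window_similarity(left, right, relation_weights) < threshold:
                boundaries.append(i)
        return boundaries

Run over a tokenised document, segment returns the token offsets at which adjacent windows share too few related word pairs, which is where subtopic boundaries would be proposed under this reading of the method.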