<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2004"> <Title>Advances in domain independent linear text segmentation</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Background </SectionTitle> <Paragraph position="0"> Existing work falls into one of two categories: lexical cohesion methods and multi-source methods (Yaari, 1997). The former stem from the work of Halliday and Hasan (Halliday and Hasan, 1976), who proposed that text segments with similar vocabulary are likely to be part of a coherent topic segment.</Paragraph> <Paragraph position="1"> Implementations of this idea use word stem repetition (Youmans, 1991; Reynar, 1994; Ponte and Croft, 1997), context vectors (Hearst, 1994; Yaari, 1997; Kaufmann, 1999; Eichmann et al., 1999), entity repetition (Kan et al., 1998), semantic similarity (Morris and Hirst, 1991; Kozima, 1993), a word distance model (Beeferman et al., 1997a) and a word frequency model (Reynar, 1999) to detect cohesion.</Paragraph> <Paragraph position="2"> Methods for finding the topic boundaries include sliding window (Hearst, 1994), lexical chains (Morris, 1988; Kan et al., 1998), dynamic programming (Ponte and Croft, 1997; Heinonen, 1998), agglomerative clustering (Yaari, 1997) and divisive clustering (Reynar, 1994). Lexical cohesion methods are typically used for segmenting written text in a collection to improve information retrieval (Hearst, 1994; Reynar, 1998).</Paragraph> <Paragraph position="3"> Multi-source methods combine lexical cohesion with other indicators of topic shift such as cue phrases, prosodic features, reference, syntax and lexical attraction (Beeferman et al., 1997a) using decision trees (Miike et al., 1994; Kurohashi and Nagao, 1994; Litman and Passonneau, 1995) and probabilistic models (Beeferman et al., 1997b; Hajime et al., 1998; Reynar, 1998). Work in this area is largely motivated by the topic detection and tracking (TDT) initiative (Allan et al., 1998). 
The focus is on the segmentation of transcribed spoken text and broadcast news stories where the presentation format and regular cues can be exploited to improve accuracy.</Paragraph> </Section> </Paper>