<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3210">
<Title>Automatic Paragraph Identification: A Study across Languages and Domains</Title>
<Section position="3" start_page="0" end_page="0" type="relat">
<SectionTitle> 2 Related Work </SectionTitle>
<Paragraph position="0"> Previous work has focused extensively on the task of automatic text segmentation, whose primary goal is to divide individual texts into sub-topics. Despite their differences, most methods are unsupervised and typically rely on the distribution of words in a given text to provide cues for topic segmentation (see footnote 2). Hearst's (1997) TextTiling algorithm, for example, determines sub-topic boundaries on the basis of term overlap in adjacent text blocks. In more recent work, Utiyama and Isahara (2001) combine a statistical segmentation model with a graph search algorithm to find the maximum-probability segmentation. Beeferman et al. (1999), by contrast, use supervised learning methods to infer boundaries between texts. They employ language models to detect topic shifts and combine them with cue word features.</Paragraph>
<Paragraph position="1"> Footnote 2: Due to lack of space, we do not describe previous work in text segmentation here in detail; we refer the reader to Utiyama and Isahara (2001) and Pevzner and Hearst (2002) for a comprehensive overview.</Paragraph>
<Paragraph position="2"> Our work differs from these previous approaches in that paragraphs do not always correspond to sub-topics. While topic shifts often correspond to paragraph breaks, not all paragraph breaks indicate a topic change. Breaks between paragraphs are often inserted for other (not very well understood) reasons (see Stark, 1988). Therefore, the segment granularity is finer for paragraphs than for topics.
An important advantage of methods developed for paragraph detection (as opposed to those developed for text segmentation) is that training data is readily available, since paragraph boundaries are usually unambiguously marked in texts. Hence, supervised methods are &quot;cheap&quot; for this task.</Paragraph>
</Section>
</Paper>