<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1065"> <Title>Thematic segmentation of texts: two methods for two kinds of texts</Title> <Section position="4" start_page="395" end_page="395" type="relat"> <SectionTitle> 8. Related works </SectionTitle> <Paragraph position="0"> Apart from the collocation network, the methods described above rely on the same principles as Hearst (1997) and Nomoto and Nitta (1994). Although Hearst notes that paragraph breaks are sometimes used only to lighten the physical appearance of a text, we have chosen paragraphs as basic units because they are more natural thematic units than somewhat arbitrary sets of words. We assume that the paragraph breaks that indicate topic changes are always present in texts. Breaks inserted for visual reasons fall between them, and the segmentation algorithm is able to join the paragraphs they separate. Of course, the size of actual paragraphs is sometimes irregular, so comparing them gives less reliable results. However, the collocation network in the second method tends to solve this problem by homogenizing the paragraph representation.</Paragraph> <Paragraph position="1"> As in Kozima (1993), the second method exploits lexical cohesion to segment texts, but in a different way. Kozima's approach relies on computing the lexical cohesiveness of a window of words by spreading activation into a lexical network built from a dictionary. We think that this complex method is especially suitable for segmenting small parts of text but not large texts. First, it is too expensive; second, it is too precise to show the major thematic shifts clearly. In fact, Kozima's method and ours do not operate at the same level of granularity and so are complementary.</Paragraph> </Section> </Paper>