<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2244">
<Title>Optimal Multi-Paragraph Text Segmentation by Dynamic Programming</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0">Electronic full-text documents and digital libraries make the utilization of texts much more effective than before; yet they also pose new problems and requirements. For example, document retrieval based on string searches typically returns either the whole document or just the occurrences of the searched words. What the user is often after, however, is a microdocument: a part of the document that contains the occurrences and is reasonably self-contained.</Paragraph>
<Paragraph position="1">Microdocuments can be created by utilizing the lexical cohesion (term repetition and semantic relations) present in the text. There exist several methods for calculating a similarity curve, or a sequence of similarity values, representing the lexical cohesion of successive constituents (such as paragraphs) of a text (see, e.g., (Hearst, 1994; Hearst, 1997; Kozima, 1993; Morris and Hirst, 1991; Yaari, 1997; Youmans, 1991)). Methods for deciding the locations of fragment boundaries are, however, not that common, and those that exist are often rather heuristic in nature.</Paragraph>
<Paragraph position="2">To evaluate our fragmentation method, to be explained in Section 2, we calculate the paragraph similarities as follows. We apply stemming, remove stopwords, and count the frequencies of the remaining words, i.e., terms. We then take a predefined number, e.g., 50, of the most frequent terms to represent each paragraph and compute the similarity using the cosine coefficient (see, e.g., (Salton, 1989)). Furthermore, we apply a sliding-window method: instead of just one paragraph, several paragraphs on both sides of each paragraph boundary are considered. The paragraph vectors are weighted by their distance from the boundary in question, with the immediately adjacent paragraphs receiving the highest weight. The benefit of using a larger window is that it smooths the effect of short paragraphs and of example-like paragraphs that interrupt a chain of coherent paragraphs.</Paragraph>
</Section>
</Paper>
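
A minimal Python sketch of the paragraph-similarity computation described in paragraph 2 above, assuming several details the introduction leaves open: the toy stopword list, the crude suffix-stripping stand-in for a real stemmer, the window size, and the linear distance weights are all placeholders, and the function names (terms, cosine, boundary_similarity) are illustrative rather than taken from the paper.

# Illustrative sketch: top-k term vectors per paragraph, cosine coefficient,
# and a distance-weighted sliding window around each paragraph boundary.
# Parameter values and the stemming/stopword details are assumptions.
from collections import Counter
from math import sqrt

STOPWORDS = {"the", "a", "an", "of", "and", "is", "in", "to", "that"}  # toy list

def terms(paragraph, max_terms=50):
    """Crudely stem, drop stopwords, and keep the most frequent terms."""
    words = [w.strip(".,;:!?").lower() for w in paragraph.split()]
    stems = [w.rstrip("s") for w in words if w and w not in STOPWORDS]  # crude stemming
    return Counter(dict(Counter(stems).most_common(max_terms)))

def cosine(u, v):
    """Cosine coefficient between two sparse term-frequency vectors."""
    dot = sum(u[t] * v.get(t, 0) for t in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

def boundary_similarity(paragraphs, boundary, window=3):
    """Similarity across the boundary between paragraphs[boundary-1] and
    paragraphs[boundary], combining several paragraphs on each side with
    weights that decrease with distance from the boundary."""
    left, right = Counter(), Counter()
    for offset in range(window):
        weight = window - offset  # immediately adjacent paragraphs get the highest weight
        i, j = boundary - 1 - offset, boundary + offset
        if i >= 0:
            for t, f in terms(paragraphs[i]).items():
                left[t] += weight * f
        if j < len(paragraphs):
            for t, f in terms(paragraphs[j]).items():
                right[t] += weight * f
    return cosine(left, right)

# One similarity value per paragraph boundary; low values suggest candidate breaks.
paragraphs = ["Cats chase mice.", "Mice fear cats.", "Stock markets fell sharply."]
print([round(boundary_similarity(paragraphs, b), 3) for b in range(1, len(paragraphs))])

The resulting sequence of boundary similarities is the kind of similarity curve on which the fragmentation method of Section 2 operates.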