<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-0304">
<Title>Text Segmentation Using Exponential Models*</Title>
<Section position="3" start_page="35" end_page="35" type="intro">
<SectionTitle> 2 Some Previous Work </SectionTitle>
<Paragraph position="0"> In this section we very briefly discuss some previous approaches to the text segmentation problem.</Paragraph>
<Section position="1" start_page="35" end_page="35" type="sub_section">
<SectionTitle> 2.1 Text tiling </SectionTitle>
<Paragraph position="0"> The TextTiling algorithm, introduced by Hearst (Hearst, 1994), segments expository texts into multi-paragraph units of coherent discourse. A cosine measure is used to gauge the similarity between constant-size blocks of morphologically analyzed tokens. First-order rates of change of this measure are then calculated to decide the placement of boundaries between blocks, which are then adjusted to coincide with the paragraph segmentation provided as input to the algorithm. (A minimal sketch of this block-comparison step appears at the end of this section.) This approach leverages the observation that text segments are dense with repeated content words. Relying on this fact, however, may limit precision, because the repetition of concepts within a document is more subtle than a "bag of words" tokenizer and morphological filter alone can recognize.</Paragraph>
<Paragraph position="1"> Word pairs other than "self-triggers," for example, can be discovered automatically from training data using the techniques of mutual information employed by our language model. Furthermore, Hearst's approach segments at the paragraph level, which may be too coarse for applications like information retrieval on transcribed or automatically recognized spoken documents, in which paragraph boundaries are not known.</Paragraph>
</Section>
<Section position="2" start_page="35" end_page="35" type="sub_section">
<SectionTitle> 2.2 Lexical cohesion </SectionTitle>
<Paragraph position="0"> (Kozima, 1993) employs a "lexical cohesion profile" to track the semantic cohesiveness of the words in a text within a fixed-length window. In contrast to Hearst's focus on strict repetition, Kozima uses a semantic network to provide knowledge about related word pairs. Lexical cohesiveness between two words is calculated in the network by "activating" the node for one word and observing the "activity value" at the other word after some number of iterations of "spreading activation" between nodes. (This step, too, is sketched at the end of this section.) The network is trained automatically using a language-specific knowledge source (a dictionary of definitions). Kozima generalizes lexical cohesiveness to apply to a window of text, and plots the cohesiveness of successive text windows in a document, identifying the valleys in this measure as segment boundaries.</Paragraph>
<Paragraph position="1"> A graphically motivated segmentation technique called dotplotting is offered in (Reynar, 1994). This technique uses a simplified notion of lexical cohesion, depending exclusively on word repetition to find tight regions of topic similarity; a minimal version also appears at the end of this section.</Paragraph>
</Section>
<Section position="3" start_page="35" end_page="35" type="sub_section">
<SectionTitle> 2.3 Decision trees </SectionTitle>
<Paragraph position="0"> (Litman and Passonneau, 1995) presents an algorithm that uses decision trees to combine multiple linguistic features extracted from corpora of spoken text, including prosodic and lexical cues.
The decision tree algorithm, like ours, chooses from a space of candidate features, some of which are similar to our vocabulary questions. The set of candidate questions in Litman and Passonneau's approach, however, lacks features related to "lexical cohesion." In our work we incorporate such features by using a pair of language models, as described below. (A toy illustration of the decision-tree step closes this section.)</Paragraph>
</Section>
</Section>
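To make the block-comparison step of Section 2.1 concrete, the following is a minimal Python sketch of a TextTiling-style computation. It is not Hearst's implementation: the block size, the absence of smoothing, and the exact depth-scoring formula are simplifying assumptions, and the adjustment of boundaries to paragraph breaks is omitted.

import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words frequency vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(c * c for c in a.values())) * \
           math.sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0

def depth_scores(tokens, block_size=20):
    """Score each inter-block gap by how deep a similarity valley it sits in.

    tokens: a list of morphologically analyzed (stemmed, filtered) tokens.
    Returns candidate gap positions and one score per gap; high scores
    suggest segment boundaries.
    """
    gaps = list(range(block_size, len(tokens) - block_size + 1, block_size))
    # Similarity between the constant-size blocks on either side of each gap.
    sims = [cosine(Counter(tokens[g - block_size:g]),
                   Counter(tokens[g:g + block_size])) for g in gaps]
    scores = []
    for i, s in enumerate(sims):
        left = max(sims[:i + 1])    # highest peak to the left of the valley
        right = max(sims[i:])       # highest peak to the right
        scores.append((left - s) + (right - s))
    return gaps, scores

In TextTiling proper, the similarity curve is smoothed before scoring, and hypothesized boundaries are then moved to coincide with the paragraph segmentation supplied as input.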
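The "spreading activation" computation in Kozima's method (Section 2.2) can be pictured as follows. This sketch substitutes a tiny hand-built weighted graph for the dictionary-trained semantic network, and the decay and iteration parameters are illustrative assumptions, not Kozima's settings.

def spread_activation(graph, source, steps=5, decay=0.5):
    """Iteratively propagate activity from `source` through a weighted graph.

    graph: dict mapping each word to a list of (neighbor, weight) pairs,
           standing in for the dictionary-trained semantic network.
    Returns the activity value at every node after `steps` iterations; the
    activity observed at another word is its cohesiveness with `source`.
    """
    activity = {w: 0.0 for w in graph}
    activity[source] = 1.0
    for _ in range(steps):
        nxt = {w: 0.0 for w in graph}
        for w, a in activity.items():
            if a == 0.0:
                continue
            nxt[w] += a * (1 - decay)           # retain part of the activity
            for nbr, wt in graph[w]:
                nxt[nbr] += a * decay * wt      # spread the rest to neighbors
        activity = nxt
    return activity

# Toy network: cohesiveness of "car" with "engine" versus "banana".
net = {
    "car":    [("engine", 0.7), ("road", 0.3)],
    "engine": [("car", 1.0)],
    "road":   [("car", 1.0)],
    "banana": [("fruit", 1.0)],
    "fruit":  [("banana", 1.0)],
}
act = spread_activation(net, "car")   # act["engine"] > act["banana"] == 0.0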
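Dotplotting, in its simplest form, reduces to recording the coordinates of word repetitions; the density-based search Reynar uses to place boundaries, and the plot itself, are omitted from this sketch.

def dotplot(tokens):
    """Return the (i, j) coordinates at which token i equals token j.

    Dense square regions along the diagonal of the resulting plot correspond
    to stretches of text that reuse the same vocabulary, i.e. topic segments.
    """
    positions = {}
    for i, w in enumerate(tokens):
        positions.setdefault(w, []).append(i)
    return [(i, j) for occ in positions.values() if len(occ) > 1
            for i in occ for j in occ]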
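Finally, the feature-combination idea behind the decision-tree approach of Section 2.3 can be illustrated with an off-the-shelf learner. The feature set below is a hypothetical stand-in for Litman and Passonneau's candidate questions rather than their actual inventory, and scikit-learn is used purely for convenience.

from sklearn.tree import DecisionTreeClassifier

# One row per candidate boundary site; each column is an illustrative
# prosodic or lexical cue: pause length (seconds), pitch-reset flag,
# cue-phrase flag, and a word-repeated-across-the-site flag.
X = [
    [0.90, 1, 1, 0],   # long pause, pitch reset, cue phrase -> boundary
    [0.10, 0, 0, 1],   # short pause, repeated word          -> no boundary
    [0.70, 1, 0, 0],
    [0.05, 0, 0, 1],
]
y = [1, 0, 1, 0]       # 1 = segment boundary at this site

tree = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(tree.predict([[0.80, 1, 0, 0]]))   # classify an unseen candidate site

The tree learned here plays the same role as the candidate questions discussed above: each internal node tests one cue, and the induction algorithm chooses which cues to combine.
</Paper>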