<?xml version="1.0" standalone="yes"?> <Paper uid="N03-3009"> <Title>Spoken and Written News Story Segmentation using Lexical Chains</Title> <Section position="2" start_page="0" end_page="0" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Text segmentation can be defined as the automatic identification of boundaries between distinct textual units (segments) in a textual document. The aim of early segmentation research was to model the discourse structure of a text, thus focusing on the detection of fine-grained topic shifts at a clausal, sentence or passage/subtopic level (Hearst 1997). More recently, with the introduction of the TDT initiative (Allan et al. 1998), segmentation research has concentrated on the detection of coarse-grained topic shifts, resulting in the identification of story boundaries in news feeds. In particular, unsegmented broadcast news streams represent a challenging real-world application for text segmentation approaches, since the success of other tasks such as topic tracking or first story detection depends heavily on the correct identification of distinct and non-overlapping news stories. Most approaches to story segmentation use either Information Extraction techniques (cue phrase extraction), techniques based on lexical cohesion analysis, or a combination of both (Reynar 1998; Beeferman et al. 1999). More recently, promising results have also been achieved through the use of Hidden Markov modeling techniques, which are commonly used in speech recognition applications (Mulbregt et al. 1999).</Paragraph> <Paragraph position="1"> In this paper we focus on lexical cohesion based approaches to story segmentation. Lexical cohesion is one element of a broader linguistic device called cohesion, which is described as the textual quality responsible for making the elements of a text appear unified or connected.
More specifically, lexical cohesion 'is the cohesion that arises from semantic relationships between words' (Morris and Hirst 1991). With respect to segmentation, an analysis of lexical cohesion can be used to indicate portions of text that represent single topical units or segments, i.e. portions that contain a high number of semantically related words. Almost all approaches to lexical cohesion based segmentation examine patterns of lexical repetition in the text, e.g. (Reynar 1998; Hearst 1997; Choi 2000). However, there are four additional types of lexical cohesion present in text: synonymy (car, automobile), specialization/generalization (horse, stallion), part-whole/whole-part (politicians, government) and statistical co-occurrences (Osama bin Laden, World Trade Center). Lexical chaining based approaches to text segmentation, on the other hand, analyse all of these aspects of lexical cohesion. Lexical chains are defined as groups of semantically related words that represent the lexical cohesive structure of a text, e.g. {flower, petal, rose, garden, tree}. In our lexical chaining implementation, words are clustered based on the existence of statistical relationships and lexicographical associations (provided by the WordNet online thesaurus) between terms in a text.</Paragraph> <Paragraph position="2"> There have been three previous attempts to tackle text segmentation using lexical chains. The first, by Okumura and Honda (1994), involved an evaluation based on five Japanese texts; the second, by Stairmand (1997), used twelve general interest magazine articles; and the third, by Kan et al. (1998), used fifteen Wall Street Journal and five Economist articles. All of these attempts focus on sub-topic rather than story segmentation. In contrast, this paper investigates the usefulness of lexical chains as a technique for determining story segments in spoken and written broadcast news streams. In Section 2, we explain how this technique can be refined to address story segmentation.
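The chaining process described above can be illustrated with a minimal, greedy sketch: each word joins the first existing chain containing a related word, or starts a new chain. This is a simplification for illustration only; the hand-made relatedness table below is a hypothetical stand-in for the WordNet lookups and statistical co-occurrence relationships the paper's implementation uses.

```python
def build_chains(words, related):
    """Greedy lexical chaining: append each word to the first chain
    holding a related word; otherwise start a new chain."""
    chains = []
    for w in words:
        for chain in chains:
            if any(related(w, c) for c in chain):
                chain.append(w)
                break
        else:
            chains.append([w])
    return chains

# Toy relatedness table (hypothetical; a stand-in for WordNet and
# statistical association lookups).
RELATED = {frozenset(p) for p in [
    ("flower", "petal"), ("flower", "rose"), ("rose", "garden"),
    ("garden", "tree"), ("car", "automobile"),
]}

def toy_related(a, b):
    return a == b or frozenset((a, b)) in RELATED

tokens = ["flower", "car", "petal", "rose", "automobile", "garden", "tree"]
chains = build_chains(tokens, toy_related)
# chains -> [['flower', 'petal', 'rose', 'garden', 'tree'], ['car', 'automobile']]
```

In a segmentation setting, spans of text where one chain dominates would suggest a single topical unit, while points where active chains end and new ones begin would suggest candidate story boundaries.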
In Section 3, we compare the segmentation performance of our lexical chaining algorithm with two other well-known lexical cohesion based approaches to segmentation, namely TextTiling (Hearst 1997) and C99 (Choi 2000). Finally, we examine the grammatical differences between written and spoken news media and show how these differences can be utilized to improve spoken transcript segmentation accuracy.</Paragraph> </Section> </Paper>