<?xml version="1.0" standalone="yes"?>
<Paper uid="J01-1002">
  <Title>Integrating Prosodic and Lexical Cues for Automatic Topic Segmentation</Title>
  <Section position="3" start_page="0" end_page="33" type="intro">
    <SectionTitle>
2. Previous Work
</SectionTitle>
    <Paragraph position="0"> Work on topic segmentation is generally based on two broad classes of cues. On the one hand, one can exploit the fact that topics are correlated with topical content-word usage, and that global shifts in word usage are indicative of changes in topic. Quite independently, discourse cues, or linguistic devices such as discourse markers, cue phrases, syntactic constructions, and prosodic signals are employed by speakers (or writers) as generic indicators of endings or beginnings of topical segments. Interestingly, most previous work has explored either one or the other type of cue, but only rarely both. In automatic segmentation systems, word usage cues are often captured by statistical language modeling and information retrieval techniques. Discourse cues, on the other hand, are typically modeled with rule-based approaches or classifiers derived by machine learning techniques (such as decision trees).</Paragraph>
    <Section position="1" start_page="0" end_page="32" type="sub_section">
      <SectionTitle>
2.1 Approaches Based on Word Usage
</SectionTitle>
      <Paragraph position="0"> Most automatic topic segmentation work based on text sources has explored topical word usage cues in one form or another. Kozima (1993) used mutual similarity of words in a sequence of text as an indicator of text structure. Reynar (1994) presented a method that finds topically similar regions in the text by graphically modeling the distribution of word repetitions. The method of Hearst (1994, 1997) uses cosine similarity in a word vector space as an indicator of topic similarity.</Paragraph>
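The cosine-similarity idea can be sketched as follows; this is a minimal illustration assuming whitespace tokenization and a small sentence window, not Hearst's actual TextTiling implementation, and the function names are invented:

```python
from collections import Counter
from math import sqrt

def cosine(a, b):
    # Cosine similarity between two bag-of-words vectors (Counters).
    dot = sum(a[w] * b[w] for w in set(a).intersection(b))
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a > 0 and norm_b > 0 else 0.0

def boundary_scores(sentences, window=1):
    # Compare the word distributions of adjacent sentence windows;
    # a dip in similarity suggests a topic boundary at that position.
    scores = []
    for i in range(window, len(sentences) - window + 1):
        left = Counter(w for s in sentences[i - window:i] for w in s.split())
        right = Counter(w for s in sentences[i:i + window] for w in s.split())
        scores.append((i, cosine(left, right)))
    return scores
```

Positions whose similarity falls well below that of their neighbors would be proposed as topic boundaries.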
      <Paragraph position="1"> More recently, the U.S. Defense Advanced Research Projects Agency (DARPA) initiated the Topic Detection and Tracking (TDT) program to further the state of the art in finding and following new topics in a stream of broadcast news stories. One of the tasks in the TDT effort is segmenting a news stream into individual stories. Several of the participating systems rely essentially on word usage: Yamron et al. (1998) model topics with unigram language models and their sequential structure with hidden Markov models (HMMs). Ponte and Croft (1997) extract related word sets for topic segments with the information retrieval technique of local context analysis, and then compare the expanded word sets.</Paragraph>
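As a drastically simplified illustration of the unigram-modeling idea (not Yamron et al.'s actual system, which strings such topic models together in an HMM over the story stream), one can score a sentence under smoothed per-topic unigram models; the function names and smoothing choices here are assumptions:

```python
from collections import Counter
from math import log

def train_unigram(docs, alpha=0.1):
    # Add-alpha smoothed unigram log-probability model from token lists.
    counts = Counter(w for d in docs for w in d)
    total = sum(counts.values())
    vocab = len(counts) + 1  # reserve one slot of mass for unseen words
    return lambda w: log((counts[w] + alpha) / (total + alpha * vocab))

def best_topic(sentence, models):
    # Pick the topic whose unigram model gives the sentence the
    # highest log likelihood.
    scores = {name: sum(lp(w) for w in sentence) for name, lp in models.items()}
    return max(scores, key=scores.get)
```

In the HMM formulation, a shift in which topic model best explains the running text marks a story boundary.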
    </Section>
    <Section position="2" start_page="32" end_page="32" type="sub_section">
      <SectionTitle>
2.2 Approaches Based on Discourse and Combined Cues
</SectionTitle>
      <Paragraph position="0"> Previous work on both text and speech has found that cue phrases or discourse particles (items such as &quot;now&quot; or &quot;by the way&quot;), as well as other lexical cues, can provide valuable indicators of structural units in discourse (Grosz and Sidner 1986; Passonneau and Litman 1997, among others).</Paragraph>
      <Paragraph position="1"> In the TDT framework, the UMass HMM approach described in Allan et al. (1998) uses an HMM that models the initial, middle, and final sentences of a topic segment, capitalizing on discourse cue words that indicate beginnings and ends of segments.</Paragraph>
      <Paragraph position="2"> Aligning the HMM to the data amounts to segmenting it.</Paragraph>
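A toy version of this alignment idea, with invented cue-word lists, transition probabilities, and emission scores (the real UMass system is considerably richer), might look like:

```python
from math import log

STATES = ["I", "M", "F"]  # Initial, Middle, Final sentence of a segment
# Allowed transitions: I to M, M to M, M to F, and F back to I
# (a new segment starts right after a final sentence).
TRANS = {("I", "M"): 0.0, ("M", "M"): log(0.7), ("M", "F"): log(0.3),
         ("F", "I"): 0.0}
BEGIN_CUES = {"now", "today"}       # hypothetical segment-start cue words
END_CUES = {"finally", "summary"}   # hypothetical segment-end cue words

def emit(state, sentence):
    # Toy emission score: a sentence matching its state's cue words is
    # more likely; middle sentences get a flat score.
    words = set(sentence)
    if state == "I":
        return log(0.8) if words.intersection(BEGIN_CUES) else log(0.1)
    if state == "F":
        return log(0.8) if words.intersection(END_CUES) else log(0.1)
    return log(0.3)

def viterbi(sentences):
    # Standard Viterbi alignment; the best state sequence is itself
    # a segmentation of the sentence stream.
    neg_inf = float("-inf")
    prev = {s: (emit(s, sentences[0]) if s == "I" else neg_inf)
            for s in STATES}
    back = []
    for sent in sentences[1:]:
        cur, ptr = {}, {}
        for s in STATES:
            cands = [(prev[p] + TRANS[(p, s)], p)
                     for p in STATES if (p, s) in TRANS]
            score, pred = max(cands)
            cur[s] = score + emit(s, sent)
            ptr[s] = pred
        prev, back = cur, back + [ptr]
    state = max(prev, key=prev.get)
    path = [state]
    for ptr in reversed(back):
        state = ptr[state]
        path.append(state)
    return list(reversed(path))
```

Sentences labeled "I" in the recovered path mark segment beginnings.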
      <Paragraph position="3"> Beeferman, Berger, and Lafferty (1999) combined a large set of automatically selected lexical discourse cues in a maximum entropy model. They also incorporated topical word usage into the model by building two statistical language models: one static (topic independent) and one that adapts its word predictions based on past words. They showed that the log likelihood ratio of the two predictors behaves as an indicator of topic boundaries, and can thus be used as an additional feature in the exponential model classifier.</Paragraph>
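A heavily simplified stand-in for this indicator, using a unigram cache in place of the paper's adaptive trigram model, could be sketched as follows (all parameter values and names are illustrative): because the cache's advantage comes from recently seen topical words, the ratio collapses right after a topic shift.

```python
from collections import Counter
from math import log

def llr_stream(tokens, static_counts, cache_size=50, lam=0.3):
    # For each token, return log P_adaptive(w) - log P_static(w), where the
    # adaptive model interpolates a cache of recent words with the static
    # add-one-smoothed unigram model.
    total = sum(static_counts.values())
    vocab = len(static_counts) + 1
    history, ratios = [], []
    for w in tokens:
        p_static = (static_counts[w] + 1) / (total + vocab)
        recent = history[-cache_size:]
        cache = Counter(recent)
        p_cache = cache[w] / len(recent) if recent else 0.0
        p_adapt = (1 - lam) * p_static + lam * p_cache
        ratios.append(log(p_adapt) - log(p_static))
        history.append(w)
    return ratios
```

A sustained drop in this ratio would be used as one feature, among the lexical cues, in the exponential model.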
    </Section>
    <Section position="3" start_page="32" end_page="33" type="sub_section">
      <SectionTitle>
2.3 Approaches Using Prosodic Cues
</SectionTitle>
      <Paragraph position="0"> Prosodic cues form a subset of discourse cues in speech, reflecting systematic duration, pitch, and energy patterns at topic changes and related locations of interest. A large literature in linguistics and related fields has shown that topic boundaries (as well as similar entities such as paragraph boundaries in read speech, or discourse-level boundaries in spontaneous speech) are indicated prosodically in a manner that is similar to sentence or utterance boundaries--only stronger. Major shifts in topic typically show longer pauses, an extra-high F0 onset or &quot;reset,&quot; a higher maximum accent peak, greater range in F0 and intensity (Brown, Currie, and Kenworthy 1980; Grosz and Hirschberg 1992; Nakajima and Allen 1993; Geluykens and Swerts 1993; Ayers 1994; Hirschberg and Nakatani 1996; Nakajima and Tsukada 1997; Swerts 1997), and shifts in speaking rate (Brubaker 1972; Koopmans-van Beinum and van Donzel 1996; Hirschberg and Nakatani 1996). Such cues are known to be salient for human listeners; in fact, subjects can perceive major discourse boundaries even if the speech itself is made unintelligible via spectral filtering (Swerts, Geluykens, and Terken 1992).</Paragraph>
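For concreteness, two of these cues, the boundary pause and the F0 reset, might be computed from word-aligned, pitch-tracked speech roughly as follows; the input format and function name are hypothetical:

```python
def boundary_features(prev_words, next_words, prev_f0, next_f0):
    # prev_words / next_words: lists of (word, start_time, end_time) for the
    # utterances before and after a candidate boundary.
    # prev_f0 / next_f0: F0 samples (Hz) for those utterances.
    pause = next_words[0][1] - prev_words[-1][2]   # silence across the boundary
    f0_reset = next_f0[0] - prev_f0[-1]            # onset pitch jump ("reset")
    f0_range = max(next_f0) - min(next_f0)         # pitch range after the boundary
    rate_prev = len(prev_words) / (prev_words[-1][2] - prev_words[0][1])
    rate_next = len(next_words) / (next_words[-1][2] - next_words[0][1])
    return {"pause": pause, "f0_reset": f0_reset,
            "f0_range": f0_range, "rate_shift": rate_next - rate_prev}
```

Large pauses and large positive resets at a candidate position are the kind of evidence the cited studies associate with major topic shifts.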
      <Paragraph position="1"> Work in automatic extraction and computational modeling of these characteristics has been more limited, with most of the work in computational prosody modeling dealing with boundaries at the sentence level or below. However, there have been some studies of discourse-level boundaries in a computational framework. They differ in various ways, such as type of data (monologue or dialogue, human-human or human-computer), type of features (prosodic and lexical versus prosodic only), which features are considered available (e.g., utterance boundaries or no boundaries), to what extent features are automatically extractable and normalizable, and the machine learning approach used. Because of these vast differences, the overall results cannot be compared directly to each other or to our work, but we describe three of the approaches briefly here.</Paragraph>
      <Paragraph position="2"> An early study by Litman and Passonneau (1995) used hand-labeled prosodic boundaries and lexical information, but applied machine learning to a training corpus and tested on unseen data. The researchers combined pause, duration, and hand-coded intonational boundary information with lexical information from cue phrases (such as &quot;and&quot; and &quot;so&quot;). Additional knowledge sources included complex relations, such as coreference of noun phrases. Work by Swerts and Ostendorf (1997) used prosodic features that in principle could be extracted automatically, such as pitch range, to classify utterances from human-computer task-oriented dialogue into two categories: initial or noninitial in the discourse segment. The approach used CART-style decision trees to model the prosodic features, as well as various lexical features that, in principle, could also be estimated automatically. In this case, utterances were presegmented, so the task was to classify segments rather than find boundaries in continuous speech; some of the features included, such as type of boundary tone, may not be easy to extract robustly across speaking styles. Finally, Hirschberg and Nakatani (1998) proposed a prosody-only front end for tasks such as audio browsing and playback, which could segment continuous audio input into meaningful information units. They used automatically extracted pitch, energy, and &quot;other&quot; features (such as the cross-correlation value used by the pitch tracker in determining the estimate of F0) as inputs to CART-style trees, and aimed to predict major discourse-level boundaries. They found various effects of frame window length and speakers, but concluded overall that prosodic cues could be useful for audio browsing applications.</Paragraph>
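The CART idea behind these classifiers can be illustrated with a single threshold split on one prosodic feature; the data, feature, and function below are invented for the sketch (a full CART tree recurses on such splits over many features):

```python
def best_stump(samples):
    # samples: list of (pause_duration, label) pairs with labels
    # "initial" / "noninitial".  Try each midpoint between sorted feature
    # values as a threshold and keep the split with the fewest
    # misclassifications, as a one-node CART-style tree would.
    values = sorted(p for p, _ in samples)
    best = None
    for a, b in zip(values, values[1:]):
        thr = (a + b) / 2
        errs = sum(1 for p, lab in samples
                   if (lab == "initial") != (p > thr))
        if best is None or best[1] > errs:
            best = (thr, errs)
    return best  # (threshold, training errors)
```

Predicting "initial" whenever the pause exceeds the learned threshold mirrors one branch decision of the decision trees used in the studies above.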
    </Section>
  </Section>
</Paper>