<?xml version="1.0" standalone="yes"?> <Paper uid="N04-4035"> <Title>Prosody-based Topic Segmentation for Mandarin Broadcast News</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Prosody and Mandarin </SectionTitle> <Paragraph position="0"> In this paper we focus on topic segmentation in Mandarin Chinese broadcast news. Mandarin Chinese is a tone language in which lexical identity is determined by a pitch contour, or tone, associated with each syllable. This additional use of pitch raises the question of the cross-linguistic applicability of the prosodic cues, especially pitch cues, identified for non-tone languages. Specifically, do we find intonational cues in tone languages? The fact that emphasis is marked intonationally by expansion of pitch range even in the presence of Mandarin lexical tone (Shen, 1989) suggests encouragingly that prosodic, intonational cues to other aspects of information structure might also prove robust in tone languages.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Prosodic Features </SectionTitle> <Paragraph position="0"> We consider four main classes of prosodic features for our analysis and classification: pitch, intensity, silence, and duration. Pitch, represented by f0 in Hertz, was computed with the &quot;To pitch&quot; function of the Praat system (Boersma, 2001). We selected the highest-ranked pitch candidate value in each voiced region. We then applied a 5-point median filter to smooth out local instabilities in the signal, such as vocal fry or small regions of spurious pitch doubling or halving. Analogously, we computed the intensity in decibels for each 10 ms frame with the Praat &quot;To intensity&quot; function, followed by similar smoothing.</Paragraph> <Paragraph position="1"> For consistency and to allow comparability, we compute all figures for word-based units, using the ASR transcriptions provided with the TDT Mandarin data. The words are used to establish time spans for computing pitch or intensity mean or maximum values, to enable durational normalization and the pairwise comparisons reported below, and to identify silence duration.</Paragraph> <Paragraph position="2"> It is well established (Ross and Ostendorf, 1996) that for robust analysis pitch and intensity should be normalized by speaker, since, for example, average pitch is largely incomparable for male and female speakers. In the absence of speaker identification software, we approximate speaker normalization with story-based normalization, computed as (x − μ) / σ, where μ and σ are the mean and standard deviation of the feature over the story, assuming one speaker per topic. For duration, we consider both absolute and normalized word duration, where average word duration is used as the mean in the calculation above.</Paragraph> </Section>
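As a rough illustration of the feature preparation described above, the sketch below applies 5-point median smoothing to frame-level pitch and intensity tracks and then computes per-story normalized word-level values. It is a minimal sketch under stated assumptions, not the authors' code: it assumes the frame-level f0 and intensity values have already been extracted (e.g., with Praat's "To pitch" and "To intensity"), and the `Word` record and function names are illustrative.

```python
# Minimal sketch of the Section 4 feature preparation.  Assumes frame-level
# f0 (Hz) and intensity (dB) tracks were already extracted (e.g., with Praat)
# and that word time spans come from the ASR transcriptions.  Names are
# illustrative, not from the paper.
from dataclasses import dataclass
from typing import List
import numpy as np
from scipy.signal import medfilt


@dataclass
class Word:
    story_id: str
    start: float                 # seconds
    end: float                   # seconds
    mean_pitch: float = 0.0
    mean_intensity: float = 0.0


def smooth(track: np.ndarray) -> np.ndarray:
    """5-point median filter to remove spurious doubling/halving and vocal fry."""
    return medfilt(track, kernel_size=5)


def word_means(words: List[Word], f0: np.ndarray, intensity: np.ndarray,
               frame_step: float = 0.01) -> None:
    """Average the smoothed frame values over each word's time span (10 ms frames)."""
    f0, intensity = smooth(f0), smooth(intensity)
    for w in words:
        lo, hi = int(w.start / frame_step), int(w.end / frame_step)
        voiced = f0[lo:hi][f0[lo:hi] > 0]            # ignore unvoiced frames
        w.mean_pitch = float(voiced.mean()) if voiced.size else 0.0
        w.mean_intensity = float(intensity[lo:hi].mean())


def story_normalize(values: np.ndarray) -> np.ndarray:
    """Story-based normalization, (x - mean) / stdev, as a proxy for speaker normalization."""
    mu, sigma = values.mean(), values.std()
    return (values - mu) / sigma if sigma > 0 else values - mu
```

In this sketch `story_normalize` would be applied separately to the pitch, intensity, and duration values of the words grouped by `story_id`; duration is normalized against the story's average word duration, as in the calculation above.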
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Data Set </SectionTitle> <Paragraph position="0"> We utilize the Mandarin Chinese broadcast news audio corpus from the Topic Detection and Tracking (TDT-3) collection (Wayne, 2000) as our data set. Story segmentation in Mandarin and English broadcast news and newswire text was one of the TDT tasks and also an enabling technology for other retrieval tasks. We use the segment boundaries provided with the corpus as our gold-standard labeling. Our collection comprises 3014 stories drawn from approximately 113 hours over three months (October-December 1998) of news broadcasts from the Voice of America (VOA) in Mandarin Chinese. The transcriptions span approximately 740,000 words. The audio is stored in NIST Sphere format, sampled at 16 kHz with 16-bit linear encoding.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Prosodic Analysis </SectionTitle> <Paragraph position="0"> To evaluate the potential applicability of prosodic features to story segmentation in Mandarin Chinese, we performed some initial data analysis to determine whether words in story-final position differed from the same words used throughout the story. This lexical match allows direct pairwise comparison. We anticipated that, because words in Mandarin vary not only in phoneme sequence but also in tone sequence, a direct comparison would be particularly important for eliminating sources of variability. Features that differed significantly would form the basis of our classifier feature set.</Paragraph> <Paragraph position="1"> Figure 1: Comparison of word duration, normalized pitch, and normalized intensity between words in segment non-final and segment-final positions.</Paragraph> <Paragraph position="2"> We found highly significant differences, based on two-tailed paired t-tests, for each of the features we considered. Specifically, word duration, normalized mean pitch, and normalized mean intensity all differed significantly for words in topic-final position relative to occurrences throughout the story (Figure 1). Word duration increased, while both pitch and intensity decreased. A small side experiment using 15 hours of English broadcast news from the TDT collection shows similar trends, though the magnitude of the change in intensity is smaller than that observed for the Chinese.</Paragraph> <Paragraph position="3"> These contrasts are consistent with, though in some cases stronger than, those identified for English (Nakatani et al., 1995) and Dutch (Swerts, 1997). The relatively large size of the corpus enhances the salience of these effects. We find, importantly, that reduction in pitch as a signal of topic finality is robust across the typological contrast of tone and non-tone languages. These findings demonstrate highly significant intonational effects even in tone languages and suggest that prosodic cues may be robust across a wide range of languages.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Classification </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.1 Feature Set </SectionTitle> <Paragraph position="0"> The results above indicate that duration, pitch, and intensity should be useful for automatic prosody-based identification of topic boundaries. To facilitate cross-speaker comparisons, we use normalized representations of average pitch, average intensity, and word duration. We also include absolute word duration. These features form a word-level, context-independent feature set.</Paragraph> <Paragraph position="1"> Since segment boundaries and their cues exist to contrastively signal the separation between topics, we augment these features with local context-dependent measures. Specifically, we add features that measure the change between the current word and the next word. This contextualization adds four contextual features: change in normalized average pitch, change in normalized average intensity, change in normalized word duration, and duration of following silence. We have thus posed boundary detection as the task of finding segment-final words, so the technique incorporates a single-word lookahead; the task could instead be reposed as identification of topic-initial words, avoiding the lookahead and yielding a more on-line process, which we leave to future research.</Paragraph> </Section>
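To make the feature set concrete, here is a small sketch of how the eight word-level features could be assembled: the four context-independent values plus per-word deltas to the following word and the duration of the following silence. It is an illustration under assumed data structures (the `Word` fields and function name are hypothetical), not the paper's implementation.

```python
# Sketch of the eight-feature vector suggested by Section 7.1: four
# context-independent word-level features plus four context-dependent ones
# (deltas to the next word and the following silence).  Field names are assumed.
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    start: float               # seconds
    end: float                 # seconds
    norm_pitch: float          # story-normalized mean pitch
    norm_intensity: float      # story-normalized mean intensity
    norm_duration: float       # story-normalized word duration
    duration: float            # absolute word duration (seconds)


def feature_vector(words: List[Word], i: int) -> List[float]:
    """Features for word i; the last word gets zero deltas and no silence."""
    w = words[i]
    base = [w.norm_pitch, w.norm_intensity, w.norm_duration, w.duration]
    if i + 1 < len(words):
        nxt = words[i + 1]
        context = [nxt.norm_pitch - w.norm_pitch,
                   nxt.norm_intensity - w.norm_intensity,
                   nxt.norm_duration - w.norm_duration,
                   max(0.0, nxt.start - w.end)]    # following silence duration
    else:
        context = [0.0, 0.0, 0.0, 0.0]
    return base + context
```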
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.2 Classifier Training and Testing Configuration </SectionTitle> <Paragraph position="0"> We employed Quinlan's C4.5 (Quinlan, 1992) decision tree classifier to provide a readily interpretable classifier.</Paragraph> <Paragraph position="1"> The vast majority of word positions in our collection are non-topic-final, so in order to focus training and testing on topic boundary identification, we downsample our corpus to produce training and test sets with a 50/50 split of topic-final and non-topic-final words. We trained on 2789 topic-final words and 2789 non-topic-final words, not matched in any way, drawn randomly from the full corpus. We tested on a held-out set of 200 topic-final and non-topic-final words.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 7.3 Classifier Evaluation </SectionTitle> <Paragraph position="0"> The resulting classifier achieved 95.8% accuracy on the held-out test set, closely approximating pruned-tree performance on the training set. This is a substantial improvement over the 50% sample baseline. A portion of the decision tree is reproduced in Figure 2.</Paragraph> <Paragraph position="1"> Inspection of the tree indicates the key role of silence, as well as the use of both contextual and purely local features of pitch and intensity. Durational features play a lesser role in the classifier. The classifier relies on the theoretically and empirically grounded features of pitch, intensity, and silence: it has been suggested that higher pitch and wider range are associated with topic initiation, while lower pitch or narrower range is associated with topic finality.</Paragraph> <Paragraph position="2"> We also performed a contrastive experiment in which silence features were excluded, to assess the dependence on these features. The resulting classifier achieved an accuracy of 89.4% on the held-out balanced test set, reinforcing the utility of pitch and intensity features for classification.</Paragraph> <Paragraph position="3"> We performed a second set of contrastive experiments to explore the impact of different lexical tones on classification accuracy. We grouped words based on the lexical tone of the initial syllable into high, rising, low, falling, and neutral groups. We found no tone-based differences in classification, with all groups achieving 94-96% accuracy. Since the magnitude of the difference in pitch based on discourse position is comparable to that based on lexical tone identity, and the overlap between pitch values in non-final and final positions is relatively small, we obtain consistent results.</Paragraph> </Section> </Section> </Paper>
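As a rough illustration of the training and evaluation setup described in Section 7.2, the sketch below trains a decision tree on a downsampled 50/50 split of topic-final and non-topic-final word feature vectors and reports held-out accuracy. The paper used Quinlan's C4.5; scikit-learn's DecisionTreeClassifier is used here only as a readily available stand-in, and the input names (`features`, `is_final`) are hypothetical.

```python
# Sketch of the balanced-training setup of Section 7.2, with scikit-learn's
# DecisionTreeClassifier standing in for C4.5 (the paper's actual tool).
# `features` is an (n_words, 8) array of word-level feature vectors and
# `is_final` a boolean array marking topic-final words; both are assumed inputs.
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def balanced_split(features: np.ndarray, is_final: np.ndarray,
                   n_test: int = 200, seed: int = 0):
    """Downsample to a 50/50 mix of topic-final and non-final words, then hold out a test set."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(is_final)
    neg = rng.choice(np.flatnonzero(~is_final), size=len(pos), replace=False)
    idx = rng.permutation(np.concatenate([pos, neg]))
    test, train = idx[:n_test], idx[n_test:]
    return (features[train], is_final[train].astype(int),
            features[test], is_final[test].astype(int))


def train_and_eval(features: np.ndarray, is_final: np.ndarray) -> float:
    X_tr, y_tr, X_te, y_te = balanced_split(features, is_final)
    tree = DecisionTreeClassifier()           # interpretable, axis-aligned splits
    tree.fit(X_tr, y_tr)
    return tree.score(X_te, y_te)             # accuracy on the held-out balanced set
```

A contrastive run without silence, as in Section 7.3, would simply drop the final column of `features` before splitting and training.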