File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/05/i05-3010_metho.xml
Size: 14,040 bytes
Last Modified: 2025-10-06 14:09:35
<?xml version="1.0" standalone="yes"?> <Paper uid="I05-3010"> <Title>Turn-taking in Mandarin Dialogue: Interactions of Tone and Intonation</Title> <Section position="3" start_page="72" end_page="72" type="metho"> <SectionTitle> 2 Experimental Data </SectionTitle> <Paragraph position="0"> The data in this study was drawn from the Taiwanese Putonghua Speech Corpus1. The materials chosen include 5 spontaneous dialogues by Taiwanese speakers of Mandarin, seven females and three males. The dialogues, averaging 20 minutes in duration, were recorded on two channels, one per speaker, in a quiet room and digitized at 16KHz sampling. The recordings were later manually transcribed and segmented into words; turn-beginnings and overlaps were timestamped. The manual word segmentation was based on both syntax and phonology according to a methodology described in detail in (Duanmu, 1996).</Paragraph> <Section position="1" start_page="72" end_page="72" type="sub_section"> <SectionTitle> 2.1 Prosodic Features </SectionTitle> <Paragraph position="0"> For the subsequent analysis, the conversations were divided into chunks based on the turn and overlap time-stamps. Using a Chinese character-to-pinyin dictionary and a hand-constructed mapping of pinyin sequences to ARPABET phonemes, the transcribed text was then force-aligned to the corresponding audio segments using the language porting mechanism in the University of Colorado Sonic speech recognizer (Pellom et al., 2001). The resulting alignment provided phone, syllable, and word durations as well as silence positions and durations. Pitch and intensity values for voiced regions were computed using the functions &quot;To Pitch&quot; and &quot;To Intensity&quot; in the freely available Praat acoustic analysis software package(Boersma, 2001).</Paragraph> <Paragraph position="1"> We then computed normalized pitch and intensity values based on log-scaled z-score normalization of each conversation side. Based on the above alignment, we then computed maximum and mean pitch and intensity values for each syllable and word for all voiced regions. Given the presence of lexical tone, we extracted five points evenly distributed across the &quot;final&quot; region of the syllable, excluding the initial consonant, if any.</Paragraph> <Paragraph position="2"> We then estimated the linear syllable slope based on the latter half of this region in which the effects of tonal coarticulation are likely to be minimized under the pitch target approximation model(Xu, 1997).</Paragraph> </Section> </Section> <Section position="4" start_page="72" end_page="74" type="metho"> <SectionTitle> 3 Acoustic Analysis of Turn-taking </SectionTitle> <Paragraph position="0"> Each of the turn units extracted above was tagged based on its starting and ending conditions as one of four types: smooth, rough, intersmooth, and interrough. &quot;Smooth&quot; indicates a segment-ending transition from one speaker to another, not caused by the start of overlap with another speaker. By contrast, a rough transition indicates the end of a chunk at the start of overlap with another speaker. The prefix &quot;inter&quot; indicates the turn began with an interruption, identified by overlap with the previous speaker and change of speaker holding the floor. In this class, the new speaker continues to hold the floor after the period of overlap.</Paragraph> <Paragraph position="1"> We contrast turn unit initial and turn unit final syllables for each type of transition and across all turns. 
<Section position="4" start_page="72" end_page="74" type="metho"> <SectionTitle> 3 Acoustic Analysis of Turn-taking </SectionTitle>
<Paragraph position="0"> Each of the turn units extracted above was tagged, based on its starting and ending conditions, as one of four types: smooth, rough, intersmooth, and interrough. &quot;Smooth&quot; indicates a segment-ending transition from one speaker to another that is not caused by the start of overlap with another speaker. By contrast, a rough transition indicates the end of a chunk at the start of overlap with another speaker. The prefix &quot;inter&quot; indicates that the turn began with an interruption, identified by overlap with the previous speaker and a change of the speaker holding the floor. In this class, the new speaker continues to hold the floor after the period of overlap.</Paragraph>
<Paragraph position="1"> We contrast turn unit initial and turn unit final syllables for each type of transition and across all turns, comparing mean pitch and mean intensity in each case. We find in all cases highly significant differences between the mean pitch of turn unit initial syllables and the mean pitch of final syllables (p < 0.0001), as illustrated in Figure 1. Syllables in initial position have much higher log-scaled mean pitch in all conditions. For intensity, we find highly significant differences across all conditions (p < 0.005), with initial syllables having higher amplitude than final syllables. These contrasts appear in Figure 2. Furthermore, comparing the final intensity of transitions not marked by the start of overlap with the intensity of the final pre-overlap syllable in transitions caused by overlap, we find significantly higher normalized mean intensity in all rough transitions relative to others.</Paragraph>
<Paragraph position="2"> In contrast, comparable differences in pitch do not reach significance.</Paragraph>
<Paragraph position="3"> Finally, we compare smooth turn unit initiations (&quot;smooth&quot;) to successful interruptions (&quot;interrough&quot;, &quot;intersmooth&quot;), contrasting initial syllables in each class. Here we find that both normalized mean pitch (Figure 3) and normalized mean intensity (Figure 4) in turn unit initial syllables are significantly higher in interruptions than in smooth initiations, suggesting that speakers use heightened prosodic cues to take the floor by interruption.</Paragraph>
<Paragraph position="4"> [Figures 1 and 2: ... syllables in initial and final position across turn types. Values for initial position are in grey, final position in black.]</Paragraph>
<Paragraph position="5"> [Figures 3 and 4: ... syllables in smooth turn transitions and interruptions. Values for smooth transitions are in black, interruptions in grey.]</Paragraph>
<Paragraph position="6"> These descriptive analyses demonstrate that intonational cues to turn-taking do play a role in a tone language. Not only does intensity play a significant role, but pitch is also employed to distinguish initiation and finality, in spite of its concurrent use in determining lexical identity. In the following section, we describe the effects on tone height and tone shape caused by these broader intonational phenomena.</Paragraph>
</Section>
<Section position="5" start_page="74" end_page="74" type="metho"> <SectionTitle> 4 Tone and Intonation </SectionTitle>
<Paragraph position="0"> We have determined that syllables in turn unit final position have dramatically reduced average pitch relative to those in turn unit initial position, and that these contrasts can serve to signal turn-change and speaker change, as suggested by (Duncan, 1974). How do these changes interact with lexical identity and lexical tone? Since tone operates on the syllable in Chinese, we consider the average pitch and tone contours of syllables in final and non-final position. We find that average pitch for all tones is reduced, while relative tone height is largely preserved.3 Thus a final high tone is readily distinguishable from a final low tone, if the listener can interpret the syllable as turn-final. The contrasts appear in Figure 5.</Paragraph>
<Paragraph position="1"> [Figure 5: ... canonical tones in turn non-final and final positions. Values for non-final positions are in grey, final positions in black.]</Paragraph>
<Paragraph position="2"> [Footnote 3: ... classes, but differences for intensity do not reach significance.]</Paragraph>
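As an illustration of this comparison, per-tone average pitch in final versus non-final position could be tabulated along the following lines. This is a minimal Python sketch over a hypothetical record layout (tone category, position, normalized mean pitch); the values are placeholders, not the paper's data:

    from collections import defaultdict

    # Hypothetical records; real values would come from the alignment
    # and normalization described in Section 2.1.
    syllables = [
        (1, "non-final", 0.6), (1, "final", -0.5),
        (2, "non-final", 0.1), (2, "final", -1.0),
        (3, "non-final", -0.4), (3, "final", -1.5),
        (4, "non-final", 0.4), (4, "final", -0.7),
    ]

    totals = defaultdict(lambda: [0.0, 0])
    for tone, position, pitch in syllables:
        acc = totals[(tone, position)]
        acc[0] += pitch
        acc[1] += 1

    # Mean pitch per (tone, position): every tone is lower in final
    # position, but the relative height ordering across tones persists.
    for (tone, position), (total, n) in sorted(totals.items()):
        print(f"tone {tone}, {position}: {total / n:+.2f}")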
<Paragraph position="3"> Turning to tone contour, we likewise find little change between non-final and final contours, with the contours running parallel but at a much lower pitch.4 For illustration, the mid-rising and high falling tones are shown in Figure 6. Comparable behavior has been observed at other discourse boundaries, such as story boundaries in newswire speech (Levow, 2004).</Paragraph>
<Paragraph position="4"> [Figure 6: ... mid-rising and high falling tones in turn non-final and final positions. Values for non-final positions are in heavy lines, final positions in thin lines. Mid-rising tone is in black, dashed lines; high falling in solid lines.]</Paragraph>
<Paragraph position="5"> [Footnote 4: ... canonical forms even in non-final position. This variation may be attributed to a combination of tonal coarticulatory effects and the presence of other turn-internal boundaries.]</Paragraph>
</Section>
<Section position="6" start_page="74" end_page="76" type="metho"> <SectionTitle> 5 Recognizing Turn Unit Boundaries and Interruptions </SectionTitle>
<Paragraph position="0"> Based on the salient contrasts in pitch and intensity observed above, we employ prosodic features both to identify turn boundaries and to distinguish between the start of interruptions and smooth transitions. We further contrast the use of prosodic features with text n-gram features.</Paragraph>
<Section position="2" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 5.1 Classifier Features: Prosodic </SectionTitle>
<Paragraph position="0"> The features used in the classifiers trained to recognize turn boundaries and turn types fall into two classes: local and contextual. The local features describe the words or syllables themselves, while the contextual features capture contrasts between adjacent words or syllables. The first set of features thus includes the mean pitch and mean intensity for the current word and syllable, the word duration, and the maximum pitch and intensity for the syllable. The second set of features includes the length of any following silence and the differences in mean pitch and mean intensity between the current word or syllable and the following word or syllable.</Paragraph>
</Section>
<Section position="3" start_page="75" end_page="75" type="sub_section"> <SectionTitle> 5.2 Classifier Features: Text </SectionTitle>
<Paragraph position="0"> For contrastive purposes, we also consider the use of textual features for turn boundary and boundary type classification.5 Here we exploit syllable and word features, as well as syllable n-gram features. We use the toneless pinyin representation of the current word and of the final syllable in each word. Such features aim to capture, for example, question particles that signal the end of a turn.</Paragraph>
<Paragraph position="1"> In addition, we extracted the five preceding and five following syllables in the sequence around the current syllable. We then experimented with different window widths for n-gram construction, ranging from one to five, as supported by the classifier described below.</Paragraph>
<Paragraph position="2"> [Footnote 5: All text features are drawn from the ground truth manual text transcripts.]</Paragraph>
</Section>
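To make the feature inventory of Sections 5.1 and 5.2 concrete, a per-word feature vector could be assembled roughly as follows. This Python sketch uses our own assumed dictionary layout and key names; it is an illustration of the described features, not the paper's implementation:

    def features_for(words, i):
        # `words` is an assumed list of dicts with keys: mean_pitch,
        # mean_intensity, max_pitch, max_intensity, duration,
        # following_silence, pinyin (toneless), syllables (toneless
        # pinyin syllable list).
        w = words[i]
        nxt = words[i + 1] if i + 1 < len(words) else None
        feats = {
            # Local prosodic features: the current word itself.
            "mean_pitch": w["mean_pitch"],
            "mean_intensity": w["mean_intensity"],
            "max_pitch": w["max_pitch"],
            "max_intensity": w["max_intensity"],
            "duration": w["duration"],
            # Contextual prosodic features: following silence and
            # differences from the following word.
            "following_silence": w["following_silence"],
            "delta_pitch": nxt["mean_pitch"] - w["mean_pitch"] if nxt else 0.0,
            "delta_intensity": nxt["mean_intensity"] - w["mean_intensity"] if nxt else 0.0,
            # Text features: toneless pinyin word and final syllable.
            "word": w["pinyin"],
            "final_syllable": w["syllables"][-1],
        }
        # Text window: up to five preceding and five following syllables,
        # from which the classifier can select n-grams (n = 1..5).
        prev = [s for x in words[:i] for s in x["syllables"]][-5:]
        foll = [s for x in words[i + 1:] for s in x["syllables"]][:5]
        feats["syllable_window"] = " ".join(prev + foll)
        return feats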
<Section position="4" start_page="75" end_page="76" type="sub_section"> <SectionTitle> 5.3 Classifiers </SectionTitle>
<Paragraph position="0"> We performed experiments with several classifiers: Boostexter (Schapire and Singer, 2000), a well-known implementation of boosted weak learners; multi-class Support Vector Machines with linear kernels (Chang and Lin, 2001), an implementation of a margin-based discriminative classifier; and decision trees, as implemented by C4.5 (Quinlan, 1992). All classifiers yielded comparable results on this classification task. Here we present the results using Boostexter, to exploit its support for text features and automatic n-gram feature selection as well as its relative interpretability. We used downsampled, balanced training and test sets to enable assessment of the utility of these features for classification, and employed 10-fold cross-validation, presenting the average over the runs.</Paragraph>
<Paragraph position="1"> Using the features above, we created a set of 1610 turn unit final words and 1610 non-final words. Based on 10-fold cross-validation using combined text and prosodic features, we obtain an accuracy of 93.1% on this task. The key prosodic features in this classification are silence duration, which is the first feature selected, and maximum intensity. The highest-ranked lexical features are preceding 'ta', preceding 'ao', and following 'dui'. If silence features are excluded, classification accuracy drops substantially to 69%, still better than the 50% chance baseline for this set. In this case, syllable mean intensity features become the first selected for classification.</Paragraph>
<Paragraph position="2"> We also consider the relative effectiveness of classifiers based on text features or prosodic features alone, with and without silence features. We find that, when silence duration features are available, text-based and prosody-based classifiers perform comparably, at 93.5% and 93.7% accuracy respectively, near the effectiveness of the combined text and prosodic classifier. However, when silence features are excluded, a greater difference emerges between classification based on text features and classification based on prosodic features. Specifically, without silence information, classification based on text features alone reaches only 59.5%, while classification based on prosodic features remains somewhat more robust, though still with a substantial drop in accuracy, at 66.5%. This finding suggests that although the presence of a longer silence interval is the best cue to finality, additional prosodic features, such as differences in pitch and intensity, concurrently signal the opportunity for another speaker to start a turn. Text features, especially in highly disfluent conversational speech, provide less clear evidence. Results appear in Table 5.3.1.</Paragraph>
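Boostexter is a standalone tool, so as a rough stand-in the evaluation protocol (balanced downsampling plus 10-fold cross-validation over boosted weak learners) might be approximated with scikit-learn's AdaBoost. The data below are placeholders; this sketch shows the setup, not the paper's actual system:

    import numpy as np
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.model_selection import cross_val_score

    rng = np.random.default_rng(0)

    def downsample_balanced(X, y):
        # Downsample the majority class so both classes are equal-sized,
        # giving the 50% chance baseline used in the paper's setup.
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        n = min(len(pos), len(neg))
        keep = np.concatenate([rng.choice(pos, n, replace=False),
                               rng.choice(neg, n, replace=False)])
        return X[keep], y[keep]

    # X: prosodic feature matrix (silence duration, pitch/intensity
    # means, deltas, ...); y: 1 for turn-unit-final words, 0 otherwise.
    X = rng.normal(size=(3220, 8))      # placeholder feature matrix
    y = rng.integers(0, 2, size=3220)   # placeholder labels
    Xb, yb = downsample_balanced(X, y)

    # Boosted weak learners (decision stumps), scored by 10-fold CV.
    clf = AdaBoostClassifier(n_estimators=100)
    scores = cross_val_score(clf, Xb, yb, cv=10)
    print(f"mean 10-fold accuracy: {scores.mean():.3f}")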
<Paragraph position="3"> In order to create a comparable context for initial words in interruptions and smoothly initiated turns, we reversed the direction of the contextual comparisons, comparing the features of the preceding word to those of the current word and measuring pre-word silences rather than following silences. Using this configuration, we created a set of 218 interruption-initial words and 218 initial words from smooth transitions without overlap. Based on 10-fold cross-validation for this downsampled, balanced case, we obtain an accuracy of 62%, relative to a 50% baseline. The best classifiers employed only prosodic features, selecting silence duration and normalized mean word pitch. The addition of text features degrades test set performance, as the classifier rapidly overfits to the training materials. If we exclude silence-related features, accuracy drops to almost chance.</Paragraph>
</Section> </Section> </Paper>