<?xml version="1.0" standalone="yes"?> <Paper uid="W01-1613"> <Title>A Study of Automatic Pitch Tracker Doubling/Halving &quot;Errors&quot;</Title>
<Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Corpus Description </SectionTitle>
<Paragraph position="0"> The 1992 and 1993 dialogs from the TRAINS corpus (Heeman and Allen, 1995) were developed to facilitate the study of collaborative dialogs. In these dialogs, one person guided a &quot;user&quot; through a railroad freight transportation task, and a monitor recorded the speech without interruption. Trained phoneticians labeled a subset of this speech with ToBI information (Beckman and Ayers, 1994/1997). Around 26 minutes of speech from a subset of these dialogs was analysed with respect to pitch. A linguistically trained annotator first ran the &quot;Wedw&quot; software (Bunnell et al., 1992) in automatic mode. Hand consistency checks then examined glottal pulse locations in a wideband spectrogram. Wedw's wideband spectrogram displays an extremely darkened region where the glottis closes, approximating glottal pulse locations. Hess (1983) recommended the use of a wideband spectrogram for manual verification of pitch tracks, but he conceded that wideband spectrograms do not provide sufficient resolution for the eye. In addition to the wideband spectrogram, the annotator carefully examined the shape of the signal waveform, to ensure that glottal pulse locations were labeled consistently with respect to local peaks in the actual speech waveform. These dialogs were chosen for future in-depth investigations of which intonation-based features could be integrated into an automated dialog system for determining user intentions and generating appropriate system responses.</Paragraph> </Section>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Pitch Tracking Comparison </SectionTitle>
<Paragraph position="0"> One concern in automatic pitch tracking is how to handle occasional events where an octave halving appears in the speech signal but is not readily perceived by a human listener. The algorithm in Talkin (1995) addresses this issue with special constraints on the dynamic programming cleanup of the pitch tracker output. Figure 1 below illustrates the difficulty of comparing pitch trackers in terms of doubling errors. Figure 1 plots a manually annotated pitch track that ranges from 75 Hz to 189 Hz, and an automatically generated pitch track that ranges from 74 Hz to 102 Hz, for the interval [1.37, 1.98] of a 2.5-second utterance. The words of the utterance are &quot;and pick up three boxcars how long is that&quot;. A final rise can be heard at the end of the utterance, indicating a user's request for information from the system. The complete ToBI string associated with the utterance is &quot;H* L-L% L* H-H%&quot;. The last voiced section of this utterance (utt10) shows the speaker vacillating between one octave and another, but the final ToBI labels for the utterance are &quot;H-H%&quot;, meaning a high phrase accent followed by a high boundary tone. It would be surprising for the speaker to be speaking in the 90-100 Hz range reported by the pitch tracker, because the previous section of speech is actually an octave higher, in the 200-235 Hz range. An octave pitch drop would not make sense in the context of a combination of high ToBI labels. The speaker is female. Initial comparisons are difficult because neither method precisely specifies the pitch information, so no pitch gold standard could be produced without significant manual verification of context-dependent doubling rules. When a section of speech appears to be halved in pitch, that halving could be a perceptually significant drop, or it could be a pitch tracker error.</Paragraph>
<Paragraph position="1"> For the 320 utterances used in the evaluation (see Section 3), there were 40,419 10 ms frames for which both methods predicted a voiced frame. When the ratio X/Y was taken, where X was the automatic measurement and Y the hand measurement, this ratio fell between 0.8 and 1.2 for 96% of these frames; that is, the automatic measurement was within 20% of the hand measurement in 96% of the relevant cases. A 20% tolerance is comparable to the criteria used in past comparisons of pitch tracker outputs with a &quot;gold standard&quot;, although some studies have instead allowed a margin of 30 Hz (Niemann et al., 1994). By the 20% criterion, these two methods of pitch measurement look very similar.</Paragraph>
<Paragraph position="2"> For determining the amount of halving, one can consider the percentage of frames for which the ratio of the hand measurement to the automatic measurement was between 1.7 and 2.2. Of the roughly 40,000 voicing-coincident 10 ms frames, 0.5% counted as halvings by the automatic pitch tracker for the female speakers, and 0.4% of the male speech was halved in pitch. One female speaker, &quot;JT&quot;, accounted for half of the female pitch measurements and had a 1% pitch halving rate.</Paragraph>
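The frame-level comparison just described reduces to two ratio tests over frames where both methods report voicing: agreement when the automatic/hand ratio falls between 0.8 and 1.2, and an apparent halving when the hand/automatic ratio falls between 1.7 and 2.2. The Python sketch below illustrates the computation; the input layout (aligned 10 ms frame arrays with 0.0 marking unvoiced frames) and the names compare_tracks, auto_f0, and hand_f0 are assumptions for illustration, not part of the original study.

# Sketch of the frame-level agreement and halving checks described above.
# Assumed input (not from the paper): two aligned 10 ms frame arrays of F0
# values in Hz, with 0.0 marking unvoiced frames.
def compare_tracks(auto_f0, hand_f0):
    """Return (agreement_rate, halving_rate) over voicing-coincident frames."""
    voiced = [(a, h) for a, h in zip(auto_f0, hand_f0) if a > 0.0 and h > 0.0]
    if not voiced:
        return 0.0, 0.0
    # Agreement: automatic value within 20% of the hand value (ratio 0.8-1.2).
    agree = sum(1 for a, h in voiced if 0.8 <= a / h <= 1.2)
    # Apparent halving: hand value roughly twice the automatic value
    # (hand/automatic ratio between 1.7 and 2.2).
    halved = sum(1 for a, h in voiced if 1.7 <= h / a <= 2.2)
    n = len(voiced)
    return agree / n, halved / n

if __name__ == "__main__":
    # Toy example: the last two frames show an apparent octave halving.
    hand = [0.0, 210.0, 215.0, 220.0, 200.0, 190.0]
    auto = [0.0, 205.0, 214.0, 221.0, 100.0, 96.0]
    agreement, halving = compare_tracks(auto, hand)
    print(f"agreement within 20%: {agreement:.0%}, apparent halvings: {halving:.0%}")

As the text notes, a frame flagged as a halving by such a check may reflect either a tracker error or a genuine octave drop in the signal.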
<Paragraph position="3"> One reason these proportions are so small is that the hand-verified data still has some halved data in it, as Figure 1 shows. For some measurements, pitch halvings are not &quot;errors&quot; at all, because they directly reflect the information in the speech signal. When the speech from speaker &quot;JT&quot; of Figure 1 was corrected for halving, 36% of the ratios between the hand-verified and the automatic data were between 1.7 and 2.2.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Detection of &quot;Rise&quot;/&quot;Falls&quot; </SectionTitle>
<Paragraph position="0"> This section reports the results of applying a simple classification rule to the different pitch methods. The idea comes from Daly (1994). Often, the last label in a ToBI-labeled utterance is a final boundary tone. This was the case for 320 utterances, and an association was made between the &quot;H%&quot; (high) boundary tone and a &quot;Rise&quot;, and between the &quot;L%&quot; (low) boundary tone and a &quot;Fall&quot;. When the author listened to these utterances, thirteen were ruled out as not contributing a readily perceived tone.</Paragraph>
<Paragraph position="1"> This coarse classification is a first approximation towards a perceptually based evaluation of pitch trackers that focuses on a section of an utterance considered linguistically special (Pierrehumbert and Hirschberg, 1990).</Paragraph>
<Paragraph position="2"> The last part of an utterance can signal a user's intention, such as asking a question.</Paragraph>
<Paragraph position="3"> To classify final tones, first the average pitch value of the last voiced region, &quot;avgL&quot;, and the average pitch value of the remaining voiced regions, &quot;avgR&quot;, were calculated. Next, the longest slope for the last voiced section, &quot;slopeL&quot;, was calculated. Where &quot;avgL&quot; was greater than &quot;avgR&quot;, or &quot;slopeL&quot; was positive, a final high tone was classified. Where &quot;avgL&quot; was less than &quot;avgR&quot;, or &quot;slopeL&quot; was negative, a final low tone was classified.</Paragraph>
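A minimal Python sketch of this rule follows. The input representation (each utterance as a list of voiced regions, each region a list of F0 values at 10 ms intervals) is an assumption for illustration, as are two details the text leaves open: the &quot;longest slope&quot; of the last voiced section is approximated here by a least-squares slope over that region, and when the high-tone and low-tone conditions conflict, the high-tone rule is applied first.

# Sketch of the final-tone ("Rise"/"Fall") classification rule described above.
# Assumptions (not from the paper): voiced regions are lists of F0 values in Hz
# at 10 ms intervals; "slopeL" is approximated by a least-squares slope; the
# high-tone condition takes precedence when the two conditions conflict.
def mean(values):
    return sum(values) / len(values)

def region_slope(f0_values, frame_s=0.01):
    """Least-squares slope (Hz per second) over one voiced region."""
    n = len(f0_values)
    if n < 2:
        return 0.0
    times = [i * frame_s for i in range(n)]
    t_bar, f_bar = mean(times), mean(f0_values)
    num = sum((t - t_bar) * (f - f_bar) for t, f in zip(times, f0_values))
    den = sum((t - t_bar) ** 2 for t in times)
    return num / den

def classify_final_tone(voiced_regions):
    """Return 'Rise' or 'Fall' for an utterance given its voiced regions."""
    last = voiced_regions[-1]
    rest = [f for region in voiced_regions[:-1] for f in region]
    avg_l = mean(last)                     # "avgL": mean F0 of the last voiced region
    avg_r = mean(rest) if rest else avg_l  # "avgR": mean F0 of the remaining regions
    slope_l = region_slope(last)           # "slopeL": slope of the last voiced region
    if avg_l > avg_r or slope_l > 0:
        return "Rise"  # final high tone (H%)
    return "Fall"      # final low tone (L%)

if __name__ == "__main__":
    # Toy utterance: earlier regions around 200 Hz, final region rising sharply.
    regions = [[210.0, 205.0, 200.0], [195.0, 190.0], [150.0, 170.0, 190.0, 210.0]]
    print(classify_final_tone(regions))  # expected: Rise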
<Paragraph position="4"> This combination of slope calculations and simple comparisons was an improvement over the method used in Murray (2001). No other study of this magnitude (the hand labelings yielded roughly 100,000 data points) has been published that combines wideband spectrograms and signal shape to hand-measure pitch tracks of conversational speech. Section 2 showed that for many cases the outputs of the two methods are similar. The hand-verified data could be used to closely examine contexts where a pitch tracker predicts a subharmonic of the perceived pitch.</Paragraph>
<Paragraph position="5"> More sophisticated tone classification rules, beyond this preliminary one, could be developed once the accuracy of pitch measurement on conversational speech has been improved.</Paragraph>
<Paragraph position="6"> Table 1 below shows the results of this simple classification with respect to the hand-verified pitch measurements and the automatic ones, along with p-values from a paired t-test. Overall, the hand-verified measurements performed better in predicting rises and falls, at the p<.001 level of significance.</Paragraph>
<Paragraph position="7"> The preliminary classification rule slightly favored female speech over male speech.</Paragraph> </Section> </Paper>