<?xml version="1.0" standalone="yes"?>
<Paper uid="W02-0714">
<Title>IMPROVEMENTS IN NON-VERBAL CUE IDENTIFICATION USING MULTILINGUAL PHONE STRINGS</Title>
<Section position="3" start_page="0" end_page="0" type="metho">
<SectionTitle> 2. THE MULTILINGUAL PHONE STRING APPROACH </SectionTitle>
<Paragraph position="0"> The basic idea of the multilingual phone string approach is to use phone strings produced by different context-independent phone recognizers instead of traditional short-term acoustic vectors [6]. For the classification of an audio segment into one of N classes of a specific non-verbal cue, M such phone recognizers together with M x N phonotactic N-gram models produce an M x N matrix of features. A best class estimate is made based solely on this feature matrix. The process relies on the availability of M phone recognizers and on the training of M x N N-gram models on their output.</Paragraph>
<Paragraph position="1"> By using information derived from phonotactics rather than directly from acoustics, we expect to cover speaker idiosyncrasies and accent-specific pronunciations. Since this information is provided by complementary phone recognizers, we anticipate greater robustness under mismatched conditions. Furthermore, the approach is largely language independent, since the recognizers are trained on data from different languages.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section">
<SectionTitle> 2.1. Phone Recognition </SectionTitle>
<Paragraph position="0"> The experiments presented here were conducted using two versions of phone recognizers borrowed without modification from the GlobalPhone project [5]. All were trained using our Janus Recognition Toolkit (JRTk).</Paragraph>
<Paragraph position="2"> The first set of phone recognizers, which we refer to as our baseline, includes recognizers for Mandarin Chinese (CH), German (DE), French (FR), Japanese (JA), Croatian (KR), Portuguese (PO), Spanish (SP) and Turkish (TU).</Paragraph>
<Paragraph position="3"> For each language, the acoustic model consists of a context-independent 3-state HMM system with 128 Gaussians per state. The Gaussians are trained on 13 Mel-scale cepstral coefficients with first and second order derivatives and power. Following cepstral mean subtraction, linear discriminant analysis reduces the input vector to 32 dimensions.</Paragraph>
<Paragraph position="4"> The second set consists of extended phone recognizers, available in 12 languages: Arabic (AR), Korean (KO), Russian (RU) and Swedish (SW) are available in this set in addition to the languages named above for the baseline set. The 12 new phone recognizers were derived from an improved generation of context-dependent LVCSR systems, which also include vocal tract length normalization (VTLN) for speaker normalization. For decoding, we used an unsupervised scheme to find the best warp factor for a test speaker and calculated a Viterbi alignment based on that speaker's best warp factor. To improve system speed, we reduced the number of Gaussians per state from 128 to 16; in addition, the feature dimension was halved from 32 to 16 using linear discriminant analysis. Figure 1 shows the phone error rates in relation to the number of modeled phones for eight languages.
The error rate correlates with the number of phones used to model the language; Turkish seems to be an exception to this finding. The error analysis showed that this is due to a very high substitution rate.</Paragraph>
<Paragraph position="5"> [Figure 2: training of a class-specific phonotactic model -- audio from one class is decoded by a phone recognizer into phone strings, on which the phonotactic model is trained.]</Paragraph>
<Paragraph position="6"> In classifying a non-verbal cue c into one of N classes c_j, our feature extraction scheme requires M x N distinct phonotactic models PM_{i,j}, 1 <= i <= M and 1 <= j <= N, one for each combination of phone recognizer PR_i with output class c_j. PM_{i,j} is trained on the phone strings produced by phone recognizer PR_i on c_j training audio, as shown in Figure 2. This procedure does not require transcription at any level.</Paragraph>
<Paragraph position="7"> During classification, each of the M phone recognizers {PR_i}, as used for phonotactic model training, decodes the test audio segment. Each of the resulting M phone strings is scored against each of the N phonotactic models {PM_{i,j}}. This results in a perplexity matrix PP, whose element PP_{i,j} is the perplexity produced by phonotactic model PM_{i,j} on the phone string output of phone recognizer PR_i. Although we have explored some alternatives, our generic decision algorithm is to propose the class estimate ĉ = argmin_j sum_i PP_{i,j}, i.e. the class with the lowest summed (equivalently, average) perplexity across recognizers. Figure 3 depicts this procedure, which we refer to as MPM-pp.</Paragraph>
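To make the MPM-pp procedure concrete, here is a small, self-contained Python sketch. It is not the authors' implementation: the add-one-smoothed bigram model, the helper names, and the data layout are illustrative assumptions (the paper's phonotactic models are N-gram models, e.g. trigrams with Kneser/Ney backoff in the LID experiments), and phone decoding is assumed to have happened already, so each utterance is represented by M phone strings, one per recognizer. The sketch trains one model PM_{i,j} per (recognizer, class) pair and classifies a test segment by the lowest summed perplexity:

# Schematic MPM-pp classifier: M phone recognizers x N classes of phonotactic
# models, decision by lowest summed (average) perplexity over recognizers.
import math
from collections import defaultdict

class BigramPhonotacticModel:
    """Add-one-smoothed phone bigram model (stand-in for the paper's N-gram PMs)."""
    def __init__(self, phone_strings):
        self.unigrams = defaultdict(int)   # counts of left-context phones
        self.bigrams = defaultdict(int)    # counts of phone bigrams
        self.vocab = set()
        for phones in phone_strings:
            for a, b in zip(phones[:-1], phones[1:]):
                self.unigrams[a] += 1
                self.bigrams[(a, b)] += 1
                self.vocab.update((a, b))

    def perplexity(self, phones):
        logprob, n = 0.0, 0
        v = max(len(self.vocab), 1)
        for a, b in zip(phones[:-1], phones[1:]):
            p = (self.bigrams[(a, b)] + 1) / (self.unigrams[a] + v)  # add-one smoothing
            logprob += math.log(p)
            n += 1
        return math.exp(-logprob / max(n, 1))

def train_models(training_data):
    """training_data[i][j] = list of phone strings from recognizer PR_i on class c_j audio."""
    return [[BigramPhonotacticModel(strings) for strings in per_class]
            for per_class in training_data]

def classify(pm, test_phone_strings):
    """pm[i][j] = PM_{i,j}; test_phone_strings[i] = phone string from PR_i.
    Returns the class index j with the lowest summed perplexity over recognizers."""
    n_classes = len(pm[0])
    scores = [sum(pm[i][j].perplexity(test_phone_strings[i]) for i in range(len(pm)))
              for j in range(n_classes)]
    return min(range(n_classes), key=scores.__getitem__)

In this sketch the perplexity matrix PP is never materialized explicitly; summing per class is equivalent to the argmin over column sums described above.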
<Paragraph position="8"> 3. EXPERIMENTS
3.1. Speaker Identification (SID)
To investigate robust speaker identification across microphone distances, a distant-microphone database containing speech recorded at various distances has been collected at the Interactive Systems Laboratory. This database contains 30 native English speakers reading different articles. Each of the five sessions per speaker is recorded using eight microphones in parallel: one close-speaking microphone (Dis 0), one lapel microphone (Dis L) worn by the speaker, and six other lapel microphones at distances of 1, 2, 4, 5, 6, and 8 feet from the speaker. In a first experiment, we compare the performance of the MPM-pp approach using our baseline phone recognizers to that of the GMM approach. About 7 minutes of speech (approximately 5000 phones) is used for training the PMs, while one minute is used for training the GMMs. The different amounts of training data may seem to make the comparison unfair; however, the data is used for very different purposes. In the GMM approach, the data is used to train the Gaussian mixtures. In the MPM approach, the data is used solely for creating phonotactic models; no data is used to train the Gaussian mixtures of the phone recognizers. Moreover, we have found that with a fixed configuration of GMM structures, adding more training data does not lead to noticeable improvement in performance [4].</Paragraph>
<Paragraph position="9"> [Table 1 caption fragment: "... matched conditions for 10 second segments"] The GMM approach was tested on 10 seconds of audio, whereas the phone string approach was additionally tested on shorter and longer (up to one minute) segments. We report results for closed-set, text-independent speaker identification. Table 1 shows the GMM results with one minute of training data on 10 seconds of test data. It illustrates that performance under mismatched conditions degrades considerably compared to performance under matched conditions.</Paragraph>
<Paragraph position="10"> Table 2 shows the identification results using each of the 8 baseline phone recognizers individually, as well as their combination, for Dis 0 under matched conditions. This shows that multiple recognizers collectively compensate for the poor performance of single recognizers, an effect which becomes even more important for shorter test utterances. [Table caption fragment: "... matched training and testing distance"] Table 3 and Table 4 compare the identification results for all distances and different test utterance lengths under matched and mismatched conditions, respectively. Under matched conditions, training and testing data are drawn from the same microphone. Under mismatched conditions, we do not know the test segment distance; we make use of all D = 8 sets of PM_{i,j} phonotactic models, where D is the number of distances, and modify our decision rule to estimate ĉ = argmin_j min_d sum_i PP_{i,j,d}, where i is the index over phone recognizers, j is the index over speaker phonotactic models, and 1 <= d <= D. These two tables indicate that the performance of MPM-pp, unlike that of GMM, is comparable under matched and mismatched conditions.</Paragraph>
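A minimal sketch of this distance-independent decision rule, assuming the perplexities have already been computed into a D x M x N array PP (the function name, the data layout, and the min-over-distances reading of the reconstructed formula are assumptions consistent with the text, not the authors' code):

# PP[d][i][j]: perplexity of the phone string from recognizer PR_i under the model
# PM_{i,j} trained on distance-d audio; the class estimate minimizes over classes
# and over distance-specific model sets.
def classify_mismatched(PP):
    n_dist, n_rec, n_classes = len(PP), len(PP[0]), len(PP[0][0])
    best_class, best_score = None, float("inf")
    for j in range(n_classes):
        for d in range(n_dist):
            score = sum(PP[d][i][j] for i in range(n_rec))  # sum over recognizers
            if score < best_score:
                best_class, best_score = j, score
    return best_class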
<Paragraph position="13"> We conducted additional experiments to determine the impact of the improved GlobalPhone recognizers on the identification rate for this task. To that end, we used all 8 baseline recognizers and only the corresponding 8 of the 12 available improved recognizers. Table 5 compares the speaker identification rates under matched conditions for 60 seconds of audio. The comparison indicates that an improvement in phone error rate leads to slight improvements in speaker identification rate for distances Dis 0 and Dis 5. Performance decreases for Dis L, while for Dis 6 the improved recognizers outperform the baseline significantly. Overall, we cannot conclude from these results that better phone recognizers result in a higher identification rate. However, we can summarize that the improved engines show an identification performance of 93.3% or higher at all distances under matched conditions on 60 seconds of audio, in spite of the drastic reduction in acoustic model parameter dimensions.</Paragraph>
<Paragraph position="14"> [Table 5 caption fragment: "... baseline and improved phone recognizers on matched conditions for 60 seconds of audio (SID rate in %)"]</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2. Accent Identification (AID) </SectionTitle>
<Paragraph position="0"> In previous experiments on accent identification we used the MPM-pp approach to identify native and non-native speakers of English and to identify speakers of varying proficiency. [Table 6 caption fragment: "... and total length of audio for native and non-native classes"] In our current experiments, we have augmented the number of phonotactic models used to classify utterances. We decode training data from each class using the baseline phone recognizer for Chinese and rerun our original experiments with a new bank of phonotactic models in 7 languages: the original six, PR in {DE, FR, JA, KR, PO, SP}, plus {CH}. During classification, the 7 x 2 phonotactic models produce a perplexity matrix for the test utterance, to which we apply our lowest average perplexity decision rule; the class with the lower perplexity is identified as the class of the test utterance.</Paragraph>
<Paragraph position="1"> On our evaluation set of 303 utterances for 2-way classification between native and non-native speakers, classification accuracy improves from 93.7% using models in 6 languages to 97.7% using models in 7 languages.</Paragraph>
<Paragraph position="2"> An examination of the average perplexity of each class of phonotactic model over all test utterances reveals the improved separability of the classes. The average perplexity of the non-native models on non-native data is lower than the perplexity of the native models on that data, and the discrepancy between these numbers grows after adding training data decoded in an additional language.</Paragraph>
<Paragraph position="3"> The native models became less separable on average, but discriminatory power still improved overall. Table 7 shows these average perplexities for our previous and current experiments.</Paragraph>
<Paragraph position="4"> [Table 7 caption fragment: "... non-native classes using 6 phone recognizers (top) versus 7 (bottom)"] In the proficiency-level experiments, we apply the MPM-pp approach to classify utterances from non-native speakers according to assigned speaker proficiency class. The original non-native data has been labelled with the proficiency of each speaker on the basis of a standardized evaluation procedure conducted by trained proficiency raters [7], and we attempt to classify non-native speakers into three classes according to their proficiency. Class 1 represents the lowest-proficiency speakers, class 2 contains intermediate speakers, and class 3 contains the high-proficiency speakers.</Paragraph>
<Paragraph position="5"> Profiles of the testing and training data for these experiments are shown in Table 8.</Paragraph>
<Paragraph position="6"> [Table 8 caption fragment: "... total length of audio and average speaker proficiency score per proficiency class"] We have added phonotactic models trained on Chinese recognizer output to this experiment as well, and gained a small improvement over our results using models in 6 languages. Table 9 displays two confusion matrices for this task, one showing the original results and one showing the results with the added Chinese phone recognizer.</Paragraph>
<Paragraph position="7"> [Table 9 caption fragment: "... classification using 6 phone recognizers (top) versus 7 (bottom)"] Classification accuracy in the 3-way proficiency classification task improves somewhat, rising from 59% in the original experiment to 61% using the additional phone recognizer. As the confusion matrix for this experiment shows, the phonotactic models trained on Chinese output cause the system to correctly identify more of the class 2 utterances, but at the expense of some class 3 utterances which are also identified as class 2 by the new system.</Paragraph>
<Paragraph position="8"> In both 2-way and 3-way classification, the addition of a seventh phone recognizer improved classification accuracy. Like the other applications of this approach, accent identification requires no hand-transcription and could easily be ported to test languages other than English/Japanese.</Paragraph>
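As an illustration of the separability check discussed above (Table 7), the following sketch computes the average perplexity that each class's bank of phonotactic models assigns to test utterances of each true class. The data layout and function name are hypothetical, not the authors' code; separability shows up as table[(c, c)] being clearly lower than table[(c, other)]:

# Illustrative average-perplexity table in the spirit of Table 7.
from statistics import mean

def average_perplexity_table(utterances):
    """utterances: list of (true_class, {model_class: [per-recognizer perplexities]})."""
    classes = sorted({c for c, _ in utterances})
    table = {}
    for true_c in classes:
        rows = [scores for c, scores in utterances if c == true_c]
        for model_c in classes:
            # average over utterances of the mean perplexity under model_c's models
            table[(true_c, model_c)] = mean(mean(r[model_c]) for r in rows)
    return table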
<Paragraph position="9"> 3.3. Language Identification (LID)
For this task, we applied the non-verbal cue identification framework to the problem of multiclassification of four languages: Japanese (JA), Russian (RU), Spanish (SP) and Turkish (TU). We elected to use a small number of phone recognizers in languages other than the four classification languages in order to duplicate the circumstances common to our other non-verbal cue identification experiments, and to demonstrate a degree of language independence which holds even in the language identification domain. Phone recognizers in Chinese (CH), German (DE) and French (FR), with phone vocabulary sizes of 145, 47 and 42, respectively, were borrowed from the GlobalPhone project.</Paragraph>
<Paragraph position="10"> In this section, we first reiterate our accuracy results using phone recognizers drawn from the baseline set; the details of those experiments are discussed in [4]. We then compare both the identification accuracy and the realtime performance with results obtained using the improved GlobalPhone phone recognizers.</Paragraph>
<Paragraph position="11"> The data for this classification experiment, also borrowed from the GlobalPhone project but not used in training the phone recognizers, was divided as shown in Table 10. Data set 1 was used for training the phonotactic models, while data set 4 was completely held out during training and used to evaluate the end-to-end performance of the complete classifier. Data sets 2 and 3 were used as development sets while experimenting with different decision strategies.</Paragraph>
<Paragraph position="12"> [Table 10 caption fragment: "... utterances and total length of audio per language"] For phonotactics, utterances from set 1 in each language L_j in {JA, RU, SP, TU} were decoded using each of the three phone recognizers PR_i in {CH, DE, FR}, and 12 separate trigram models were constructed with Kneser/Ney backoff and no explicit cut-off.</Paragraph>
<Paragraph position="13"> We first benchmarked accuracy using our lowest average perplexity decision rule. For comparison, we constructed a separate 4-class multiclassifier, using data set 2, for each of the four durations {5s, 10s, 20s, 30s}. Our multiclassifier combined the output of multiple binary classifiers using an error-correcting output coding (ECOC) technique.</Paragraph>
<Paragraph position="14"> A class space of 4 languages induces 7 unique binary partitions. For each of these, we trained an independent multilayer perceptron (MLP) with 12 input units and 1 output unit using scaled conjugate gradients on data set 2, with early stopping on the cross-validation data set 3. In preliminary tests, we found that 25 hidden units provide adequate performance and generalization when used with early stopping. The outputs of all 7 binary classifiers were concatenated to form a 7-bit code, which, in the flavor of ECOC, was compared to our four class codewords to yield a best class estimate. Based on the total error of the best training-set and cross-validation-set weights on the cross-validation data, we additionally discarded those binary classifiers which contributed to the total error; these classifiers represent difficult partitions of the data.</Paragraph>
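The ECOC combination described above can be sketched as follows. This is an illustrative reading, not the authors' code: the MLPs are abstracted into binary_classifiers (callables mapping the 12 perplexity features to 0/1), and standard Hamming-distance decoding against the class codewords is assumed:

# Illustrative ECOC decoding for the 4-language task (MLP training omitted).
from itertools import combinations

LANGS = ["JA", "RU", "SP", "TU"]

def unique_binary_partitions(n_classes):
    """All 7 non-trivial, unique two-way splits of 4 classes (complements excluded)."""
    parts = []
    for size in range(1, n_classes // 2 + 1):
        for subset in combinations(range(n_classes), size):
            comp = tuple(c for c in range(n_classes) if c not in subset)
            if size < n_classes - size or subset < comp:  # drop mirrored duplicates
                parts.append(set(subset))
    return parts

def ecoc_codewords(partitions, n_classes):
    """Row c of the code matrix: bit b is 1 iff class c is on the '1' side of partition b."""
    return [[1 if c in p else 0 for p in partitions] for c in range(n_classes)]

def ecoc_decode(features, binary_classifiers, codewords):
    """Concatenate the binary decisions and pick the codeword at minimum Hamming distance."""
    code = [clf(features) for clf in binary_classifiers]
    dist = lambda cw: sum(a != b for a, b in zip(code, cw))
    return min(range(len(codewords)), key=lambda c: dist(codewords[c]))

For four classes, unique_binary_partitions(4) indeed yields the 7 partitions mentioned in the text (four 1-vs-3 splits and three 2-vs-2 splits).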
<Paragraph position="15"> With phone recognizers drawn from the baseline set, classification using the lowest average perplexity rule achieved 94.01%, 97.57%, 98.96% and 99.31% accuracy on 5s, 10s, 20s and 30s data respectively, while with ECOC/MLP classification accuracy improved to 95.41%, 98.33%, 99.36% and 99.89%, respectively.</Paragraph>
<Paragraph position="16"> Replacing the baseline phone recognizers with ones from the extended and improved GlobalPhone set led to classification accuracies, using lowest average perplexity, of 94.83%, 97.89%, 98.98% and 99.26% on 5s, 10s, 20s and 30s data respectively. All classification rate results are plotted in Figure 4. Comparing the lowest average perplexity results of the old and new recognizers shows that the improved recognizers yield larger gains on the short utterance segments, while for the 30s segments the results are slightly worse.</Paragraph>
<Paragraph position="17"> The runtime performance of the phone recognition component was assessed, using set 1 of the data in Table 10, on a dual-CPU 933 MHz Pentium III machine with 512 MB of memory and 900 MB of swap under low load. Realtime factors are presented for both the baseline set and the improved set of phone recognizers in Table 11.</Paragraph>
<Paragraph position="18"> [Table 11 caption fragment: "... recognizers"] While the difference in classification accuracy between the baseline identification system and the one built using the improved phone recognizers is perhaps not statistically significant, the improvements in runtime speed are dramatic. Across vastly different phone vocabulary sizes, the improvement is almost a whole order of magnitude.</Paragraph> </Section> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 4. LANGUAGE DEPENDENCIES </SectionTitle>
<Paragraph position="0"> Implicit in our non-verbal cue classification methodology is the assumption that phone strings originating from phone recognizers trained on different languages yield crucially complementary information. In [4] we performed some initial experiments to explore the influence of variation among phone recognizers and how the identification rate varies with the number of phone recognizers used. In this section we report on two follow-up experiments for the speaker identification task intended to answer these questions.</Paragraph>
<Paragraph position="1"> 4.1. Multiple languages vs. single-language multiple engines
We conducted one set of experiments to investigate whether the success of the multilingual phone string approach stems from the fact that different languages contribute useful classification information, or simply from the fact that different recognizers provide complementary information. If the latter were the case, a multi-engine approach in which phone recognizers are trained on the same language but on different channel or speaking-style conditions might do a comparably good job.
To test this hypothesis, we used three different phone recognizers, all trained on English data but under different channel conditions (telephone, channel-mix, clean) and different speaking styles (highly conversational, spontaneous, planned) [4].</Paragraph>
<Paragraph position="2"> The experiments were carried out under matched conditions at all distances for 60 seconds of audio on the speaker identification task. To compare the three English engines to the multilingual engines, we generated all possible language triples and calculated the average, minimum and maximum performance over all triples, for both recognizer versions: the baseline 8-language phone recognizers (C(8,3) = 56 triples) and the improved 12-language ones (C(12,3) = 220 triples). The results are given in Table 12.</Paragraph>
<Paragraph position="3"> [Table 12 caption fragment: "... recognizers (SID rates in %)"] The improved versions of the multilingual phone recognizers give significantly better average results for most of the distances. The results also show that the multiple English engine approach in almost all cases lies within the range of the multilingual approach. However, the average performance of the multilingual approach using the improved engines always outperforms the multiple English engine approach, indicating that most of the language triples achieve better results than the single-language multiple engines.</Paragraph>
<Paragraph position="4"> In summary, Table 12 shows that the best performance of the multilingual approach always exceeds that of the multiple English engine approach; moreover, in the case of the 12 improved GlobalPhone engines, even the average always outperforms the multiple English engine approach. From these results, we conclude that multiple English recognizers provide less useful information for the classification task than multilingual phone recognizers do, at least for the given choice of multiple engines in the context of speaker identification. The fact that the multiple engines were trained on English, i.e. the same language spoken in the speaker identification task, whereas the multilingual recognizers were trained on 12 languages other than English, makes the multilingual approach even more appealing, since it indicates great potential for portability to non-verbal cue identification in other languages.</Paragraph>
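A schematic of the triple-wise evaluation described in this subsection; the helper sid_rate, standing for running a full SID experiment with a given recognizer subset, is hypothetical, while the language sets come from the text:

# Average / min / max SID rate over all language triples (Section 4.1 setup).
from itertools import combinations
from statistics import mean

BASELINE = ["CH", "DE", "FR", "JA", "KR", "PO", "SP", "TU"]    # C(8,3)  = 56 triples
IMPROVED = BASELINE + ["AR", "KO", "RU", "SW"]                 # C(12,3) = 220 triples

def triple_statistics(languages, sid_rate):
    """sid_rate(triple) -> identification rate for that recognizer subset (hypothetical)."""
    rates = [sid_rate(triple) for triple in combinations(languages, 3)]
    return mean(rates), min(rates), max(rates)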
<Paragraph position="5"> 4.2. Number of involved languages
In this set of experiments, we investigated the influence of the number of phone recognizers on speaker identification performance. These experiments were performed with the improved version of the GlobalPhone phone recognizers in 12 languages. Figure 5 plots the speaker identification rate against the number k of languages used in the identification process, under matched conditions on 60 seconds of audio, for all distances. The performance given for each distance is an average over the C(12,k) language k-tuples. The results indicate that the average speaker identification rate increases for all distances with the number of involved phone recognizers. For some distances, a saturation effect occurs beyond 6 languages (distances 0 and 1); for other distances, even the 12th language has a positive effect on the average performance (distances 4, 6, and L). The increasing average indicates that the probability of finding a suitable language tuple which optimizes performance increases with the number of available languages. We also analyzed whether the increasing performance is related to the total number of phones used in the classification process rather than to the number of different engines, but did not find evidence for a strong correlation.</Paragraph>
</Section> </Paper>