<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0714"> <Title>IMPROVEMENTS IN NON-VERBAL CUE IDENTIFICATION USING MULTILINGUAL PHONE STRINGS</Title> <Section position="5" start_page="0" end_page="0" type="concl"> <SectionTitle> 5. CONCLUSIONS </SectionTitle> <Paragraph position="0"> We have investigated the identification of non-verbal cues from speech, namely speaker, accent, and language.</Paragraph> <Paragraph position="1"> For these tasks, we presented a joint framework that uses phone strings, derived from different phone recognizers, as intermediate features and performs classification decisions based on their perplexities. Our good identification results validate this concept, indicating that multilingual phone strings can be successfully applied to the identification of various non-verbal cues, such as speaker, accent, and language. The evaluation on our distant-microphone database demonstrated the robustness of the approach, achieving a 96.7% speaker identification rate on 10 seconds of audio from 30 speakers under mismatched conditions, clearly outperforming GMMs at large distances. We achieved similar results using phone recognizers with a drastically reduced number of parameters. Furthermore, in the speaker identification domain we showed that, on average, the use of phone recognizers trained on different languages leads to greater accuracy than does the use of multiple same-language phone recognizers.</Paragraph> <Paragraph position="2"> Our classification framework performed equally well in the domains of accent and language identification. 
We achieved 97.7% discrimination accuracy between native and non-native English speakers, showing that the addition of a seventh recognizer to this task, namely Chinese, reduced the error rate by 63%.</Paragraph> <Paragraph position="3"> For language identification, we obtained 95.5% classification accuracy on utterances 5 seconds in length and up to 99.89% on longer utterances, showing additionally that some reduction of error is possible using decision strategies that rely on more than just the lowest average perplexity.</Paragraph> <Paragraph position="4"> Furthermore, accuracy was shown to improve, at least for short utterance durations, when using phone recognizers that are more accurate but constrained to a much smaller parameter space. While retaining classification accuracy, these phone recognizers run faster than real time, outperforming the baseline in speed by almost 90%.</Paragraph> <Paragraph position="5"> The speaker and accent identification experiments were carried out on English data, although none of the applied phone recognizers were trained on or adapted to spoken English. Similarly, our language identification experiments were run on languages not presented to the phone recognizers during training. The language-independent nature of our experiments suggests that they could be successfully ported to non-verbal cue classification in other languages.</Paragraph> </Section> </Paper>