<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1087"> <Title>Language Identification via Large Vocabulary Speaker Independent Continuous Speech Recognition</Title> <Section position="5" start_page="439" end_page="440" type="evalu"> <SectionTitle> 4. DISCUSSION AND FUTURE WORK </SectionTitle> <Paragraph position="0"> As the initial Wall Street Journal trials indicate, large vocabulary continuous speech recognition is clearly a successful strategy for language discrimination on high-quality microphone speech. Unlike some other trials of language identification on read speech, the WSJ test was designed to control for such confounding factors as speaker and channel differences. The chief drawbacks of the test were its small size and the possible bias introduced by a recognizer so closely tuned to the Wall Street Journal grammar (despite our best efforts to choose Spanish data in a matched domain). Even with these objections, the evidence for an LVCSR approach to language identification is very strong.</Paragraph> <Paragraph position="1"> The performance in the much harder domain of spontaneous telephone speech is more difficult to interpret. The preliminary testing described above differed in so many respects from the read speech experiments that it is hard to tease apart the effects without further experimentation. We look forward to exploring the roles of several components, for example: * the use of IMELDA transformations for the speech models -- As suggested by the experiments above, the use of a speaker-independent IMELDA transformation, while unquestionably improving the recognition performance, may be removing important clues about language differences. 
To take advantage of the recognition boost without sacrificing language-rich information, the best strategy may be to perform an initial recognition pass using IMELDA models to generate a transcription, but to score the transcript using non-IMELDA models, a two-pass strategy similar to that used in our speaker identification work. Indeed, we may want to use a &quot;language sensitive&quot; IMELDA transformation (i.e., one which chooses directions in acoustic parameter space most useful in distinguishing languages rather than phonemes) for the scoring pass, in much the same way that we employed a &quot;speaker sensitive&quot; IMELDA in our work on speaker classification.</Paragraph> <Paragraph position="2"> * acoustic normalization of recognition scores -- In the case of one-recognizer tests, it seems important to have some way of controlling for such sources of variation as speaker differences, which should play no role in language discrimination. The acoustic normalization described above is a simple attempt to achieve this. However, like the speaker-independent IMELDA, it may also be removing important information about which regions of acoustic space different languages inhabit.</Paragraph> <Paragraph position="3"> * recognition quality -- It is our experience from our topic and speaker work that small improvements in recognition performance can yield enormous gains in classification tasks. However, in order to take advantage of the contextual information available in large vocabulary CSR, the recognition must exceed a certain minimal level of performance. We believe that none of our telephone speech recognition systems yet achieve the recognition levels needed to demonstrate the advantage of large vocabulary CSR as a language classification engine, but that this minimal level is well within reach.</Paragraph> <Paragraph position="4"> We are now focussing attention on the task of improving our telephone speech recognizers. 
The Spanish recognizer was constructed in under a week once the Spanish telephone data became available, and it could still profit from such simple measures as further training iterations. We also hope to introduce into our telephone recognition systems such improved features as phonetically-tied mixture modelling, now used routinely in our microphone speech recognizers. The task of recognizing natural speech appears to be much more difficult than recognizing read speech and may require new techniques to address the problems of speaking rate, word contraction or fragmentation, and non-speech events.</Paragraph> <Paragraph position="5"> Dragon's telephone data collection effort is also continuing. We hope to have at least five hours of recorded telephone speech in each of seven languages by the end of 1994, with further collections scheduled for next year. This will allow us to create parallel recognition systems in a number of languages and finally run a two- (or n-) recognizer test of language identification. In particular, we will be collecting English telephone data and look forward to building a new English telephone speech recognizer more directly analogous to our current Spanish system. It should be interesting to see how this new English recognizer and the SWITCHBOARD recognizer perform on each other's data.</Paragraph> <Paragraph position="6"> We believe that these improvements should allow us to achieve the strong language identification performance we anticipate, based on our earlier work on topic and speaker identification.</Paragraph> </Section></Paper>