<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1087"> <Title>Language Identification via Large Vocabulary Speaker Independent Continuous Speech Recognition</Title> <Section position="3" start_page="0" end_page="438" type="metho"> <SectionTitle> 2. THEORETICAL FRAMEWORK </SectionTitle> <Paragraph position="0"> We briefly review the theoretical background described in our earlier papers [3] and [4]. Our approach to the message classification problem - for topic, speaker, or language identification - is based on modelling speech as a stochastic process. We assume that a given stream of speech is generated by one of several possible stochastic sources, one corresponding to each of the languages (or topics or speakers) in question. We are faced with the problem of deciding, based on the acoustic data alone, which is the true source of the speech.</Paragraph> <Paragraph position="1"> Standard statistical theory provides us with the optimal solution to such a classification problem. We denote the string of acoustic observations by A and introduce the random variable T to designate which stochastic model has produced the speech, where T may take on values from 1 to n for the n possible speech sources. If we let P_i denote the prior probability of stochastic source i and assume that all classification errors have the same cost, then we should choose the source T = î for which î = argmax_i P_i P(A | T = i). We assume, for the purposes of this work, that all prior probabilities are equal, so that the classification problem reduces simply to choosing the source i for which the conditional probability of the acoustics given the source is maximized.</Paragraph> <Paragraph position="2"> In principle, to compute each of the probabilities P(A | T = i) we would need to sum the joint probability of the acoustics and a word sequence over all possible word sequences W:</Paragraph> <Paragraph position="3"> P(A | T = i) = Σ_W P(A, W | T = i).</Paragraph> <Paragraph position="4"> In practice, such a collection of computations is unwieldy, so to limit the computational burden we introduce a simplifying approximation. Instead of computing the full probability P(A | T = i), we approximate the sum by its largest term: the joint probability of A and the single most probable word sequence W = W_i^max. Of course, generating such an optimal word sequence is exactly what speech recognition is designed to do. Thus, we could imagine running n different speech recognizers, one trained in each of the n languages, and then comparing the resulting probabilities P(A, W_i^max | T = i) corresponding to each of the n optimal transcriptions W_i^max. The speech would then be assigned to the language whose recognizer produced the best score.</Paragraph> <Paragraph position="5"> This approach still requires us to make multiple recognition passes across the test speech, one pass for each stochastic source. In the cases of topic and speaker identification studied earlier, we were able to further limit the demand on the recognizer by producing a single "best" transcription W = W_max, using a speaker-independent, topic-independent recognizer, to approximate the optimal transcriptions produced by each of the stochastic sources T = i. The corresponding probabilities P(A, W_max | T = i) were then computed by rescoring this "best" transcription using either topic-specific language models in the case of topic identification, or speaker-specific acoustic models for speaker identification. (See the above-cited articles for further details.) For the problem of language identification, we do not have the option of obtaining a single "language-independent" transcription: the transcription depends inextricably on the language we are recognizing. Thus it would appear that in this case we are forced to run several recognition passes on each test utterance, one for each language in question, or at the very least to perform the recognition using a recognizer capable of running several sets of models/languages simultaneously, allowing the best-performing language to threshold hypotheses from the poorer-performing ones.</Paragraph>
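As a concrete illustration of the classification rule above, the following sketch (in Python; no code appears in the original paper) assigns a test utterance to the language whose recognizer yields the best Viterbi-approximated score. The `recognizers` mapping and its interface are hypothetical stand-ins for the per-language recognition passes described in the text.

```python
# Hypothetical sketch of the n-recognizer decision rule: each recognizer is
# assumed to return log P(A, W_max | T = i) for its single best word sequence.
# The interface is illustrative, not that of any actual recognition system.

def classify_language(acoustics, recognizers, log_priors=None):
    """Assign the utterance to the language whose recognizer scores best."""
    best_language, best_score = None, float("-inf")
    for language, recognize in recognizers.items():
        score = recognize(acoustics)            # log P(A, W_max | T = language)
        if log_priors is not None:
            score += log_priors[language]       # add log P_i if priors are unequal
        if score > best_score:
            best_language, best_score = language, score
    return best_language
```

With equal priors, as assumed in the text, the `log_priors` term drops out and the decision reduces to a straight comparison of recognizer scores.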
<Paragraph position="6"> We are currently working to develop parallel recognizers trained on telephone-quality speech in a number of languages, which should allow us to perform exactly this experiment. This effort is described in more detail below. While this development effort is under way, we are exploring the possibility of performing two-language discrimination using a single recognizer trained in one of the languages. There are several ways of using the theory above to construct a one-recognizer test. Using two recognizers, one trained in each language, we would estimate P(A | T = i) for each of the two languages and then, as described above, assign the speech sample to the recognizer producing the best score. Alternatively, we could look at the log likelihood ratio</Paragraph> <Paragraph position="7"> S = log [ P(A | T = 1) / P(A | T = 2) ]</Paragraph> <Paragraph position="8"> and make the assignment based on a threshold S = S_0, assigning the sample to language #1 if S > S_0, and to language #2 otherwise. With only one recognizer, trained, say, in language #1, we could simply impose a threshold on log P(A | T = 1) alone, assigning the speech sample to language #1 if the score was good enough and to language #2 otherwise. This naive solution suffers from a number of problems, most significantly that the recognition score depends on many variables unrelated to the language - such as speaker, channel, or phonetic content - that are not properly controlled for without the normalizing effect of the denominator in the likelihood ratio.</Paragraph> <Paragraph position="9"> In the experiments described below, we have explored the possibility of controlling for these confounding factors in the acoustics by introducing a normalization based on the acoustics of individual speech frames. In Dragon's speech recognition system, the acoustics for each frame are represented in terms of a feature vector, and the recognizer's acoustic models consist in part of probability distributions on the occurrence of these feature vectors. We refer to these models, the output distributions for nodes of our hidden Markov models, as PELs (for "phonetic elements"). In normal speech, the PEL sequences we expect to see are constrained by the phonemic sequences within the words in the recognizer's vocabulary, but as a group the PELs should provide good coverage of the region of acoustic parameter space where speech data lie. To normalize the recognition scores for the one-recognizer tests, we compute a second score using, for each speech frame, the probability corresponding to whichever PEL model - unconstrained by word-level hypotheses - best matches the acoustics in that frame. The product of these frame-by-frame probabilities provides a second score (referred to below as the "maximal acoustic score") that can be used as the denominator in the log likelihood ratio above. Presumably, when the speech being recognized is in the language of the recognizer, this optimal frame-by-frame PEL sequence should be reasonably close to the true PEL sequence, but when the language is different the maximal acoustic score should be far better than the score produced by the recognizer.</Paragraph>
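The following sketch illustrates the maximal acoustic score and the normalized score just described, assuming per-frame PEL log probabilities are available as a NumPy array. The array layout, sign conventions, and per-frame averaging are assumptions made for the illustration, not details of Dragon's implementation.

```python
import numpy as np

def maximal_acoustic_score(frame_pel_logprobs):
    """Average negative log probability of the best-matching PEL per frame.

    frame_pel_logprobs: assumed array of shape (n_frames, n_pels) holding the
    log probability of each frame under every PEL output distribution.
    """
    logprobs = np.asarray(frame_pel_logprobs, dtype=float)
    best_per_frame = logprobs.max(axis=1)      # best PEL, unconstrained by word hypotheses
    return -float(best_per_frame.mean())

def normalized_score(recognizer_score, frame_pel_logprobs):
    """R - M: recognizer score minus maximal acoustic score.

    recognizer_score: per-frame negative log probability from the recognizer.
    A small difference suggests the utterance fits the recognizer's language;
    a language-#1 decision would compare this value against a tuned threshold.
    """
    return recognizer_score - maximal_acoustic_score(frame_pel_logprobs)
```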
<Paragraph position="10"> This normalization using best-matching PELs captures some sense of how well the acoustic signal fits the recognizer's models, independent of constraints imposed by the words in the language we are recognizing. Thus, we expect it to help minimize sources of variability unrelated to differences between languages. However, it may be that different languages cover somewhat different parts of acoustic parameter space and, as we shall see below, this normalization may have the undesirable side effect of normalizing away that language-discriminating information as well.</Paragraph> <Paragraph position="11"> The scores produced by the language identification system are negative log probabilities, normalized by the number of frames in the utterance. Thus, in practice, the log likelihood ratio translates to a simple difference of recognizer (or recognizer and maximal acoustic) scores.</Paragraph> </Section> <Section position="4" start_page="438" end_page="439" type="metho"> <SectionTitle> 3. INITIAL EXPERIMENTS </SectionTitle> <Paragraph position="0"> Our approach to language identification depends crucially on the existence of a large vocabulary continuous speech recognition system, so in order to test the feasibility of our language identification strategy, we turned to our primary LVCSR system, the Wall Street Journal recognizer developed under the ARPA SLS program.</Paragraph> <Paragraph position="1"> This recognition system has been described extensively elsewhere (see, for example, [5] and [6]). We review its basic properties here.</Paragraph> <Paragraph position="2"> The recognizer is a time-synchronous, hidden Markov model (HMM) based system. It makes use of a basic set of 32 signal-processing parameters: 1 overall amplitude term, 7 spectral parameters, 12 mel-cepstral parameters, and 12 mel-cepstral differences. Our standard practice is to employ an IMELDA transform [7], a transformation constructed via linear discriminant analysis to select directions in parameter space that are most useful in distinguishing between designated classes while reducing variation within classes. For speaker-independent recognition we choose directions which maximize the average variation between phonemes while being relatively insensitive to differences within the phoneme class, such as might arise from different speakers, channels, etc. Since the IMELDA transform generates a new set of parameters ordered with respect to their value in discriminating classes, directions with little discriminating power between phonemes can be dropped. We used only the top 16 IMELDA parameters for speaker-independent recognition, divided into four 4-parameter streams. For speaker-independent recognition, we also normalize the average speech spectra across utterances via blind deconvolution prior to performing the IMELDA transform, in order to further reduce channel differences.</Paragraph>
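For readers unfamiliar with such transforms, the sketch below shows a generic linear-discriminant projection in the spirit of the IMELDA transform, using scikit-learn as a stand-in. The function names, the use of scikit-learn, and the labeling of frames by phoneme class are assumptions for illustration only; the actual IMELDA construction is described in [7].

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def fit_imelda_like_projection(frames, phoneme_labels, n_keep=16):
    """Fit an LDA projection onto the top n_keep discriminant directions.

    frames:         assumed array of shape (n_frames, 32) signal-processing parameters
    phoneme_labels: assumed array of shape (n_frames,) phoneme-class labels
    n_keep must not exceed n_classes - 1 (or the number of input parameters).
    """
    lda = LinearDiscriminantAnalysis(n_components=n_keep)
    lda.fit(np.asarray(frames, dtype=float), np.asarray(phoneme_labels))
    return lda

def apply_projection(lda, frames):
    # Returns an (n_frames, n_keep) array of discriminant features, ordered by
    # how well each direction separates the designated classes.
    return lda.transform(np.asarray(frames, dtype=float))
```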
<Paragraph position="3"> Each word pronunciation is represented as a sequence of phoneme models called PICs (phonemes-in-context), designed to capture coarticulatory effects due to the preceding and succeeding phonemes. Because it is impractical to model all the triphones that could in principle arise, we model only the most common ones and back off to more generic forms when a recognition hypothesis calls for a PIC which has not been built. The PICs themselves are modelled as linear HMMs with one or more nodes, each node being specified by an output distribution - the PELs referred to above - and a double exponential duration distribution. The output distributions of the states are modelled as tied mixtures of Gaussian distributions. The recognizer used for our language identification work was trained from the standard WSJ0 SI-12 training speakers (using 7200 sentences in all, totalling about 16 hours of speech data). Because Dragon's in-house recordings are made at 12 kHz, rather than the WSJ standard of 16 kHz, the training data was first down-sampled to 12 kHz before training the models. For these experiments, the standard WSJ 20K vocabulary and digram language model (based on about 40 million words of newspaper text) were used.</Paragraph> <Paragraph position="4"> For the language identification test material, three bilingual Dragon employees each recorded 20 English sentences taken from a current issue of the Wall Street Journal. For Spanish data, they read 20 Spanish sentences taken from the financial section of a current issue of America Economia, a Spanish-language news magazine. The resulting test corpus thus consisted of 60 English and 60 Spanish utterances, averaging about 8 seconds in length and recorded on a Shure SM-10 microphone at a 12 kHz sample rate.</Paragraph> <Paragraph position="5"> Using the simple (unnormalized) one-recognizer strategy described above, we obtained an 83% probability of detection at the equal error point (i.e., the point where the probability of detection equals the probability of false alarm). After rescoring using the maximal acoustic score normalization, this figure improved to 95%. It is also worth noting that using the maximal acoustic score alone we obtained a result of 68%. Such a non-recognition-based strategy is similar in spirit to approaches to language identification using subword acoustic features rather than full speech recognition. The results are summarized in the first line of Table 1.</Paragraph> [Table 1 caption (partially recovered): "... Street Journal recognizer. The figures give the probability of detection at the equal error point for the recognizer score R, the maximal acoustic score M, and the normalized recognition score R - M."] <Paragraph position="6"> Inspired by the success of this initial trial on read speech data, we next turned to an assessment of performance on spontaneous telephone speech, using as test material speech drawn from the OGI corpus [8] of recorded telephone messages. This multi-lingual corpus contains "evoked monologues" from 90 speakers in each of ten languages. For our in-house testing, we selected 10 Spanish and 10 English calls from among the designated OGI training material.</Paragraph> <Paragraph position="7"> We used the "story-bt" segments of these calls, which run up to 50 seconds in length. Prior to testing, these were broken at pauses into shorter segments using an automatic "acoustic chopper". This resulted in 102 English segments and 104 Spanish segments, each less than about 10 seconds in length.</Paragraph>
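All of the results in this section are reported as the probability of detection at the equal error point. As a rough illustration of how such a figure can be estimated from per-utterance scores, here is a sketch in which the score lists, and the convention that lower scores indicate a better match to the target-language recognizer, are assumptions rather than details taken from the paper.

```python
import numpy as np

def detection_at_equal_error_point(target_scores, other_scores):
    """Probability of detection where the miss rate equals the false alarm rate.

    target_scores: assumed scores for utterances in the recognizer's language
    other_scores:  assumed scores for utterances in the other language
    Lower scores are taken to mean a better match (an assumed polarity).
    """
    target = np.asarray(target_scores, dtype=float)
    other = np.asarray(other_scores, dtype=float)
    best_gap, best_p_detect = np.inf, None
    for threshold in np.sort(np.concatenate([target, other])):
        p_detect = np.mean(target <= threshold)       # target utterances accepted
        p_false_alarm = np.mean(other <= threshold)   # non-target utterances accepted
        gap = abs((1.0 - p_detect) - p_false_alarm)   # |miss rate - false alarm rate|
        if gap < best_gap:
            best_gap, best_p_detect = gap, p_detect
    return best_p_detect
```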
<Paragraph position="8"> For our first foray into language discrimination on telephone speech, we used the same SWITCHBOARD speech recognition system used in our topic and speaker identification work. This recognizer was trained - and for topic and speaker identification, tested - on telephone conversations from the SWITCHBOARD corpus [9], collected by TI and now available through the Linguistic Data Consortium. Details of the recognizer are given in [3] and [4]; it is similar in structure to our Wall Street Journal recognizer, but was trained on only about 9 hours of conversational telephone speech. Its recognition performance even on SWITCHBOARD data is very weak, although it is still capable of extracting sufficient information to achieve good topic and speaker identification performance. When used for language identification on OGI utterances, the results were disappointing: it was unable to perform at anything better than chance levels, even when aided by the acoustic normalization scoring.</Paragraph> <Paragraph position="9"> Dragon Systems is currently engaged in an effort to collect telephone data in a number of languages using an "evoked monologue" format similar to that used for the OGI corpus. Our first efforts focussed on Spanish and, using about 3 hours of our own Spanish data and an additional 15 minutes of OGI Spanish training material, we built a rudimentary recognition system for Spanish telephone speech. It has a 5K vocabulary and a digram language model trained from 30 million words of Spanish newswire data.</Paragraph> <Paragraph position="10"> This new Spanish recognizer achieved a 72% probability of detection at the equal error point on the OGI test data when using the simple (unnormalized) recognition scores. In this case, unlike for the Wall Street Journal and SWITCHBOARD experiments, there was no advantage to using acoustic normalization techniques. Instead, using the maximal acoustic score for normalization actually degraded performance: the probability of detection dropped to only 66% at the equal error point. Interestingly, for this system, the maximal acoustic score alone did as well as the regular recognition score: 74% probability of detection at the equal error point.</Paragraph> <Paragraph position="11"> We conjectured that this behavior might be due at least in part to the fact that the Spanish recognizer, unlike the Wall Street Journal and SWITCHBOARD recognizers, did not employ our usual speaker-independent IMELDA transformation. Recall that this transform is designed to emphasize differences between phoneme classes while minimizing differences within each class, and so may well be suppressing language-informative phonetic distinctions. Acoustic normalization may help to overcome this deficiency, but may be unnecessary - or even counterproductive - with non-IMELDA models. To test this hypothesis, we re-ran the WSJ language identification test, but this time with models trained without the IMELDA transform. The results are reported in the second line of Table 1. Without IMELDA, the recognition itself was somewhat less accurate, but the language identification performance using the recognizer scores was essentially unchanged. However, as expected, the performance of the maximal acoustic score alone improved without IMELDA, even though that performance remained well below that of the full recognizer, and there was a corresponding drop in the normalized score performance.</Paragraph> </Section> </Paper>