<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1066">
  <Title>RAPID MATCH TRAINING FOR LARGE VOCABULARIES</Title>
  <Section position="4" start_page="0" end_page="328" type="metho">
    <SectionTitle>
2. REVIEW OF THE
RAPID MATCH MODULE
</SectionTitle>
    <Paragraph position="0"> The main job of the rapid match module is to provide the recognizer with a short list of words that may begin at any particular time by looking at speech data beginning at that time and extending only a brief period into the future. To accomplish this, we first construct &amp;quot;smooth frames&amp;quot; of speech by taking a (possibly weighted) average of several frames of acoustic data. For our continuous speech recognition, we have been using three smooth frames of information, each obtained by averaging together four successive 20-millisecond frames of speech. Such smooth frames have the dual benefit of condensing the acoustic information into a much smaller number of parameters and doing so in a way that reduces the sensitivity to potential variation in phoneme duration. The number of speech frames used in calcu- null lating a smooth frame, the number of smooth frames, and the offset from one smooth frame to the next are all adjustable parameters in the rapid match module.</Paragraph>
    <Paragraph position="1"> As the smooth frames are computed, they are scored against models for word start clusters, which are groups of words whose beginnings are acoustically similar.</Paragraph>
    <Paragraph position="2"> These word start groups are formed automatically using a specialized clustering algorithm starting from smooth models for the words in the vocabulary. Clearly, this clustering of words into acoustically-similar groupings a step performed during the rapid match training - results in further efficiencies at recognition time. Each word start cluster is represented by a sequence of probability distributions, one for each smooth frame of the model. We currently assume that each probability density is a product of double exponential distributions, one corresponding to each of the smoothed acoustic parameters. Thus each smooth frame of a word start model is determined by a collection of (mean, deviation)-pairs.</Paragraph>
    <Paragraph position="3"> We reduce run-time calculations still further by allowing several word start clusters to share the same probability densities for some of their smooth frames. This second level of clustering, like the first, is performed automatically as part of the training process and results in a collection of &amp;quot;position clusters&amp;quot; used for the spelling of all word start groups.</Paragraph>
    <Paragraph position="4"> Each word of the vocabulary may belong to several different word start clusters, depending on the context in which the word finds itself. We currently generate four models for each word, based on whether the word emerges from silence or speech and whether it is followed by silence or speech. The number of smooth frames representing a word start group is determined by the lengths of its members. In our current implementation, most words have models filling all three smooth frames, but some very short words (most commonly function words like &amp;quot;the&amp;quot;, &amp;quot;to&amp;quot;, and &amp;quot;of' when embedded in continuous speech) receive models with fewer frames.</Paragraph>
    <Paragraph position="5"> During recognition, as smooth frames are generated from incoming acoustic data, they are scored against the various word start clusters using the negative log likelihood for the probability models for each group. The score for a word start group is computed as an average over the scores from each of the smooth frames in its model. For every word start group scoring within a certain threshold, the words belonging to the group are looked up, possible duplicates are removed, and a language model score for each word is added to its word start score. The list of all words whose combined score falls within a second threshold is then passed on to the recognizer for a more complete analysis.</Paragraph>
    <Paragraph position="6"> For more detaiis on the rapid match module, consult \[1\].</Paragraph>
  </Section>
  <Section position="5" start_page="328" end_page="329" type="metho">
    <SectionTitle>
3. BUILDING BETTER MODELS
</SectionTitle>
    <Paragraph position="0"> The process of creating word start groups begins from sample tokens for the words in the recognizer's vocabulary. The speech frames are averaged together into smooth frames, just as in the rapid match recognition process, and these smoothed versions are then clustered into word start groups.</Paragraph>
    <Paragraph position="1"> Until now, this process began from a single token representing the &amp;quot;average&amp;quot; behavior of each word. Dragon's word models are built up from basic building blocks called phonemes-in-context, or PICs. The representative tokens used by the rapid matcher were constructed by concatenating PIC tokens built by means of a linear alignment routine. Through linear stretching and shrinking operations, examples of the desired phoneme were normalized to a common length and then the acoustic parameters averaged together on a frame-by-frame basis. (See \[2\] for a more detailed description of PIC models and the construction of aligned tokens.) Unfortunately, in the course of alignment, any usable information about the variability of frame parameters is lost. Although the models formed in this way were sufficient for a task like the marnrnography study, the strategy suffers from three main deficiencies: Because each word there is no way to parameters. Such during adaptation model is based on a single token, measure the variability of model estimates must be incorporated of the models.</Paragraph>
    <Paragraph position="2"> Because the token is constructed from a linear alignment of phonemic units, the model rigidly expects a particular phoneme in a particular frame and so is relatively intolerant of variation in phoneme duration. While the alignment process involves blending different behaviors within the phonemic unit, the representation does not allow for mixing frames involving different PICs. Averaging together several successive acoustic frames to create the &amp;quot;smooth frames&amp;quot; used in rapid match softens this effect, but cannot eliminate it.</Paragraph>
    <Paragraph position="3"> Finally, because the token is based on the reference speaker's models, extensive adaptation is necessary to adjust the model parameters to other speakers.</Paragraph>
    <Paragraph position="4"> And while adaptation can successfully modify values for the (mean, deviation)-pairs representing word start clusters, it cannot alter the spelling of word start clusters by position clusters nor the assignment of words to word start groups. Both of these steps  are performed once and for all based On the reference speaker's models.</Paragraph>
    <Paragraph position="5"> Our new method for building rapid match models overcomes these difficulties by working directly from HMMs representing the words for each speaker's vocabulary. In the new rapid match training, we begin from the phonemic spelling of each word and, using the speaker's own models, unpack the sequence of nodes representing each PIC. We then generate a collection of sample tokens by simulated traversals of this node sequence. At each node, we determine the duration of the stay by a random draw from a double exponential duration distribution and then, for each of the resulting number of frames, generate parameter values by independent draws from the output distribution for the node. The resulting collection of sample tokens exhibits all the variability one would expect to see in actual occurrences of the word. These tokens are then converted to their smoothed forms, the smoothed versions averaged together smooth frame by smooth frame to obtain both means and deviations for the new word model, and the usual clustering algorithm can then be followed.</Paragraph>
    <Paragraph position="6"> Of course, the sample tokens generated by independent draws from the output distributions are probably not themselves accurate representations of actual word occurrences; we would expect a high degree of correlation between successive frames in actual speech. But because these samples are processed through two rounds of averaging - the first combining successive acoustic frames into a single smooth frame and the second averaging smooth frames from the many sample tokens - we expect the resulting means to be fairly well estimated. On the other hand, our assumption of independence of frames probably leads to an underestimate of the true frame deviations. For example, in the extreme (and purely hypothetical) case that the four successive acoustic frames were in fact identical in actual speech, our random draws would underestimate the deviations by a factor of two.</Paragraph>
    <Paragraph position="7"> In general, we expect to be off by a considerably smaller factor, but we have found that performance of our new models is improved if we scale up all our estimated deviations by a factor in the range 1.3-1.5.</Paragraph>
  </Section>
  <Section position="6" start_page="329" end_page="330" type="metho">
    <SectionTitle>
4. INITIAL RESULTS ON THE
WALL STREET JOURNAL TASK
</SectionTitle>
    <Paragraph position="0"> Our goal is to ensure that the correct word candidate is returned by the rapid matcher in the list of the top 100-200 words. We do not require that it be the highest ranked - the recognizer will do the hard work of analyzing the top candidates in detail - but it is essential that the correct candidate not be excluded from this analysis.</Paragraph>
    <Paragraph position="1"> Therefore, our evaluation of the new rapid match training program concentrates on performance in this range.</Paragraph>
    <Paragraph position="2"> In order to assess how close we've come to meeting our goal, we have been using an evaluation package which ranks the word candidates nominated by the rapid matcher in any given speech frame. By running the recognizer in a mode where it knows the correct transcription for a text, we can obtain a segmentation of each utterance, marking the frame in which each word is most likely to begin. We then use the evaluation package to look at what rank the correct word has in the list of candidates passed on to the recognizer in that frame.</Paragraph>
    <Paragraph position="3"> To provide an initial reading on the new rapid match training and to help set clustering thresholds, we first looked at its performance on the mammography task.</Paragraph>
    <Paragraph position="4"> While we did not expect the new routine to improve noticeably on our earlier performance - it was, after all, a relatively easy task involving a limited vocabulary and recorded by our reference speaker - it was reassuring to find that the new routine, like the old, returned the correct word in the list of the top 100 candidates over 99% of the time for a test set roughly 4300 words long, and by the top 200 words, the correct candidate failed to appear on the list only about 1 time in 1000.</Paragraph>
    <Paragraph position="5"> We then moved on to the more challenging Wall Street Journal task. Here we built new rapid match models for the 5K verbalized punctuation vocabulary for 5 of our 12 speakers, ranging from our worst performer to our best, and compared them to the original models which had already been adapted to each speaker. (For a description of our overall performance on the Wall Street Journal task, see the companion article \[3\].) The results are summarized in Table 1, which reports what percent of the time the correct word was included in the word candidate list returned by rapid match, as a function of the length of the list. The test sets involved about 40 sentences totaling somewhat over 700 words per speaker.</Paragraph>
    <Paragraph position="6"> They were drawn from the 5K verbalized punctuation speaker-dependent Wall Street Journal corpus. In all cases the new models improved significantly over the old, usually cutting the error rate by 25-50%.</Paragraph>
    <Paragraph position="7"> Although the new training method obviates the need for adaptation of models, we were curious about whether adaptation would further improve the performance of the rapid match system. We therefore have begun experimenting with adapting our new rapid match models.</Paragraph>
    <Paragraph position="8"> Preliminary results indicate that we can expect to gain about another percentage point improvement even after a single round of adaptation. A sample is given in Table 2, for speaker 00A.</Paragraph>
    <Paragraph position="9"> We have also begun building new rapid match models for  task.</Paragraph>
    <Paragraph position="10"> the 20K vocabulary. Results for a sampling of speakers on the 20K task are given in Table 3. Clearly the difficulty of the rapid match task grows significantly with vocabulary size. However, it should be noted that while the job of creating sufficiently good models grows enormously as the vocabulary grows, the burden at recognition time does not: the number of word start clusters grows much more slowly than the vocabulary size both because we allow the clustering thresholds to increase gradually with vocabulary size and because large vocabularies permit more sharing of cluster models. For example, the number of word start clusters for the mammography task (with a vocabulary of 860 words) was about 1500, for the 5K Wall Street Journal task about 5000 clusters, and for the 20K vocabulary about 6000 clusters. (Recall that each word is given four contextdetermined models, so the actual number of word models is four times the vocabulary size.) A word should be said about the relationship between results on these evaluation tests and actual recognition performance. We have found that even if a word has a poor rank in the frame in which the recognizer ideally expects the word to begin, a good score in a neighboring frame will often allow the recognizer to get the word  match models for 00A.</Paragraph>
    <Paragraph position="11"> right. On the other hand, if a word fails to be passed on to the recognizer within a small window around the optimal word start, performance will suffer. Being deprived of the correct word, the recognizer is forced to follow a false path through the web of sentence hypotheses, usually resulting in two or three word errors. Thus, even small improvements to the rapid match module can have a significant impact at recognition time. As an example of the relationship between the rapid match evaluation results and actual recognition performance, Table 4 gives rapid match results for both old and new training models for our in-house speaker SAL on the Wall Street Journal 5K test set, along with word error rates in the related recognition tests.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML