File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/89/h89-1016_metho.xml
Size: 18,886 bytes
Last Modified: 2025-10-06 14:12:18
<?xml version="1.0" standalone="yes"?> <Paper uid="H89-1016"> <Title>RECENT PROGRESS IN THE SPHINX SPEECH RECOGNITION SYSTEM</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> SPHINX is a large-vocabulary, speaker-independent, continuous speech recognition system based on discrete hidden Markov models (HMMs) with LPC-derived parameters. In order to deal with the problem of speaker independence, we added knowledge to these HMMs in several ways. We represented additional knowledge through the use of multiple codebooks. We also enhanced the recognizer with word duration modeling. In order to model co-articulation in continuous speech, we introduced the use of functionword-dependent phone models, and generalized triphone models.</Paragraph> <Paragraph position="1"> More recently, we have made considerable progress with the SPHINX System. We reformulated the generalized triphone clustering algorithm as a maximum-likelihood procedure, and carried out some experiments with generalized triphones. We also implemented and evaluated the modeling of function phrases, and between-word coarticulation modeling rising generalized triphones. The latter experiment reduced SPHINX's error rate by 24-44%. We modified the corrective training algorithm \[1\] for speaker-independent, continuous speech recognition. Corrective training reduced SPHINX's error rate by 20-24%.</Paragraph> <Paragraph position="2"> In this paper, we will describe all components of the SPHINX System, with emphasis on the recent improvements. The SPHaNX System has been described in \[2\] and \[3\]. Publications on the recent improvements will be forthcoming.</Paragraph> <Paragraph position="3"> On the 991-word DARPA resource management task, SPHINX achieved speaker-independent word recognition accuracies of 82% and 96%, with grammars of perplexity 991 and 60, respectively. Results with the 1988 and 1989 test data resulted in 78 and 76% without grammar, and 96% and 94% with the word pair grammar. null</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. Speech Representation </SectionTitle> <Paragraph position="0"> The speech is sampled at 16 KHz, and preemphasized with a filter of 1 - 0.97z -1. Then, a Hamming window with a width of 20 msec is applied every 10 msec. Autocorrelation analysis with order 14 is followed by LPC analysis with order 14. Finally, 12 LPC-derived cepstral coefficients are computed from the LPC coefficients, and these LPC cepstral coefficients are transformed to a mel-scale using a bilinear transform. null These 12 coefficients are vector quantized into a codebook of 256 prototype vectors. In order to incorporate additional speech parameters, we created two additional codebooks. One codebook is vector quantized from differential coefficients. The differential coefficient of frame n is the difference between the coefficient of frame n+2 and frame n-2. This 40 msec.</Paragraph> <Paragraph position="1"> difference captures the slope of the spectral envelope.</Paragraph> <Paragraph position="2"> The other codebook is vector quantized from energy and differential energy values.</Paragraph> </Section> <Section position="5" start_page="0" end_page="125" type="metho"> <SectionTitle> 3. 
<Section position="5" start_page="0" end_page="125" type="metho"> <SectionTitle> 3. Context-Independent HMM Training </SectionTitle> <Paragraph position="0"> SPHINX is based on phonetic hidden Markov models.</Paragraph> <Paragraph position="1"> We identified a set of 48 phones, and a hidden Markov model is trained for each phone. Each phonetic HMM contains three discrete output distributions of VQ symbols. Each distribution is the joint density of the three codebook pdf's, which are assumed to be independent.</Paragraph> <Paragraph position="2"> The use of multiple codebooks was introduced by Gupta et al. \[4\].</Paragraph> <Paragraph position="3"> We initialize our training procedure with the TIMIT phonetically labeled database. With this initialization, we use the forward-backward algorithm to train the parameters of the 48 phonetic HMMs. The training corpus consists of 4200 task-domain sentences spoken by 105 speakers. For each sentence, word HMMs are constructed by concatenating phone HMMs. These word HMMs are then concatenated into a large sentence HMM, and trained on the corresponding speech. Because the initial estimates are quite good, only two iterations of the forward-backward algorithm are run. This training phase produces 48 context-independent phone models. In the next two sections, we will discuss the second training phase for context-dependent phone models.</Paragraph> </Section> <Section position="6" start_page="125" end_page="125" type="metho"> <SectionTitle> 4. Function Word/Phrase Dependent Models </SectionTitle> <Paragraph position="0"> One problem with continuous speech is the unclear articulation of function words, such as a, the, in, of, etc. Since the set of function words in English is limited and function words occur frequently, it is possible to model each phone in each function word separately. By explicitly modeling the most difficult sub-vocabulary, the recognition rate can be increased substantially. We selected a set of 42 function words, which contained 105 phones. We modeled each of these phones separately.</Paragraph> <Paragraph position="1"> We have found that function words are hardest to recognize when they occur in clusters, such as that are in the. The words are even less clearly articulated, and have strong inter-word coarticulatory effects. In view of this, we created a set of phone models specific to function phrases, which are phrases that consist of only function words. We identified 12 such phrases, modified the pronunciations of these phrases according to phonological rules, and modeled the phones in them separately. A few examples of these phrases are: is the, that are, and of the.</Paragraph> </Section>
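As a minimal sketch of how function-word-dependent phone models can be set up (our illustration; the lexicon, the phone symbols, and the naming scheme are assumed, not taken from SPHINX), the snippet below gives every phone of a designated function word its own model label, while all other words keep the shared context-independent phone labels:

# Hypothetical pronunciation lexicon: word -> list of phone symbols.
LEXICON = {
    "the": ["DH", "AX"],
    "of":  ["AX", "V"],
    "in":  ["IH", "N"],
    "speech": ["S", "P", "IY", "CH"],
}
FUNCTION_WORDS = {"the", "of", "in"}  # the paper uses a set of 42 such words

def expand_function_word_phones(lexicon, function_words):
    """Give every phone of every function word its own dedicated model name,
    e.g. 'DH' in 'the' becomes 'DH(the)'; other words keep shared phone models."""
    expanded = {}
    for word, phones in lexicon.items():
        if word in function_words:
            expanded[word] = [f"{p}({word})" for p in phones]
        else:
            expanded[word] = list(phones)
    return expanded

print(expand_function_word_phones(LEXICON, FUNCTION_WORDS))
# {'the': ['DH(the)', 'AX(the)'], ..., 'speech': ['S', 'P', 'IY', 'CH']}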
<Section position="7" start_page="125" end_page="126" type="metho"> <SectionTitle> 5. Generalized Triphone Models </SectionTitle> <Paragraph position="0"> The function-word and function-phrase dependent phone models provide better representations of the function words. However, simple phone models for the non-function words are inadequate, because the realization of a phone crucially depends on context. In order to model the most prominent contextual effect, Schwartz et al. \[5\] proposed the use of triphone models. A different triphone model is used for each left and right context. While triphone models are sensitive to neighboring phonetic contexts, and have led to good results, there are a very large number of them, which can only be sparsely trained. Moreover, they do not take into account the similarity of certain phones in their effect on other phones (such as /b/ and /p/ on vowels).</Paragraph> <Paragraph position="1"> In view of this, we introduce the generalized triphone model. Generalized triphones are created from triphone models using a clustering procedure: 1. An HMM is generated for every triphone context. 2. Clusters of triphones are created; initially, each cluster consists of one triphone.</Paragraph> <Paragraph position="2"> 3. Find the most similar pair of clusters which represent the same phone, and merge them.</Paragraph> <Paragraph position="3"> 4. For each pair of same-phone clusters, consider moving every element from one to the other: (a) move the element if the resulting configuration is an improvement; (b) repeat until no such moves are left.</Paragraph> <Paragraph position="4"> 5. Until some convergence criterion is met, go to step 2.</Paragraph> <Paragraph position="5"> To determine the distance between two models, we use the following distance metric:</Paragraph> <Paragraph position="6"> D(a,b) = [ \prod_i P_a(i)^{N_a(i)} \cdot \prod_i P_b(i)^{N_b(i)} ] / \prod_i P_m(i)^{N_m(i)}   (1) </Paragraph> <Paragraph position="7"> where D(a,b) is the distance between two models of the same phone in contexts a and b, P_a(i) is the output probability of codeword i in model a, and N_a(i) is the count of codeword i in model a; m is the model merged from a and b by adding the counts N_a and N_b. In measuring the distance between the two models, we only consider the output probabilities, and ignore the transition probabilities, which are of secondary importance.</Paragraph> <Paragraph position="8"> Equation 1 measures the ratio between the probability that the individual distributions generated the training data and the probability that the combined distribution generated the training data. Thus, it is consistent with the maximum-likelihood criterion used in the forward-backward algorithm. This distance metric is equivalent to, and was motivated by, the entropy clustering used in \[6\] and \[7\].</Paragraph> <Paragraph position="9"> This context generalization algorithm provides the ideal means for finding the equilibrium between trainability and sensitivity. Given a fixed amount of training data, it is possible to find the largest number of trainable detailed models. Armed with this technique, we could attack any problem and find the &quot;right&quot; number of models that are as sensitive and trainable as possible. This is illustrated in Figure 1, which shows that the optimal number of models increases as the training data is increased.</Paragraph> </Section>
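The sketch below (an added illustration; it assumes each model is summarized by the codeword counts of a single output distribution) evaluates Equation 1 in the log domain, so that the most similar pair of same-phone clusters can be found and merged as in steps 2-3 of the procedure above:

import numpy as np

def merge_distance(counts_a, counts_b, eps=1e-12):
    """Log of Equation 1: log P(data | a) + log P(data | b) - log P(data | merged).
    counts_a, counts_b: codeword count vectors N_a(i), N_b(i) for the same phone.
    Smaller values mean the two context-dependent models are more similar."""
    counts_a = np.asarray(counts_a, dtype=float)
    counts_b = np.asarray(counts_b, dtype=float)
    counts_m = counts_a + counts_b

    def loglik(counts):
        p = counts / max(counts.sum(), eps)        # ML estimate of the output pdf
        return float(np.sum(counts * np.log(p + eps)))

    return loglik(counts_a) + loglik(counts_b) - loglik(counts_m)

def most_similar_pair(clusters):
    """clusters: dict name -> count vector for one phone;
    returns (distance, name1, name2) for the pair with minimum merging cost."""
    names = list(clusters)
    pairs = [(merge_distance(clusters[x], clusters[y]), x, y)
             for i, x in enumerate(names) for y in names[i + 1:]]
    return min(pairs)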
<Section position="8" start_page="126" end_page="127" type="metho"> <SectionTitle> 6. Between-Word Coarticulation Modeling </SectionTitle> <Paragraph position="0"> Triphone and generalized triphone models are powerful subword modeling techniques because they account for the left and right phonetic contexts, which are the principal causes of phonetic variability. However, triphone-based models consider only intra-word context. For example, in the word speech (/s p iy ch/), both left and right contexts for /p/ and /iy/ are known, while the left context for /s/ and the right context for /ch/ are a special symbol for &quot;word boundary&quot;. However, in continuous speech, a word-boundary phone is strongly affected by the phone beyond the word boundary. This is especially true for short function words like the or a.</Paragraph> <Paragraph position="1"> A simple extension of triphones to model between-word coarticulation is problematic because the number of triphone models grows sharply when between-word triphones are considered. For example, there are 2381 within-word triphones in our 991-word task, but 7057 triphones when between-word triphones are also considered.</Paragraph> <Paragraph position="2"> Therefore, generalized triphones are particularly suitable for modeling between-word coarticulation. We first generated 7057 triphone models that accounted for both intra-word and inter-word triphones. These 7057 models were then clustered into 1000 generalized triphone models. The membership of each generalized triphone is retained, so that inter-word contextual constraints can be applied during training and recognition. The main change in the training algorithm is in the construction of the sentence model. Two connections are now needed to link two words together. The first uses the known context to connect the appropriate triphones, and the second allows for the possibility of a between-word silence; in that case, a silence context is used. Figure 2 illustrates the word boundary network of two words, where word w1 consists of phones A, B, and C, and word w2 consists of D, E, and F; P(L,R) represents a phone P with left-context phone L and right-context phone R.</Paragraph> <Paragraph position="3"> For words with only one or two phones, sentence model concatenation is more complex. If w2 is pronounced (D E), then both D(C,E) and D(SIL,E) must be further forked into E(D,X) and E(D,SIL), where X is the first phone of the next word. This is even more complicated when several one-phone and two-phone words are concatenated. To reduce the complexity of the pronunciation graph of a sentence, we introduce dummy states to merge transitions whose expected contexts are the same.</Paragraph> <Paragraph position="4"> The recognition algorithm must be modified because words may now have multiple beginning and ending phones. Figure 3 illustrates the connection between two words during recognition. As in the training phase, the two words are connected both directly and through a silence. If one or both of the triphones has not occurred in the training data, we use the context-independent phone (or monophone) instead. Therefore, the direct connection between two words could be embodied in one of four forms: * triphone to triphone.</Paragraph> <Paragraph position="5"> * triphone to monophone.</Paragraph> <Paragraph position="6"> * monophone to triphone.</Paragraph> <Paragraph position="7"> * monophone to monophone.</Paragraph> <Paragraph position="8"> The modeling of between-word coarticulation reduced SPHINX's error rate by 24-44%, for different test sets and grammars. More details about our implementation and results can be found in \[8\].</Paragraph> </Section>
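A minimal sketch of the triphone-to-monophone fallback at a word boundary (our illustration; the model naming convention and the inventory contents are assumptions) is shown below. It produces the direct connection and the connection through an optional between-word silence, yielding the four connection forms listed above whenever a between-word triphone was not seen in training:

def boundary_model(phone, left, right, inventory):
    """Return the triphone 'phone(left,right)' if it occurred in training,
    otherwise fall back to the context-independent monophone."""
    name = f"{phone}({left},{right})"
    return name if name in inventory else phone

def connect_words(last_of_w1, prev_of_w1, first_of_w2, next_of_w2, inventory):
    """Models used at the boundary of w1 -> w2, both for the direct
    connection and for the path through an optional between-word silence."""
    direct = (boundary_model(last_of_w1, prev_of_w1, first_of_w2, inventory),
              boundary_model(first_of_w2, last_of_w1, next_of_w2, inventory))
    via_sil = (boundary_model(last_of_w1, prev_of_w1, "SIL", inventory),
               "SIL",
               boundary_model(first_of_w2, "SIL", next_of_w2, inventory))
    return direct, via_sil

# e.g. w1 = /A B C/, w2 = /D E F/ as in Figure 2 (hypothetical trained triphones):
inventory = {"C(B,D)", "D(C,E)", "C(B,SIL)"}
print(connect_words("C", "B", "D", "E", inventory))
# (('C(B,D)', 'D(C,E)'), ('C(B,SIL)', 'SIL', 'D'))  -- 'D(SIL,E)' unseen, so monophone 'D'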
<Section position="9" start_page="127" end_page="128" type="metho"> <SectionTitle> 7. Corrective Training </SectionTitle> <Paragraph position="0"> Bahl et al. \[1\] introduced the corrective training algorithm for HMMs as an alternative to the forward-backward algorithm. While the forward-backward algorithm attempts to increase the probability that the models generated the training data, corrective training attempts to maximize the recognition rate on the training data. This algorithm has two components: (1) error-correction learning, which improves correct words and suppresses misrecognized words, and (2) reinforcement learning, which improves correct words and suppresses near-misses. Applied to the IBM speaker-dependent isolated-word office correspondence task, this algorithm reduced the error rate by 16% on test data and 88% on training data. This improvement, while significant, leaves a large gap between training and test performance, which suggests that corrective training becomes overly specialized for the training data.</Paragraph> <Paragraph position="1"> In this study, we extend the corrective and reinforcement learning algorithm to speaker-independent, continuous speech recognition. Speaker independence may present some problems, because corrections appropriate for one speaker may be inappropriate for another. However, with a speaker-independent task, it is possible to collect and use a large training set. More training provides not only improved generalization but also greater coverage of the vocabulary. We also use cross-validation to increase the effective training data size. Cross-validation partitions the training data and determines misrecognitions using models trained on different partitions. This simulation of actual recognition leads to more realistic misrecognitions for error correction.</Paragraph> <Paragraph position="2"> Extension to continuous speech is more problematic. With isolated-word input, both error-correcting and reinforcement training are relatively straightforward, since all errors are simple substitutions. Bahl et al. \[1\] determined both misrecognized words and near-misses by matching the utterance against the entire vocabulary. However, with continuous speech, the errors include insertions and deletions. Moreover, many substitutions appear as phrase substitutions, such as home any for how many. These problems make reinforcement learning difficult. We propose an algorithm that hypothesizes near-miss sentences for any given sentence. First, a dynamic programming algorithm is used to align each correct sentence with the corresponding misrecognized sentence in the cross-recognized training set to produce an ordered list of likely phrase substitutions. Since simple text-to-text alignment would not be sensitive to sub-word and sub-phone similarities, we used a frame-level distance metric. This list of phrase substitutions is then used to randomly hypothesize near-miss sentences for reinforcement learning.</Paragraph> <Paragraph position="3"> Our experiments with corrective and reinforcement learning showed that our modifications led to a 20% error-rate reduction without grammar (72% on the training set), and a 23% reduction with grammar (63% on the training set). This demonstrated that increased training, both through speaker-independent data collection and through cross-validation, narrowed the gap between the results on training and testing data. Furthermore, this showed that our extension of the IBM corrective training algorithm to continuous speech was successful.</Paragraph> <Paragraph position="4"> More details about this work are described in \[9\] and \[10\].</Paragraph> </Section>
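As a rough sketch of how near-miss sentences can be generated for reinforcement learning (our illustration; the substitution list, its probabilities, and the sampling scheme are assumptions, whereas the paper derives its list from a frame-level DP alignment of correct and cross-recognized sentences), likely phrase substitutions can be applied at random to the correct transcription:

import random

# Hypothetical ordered list of likely phrase substitutions learned from
# aligning correct and misrecognized training sentences (most likely first).
PHRASE_SUBS = [
    ("how many", "home any"),
    ("what is", "what was"),
    ("the", "a"),
]

def hypothesize_near_misses(sentence, subs=PHRASE_SUBS, n=5, seed=0):
    """Generate near-miss sentences by randomly replacing one matching
    phrase with its confusable counterpart (for reinforcement learning)."""
    rng = random.Random(seed)
    near_misses = set()
    for _ in range(n * 10):                 # oversample, keep unique hypotheses
        correct, confused = rng.choice(subs)
        if correct in sentence:
            near_misses.add(sentence.replace(correct, confused, 1))
        if len(near_misses) >= n:
            break
    return sorted(near_misses)

print(hypothesize_near_misses("how many ships are in the gulf"))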
<Section position="10" start_page="128" end_page="128" type="metho"> <SectionTitle> 8. Summary of Training Procedure </SectionTitle> <Paragraph position="0"> The SPHINX training procedure operates in three stages. In the first stage, 48 context-independent phonetic models are trained. In the second stage, the models from the first stage are used to initialize the training of context-dependent phone models, which could be generalized triphone models and/or the function word/phrase dependent models. Since many parameters in the context-dependent models were never observed, we interpolate the context-dependent model parameters with the corresponding context-independent ones. We use deleted interpolation \[11\] to derive appropriate weights in the interpolation. The third and final stage uses corrective training to refine the discriminatory ability of the models. The SPHINX training procedure is shown in Figure 4.</Paragraph> <Paragraph position="1"> For recognition, we use a Viterbi search that finds the optimal state sequence in a large HMM network. At the highest level, this HMM is a network of word HMMs, arranged according to the grammar. Each word is instantiated with its phonetic pronunciation network, and each phone is instantiated with the corresponding phone model. Beam search is used to reduce the amount of computation.</Paragraph> <Paragraph position="2"> One problem with HMMs is that they do not provide very good duration models. We incorporated word duration into SPHINX as part of the Viterbi search. The duration of a word is modeled by a univariate Gaussian distribution, with the mean and variance estimated from a supervised Viterbi segmentation of the training set. By precomputing the duration score for various durations, this duration model adds essentially no overhead.</Paragraph> </Section> </Paper>
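A minimal sketch of the word-duration component (our illustration; the word statistics shown and the weighting of the duration score against the acoustic score are assumed) precomputes the Gaussian log score for each possible duration, so that the Viterbi search can add it with a single table lookup at a word exit:

import math

def duration_table(mean, var, max_frames=200):
    """Precompute log N(d; mean, var) for every duration d (in frames),
    so adding the duration score at a word exit is a table lookup."""
    return [-0.5 * math.log(2 * math.pi * var) - (d - mean) ** 2 / (2 * var)
            for d in range(max_frames + 1)]

# Mean and variance estimated from a supervised Viterbi segmentation of the
# training set (hypothetical numbers for the word "ships").
SHIPS_DUR = duration_table(mean=35.0, var=40.0)

def word_exit_score(acoustic_log_prob, duration_frames, table, weight=1.0):
    """Combine the Viterbi path score with the word-duration log score;
    'weight' balances the two terms and is an assumed tuning knob."""
    d = min(duration_frames, len(table) - 1)
    return acoustic_log_prob + weight * table[d]

print(word_exit_score(-1234.5, 28, SHIPS_DUR))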