<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-1010">
  <Title>The BBN BYBLOS Continuous Speech Recognition System</Title>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
BBN BYBLOS Continuous Speech Recognition System
</SectionTitle>
    <Paragraph position="0"> The BYBLOS system uses context-dependent hidden Markov models of phonemes to provide a robust model of phonetic coarticulation. We provide an update of the ongoing research aimed at improving the recognition accuracy. In the first experiment we confirm the large improvement in accuracy that can be derived by using spectral derivative parameters in the recognition. In particular, the word error rate is reduced by a factor of two. Currently the system achieves a word error rate of 2.9% when tested on the speaker-dependent part of the standard</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1000-Word DARPA Resource Management Database
</SectionTitle>
    <Paragraph position="0"> using the Word-Pair grammar supplied with the database.</Paragraph>
    <Paragraph position="1"> When no grammar was used, the error rate is 15.3%.</Paragraph>
    <Paragraph position="2"> Finally, we present a method for smoothing the discrete densities on the states of the HMM, which is intended to alleviate the problem of insufficient training for detailed phonetic models.</Paragraph>
    <Paragraph position="3"> i. Introduction At BBN we have been involved in the development of Spoken Language Systems for almost two decades. As part of DARPA's Speech Understanding Research Program from 1971-1976, we developed a system that integrated continuous speech recognition with natural language understanding in a 1000-word travel management task; we call the system HWIM (Hear What I Mean). As part of another DARPA program, we have been working since 1982 on a more advanced speech recognition system based on using Hidden Markov Models. The result of this work is the BYBLOS Continuous Speech Recognition System.</Paragraph>
    <Paragraph position="4"> The basic algorithms used in the BYBLOS Continuous Speech Recognition system have been described in several papers \[1, 2, 3\]. In Section 2 we give a brief review of the techniques currently used in the BYBLOS system. The two features that have made the largest improvements in recognition accuracy since 1982 were the use of robust context-dependent phonetic models, and the addition of derivative spectral parameters in multiple codebooks. Each of these features used separately reduces the recognition error rate by a factor of two. Taken together, they reduce the error rate by a factor of four. In Section 3 we present the latest recognition results for the BYBLOS system. In particular, we compare the recognition results with and without spectral derivative parameters. We also demonstrate, by testing the system on training data, that the recognition accuracy is likely to improve as more training data is made available. Since several similar systems have provided test results on this database it is possible to determine the benefits of particular algorithms. In particular, we compare the error rate for using discrete densities with that using continuous densities. We also compare the recognition accuracy for speaker-dependent models with that for speaker-independent models derived from a large number, of speakers. Finally, in Section 4, we present a method for smoothing the discrete densities on the states of the HNIM. The smoothing is intended to alleviate the problem of insufficient training for detailed phonetic models.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="94" type="metho">
    <SectionTitle>
2. The BYBLOS system
</SectionTitle>
    <Paragraph position="0"> The BYBLOS system uses context-dependent hidden Markov models (HMM) of phonemes to provide a robust model of coarticulation \[ 1, 2\]. Each phoneme is typically modeled as a HMM with three states that correspond roughly to the acoustics of the beginning, middle, and end of the phoneme. To model the acoustic coarticulation between phonemes, we define a separate HMM for each phoneme in each of its possible contexts. Since many of these phonetic contexts do not occur frequently enough to allow robust estimation of model parameters, we interpolate the detailed context-dependent phonetic models with models of the same phoneme that are dependent on less context. In this way we derive the benefit of word-based models for words with sufficient training and the generality of phoneme-based models for the rest. For example, we use triphone models that depend jointly on the preceding and following phonemes, we use diphone models that depend separately on the preceding or following  context, and we use context-independent models that are pooled across all instances of the phoneme. We have also experimented with models of the phoneme that depend on the particular word that the phoneme is in \[3\]. We average the probabilities of the different context-dependent models with weights that depend on the state within the phoneme and on the number of occurrences of each type of context in the training set.</Paragraph>
    <Paragraph position="1"> With each state of the HMM we associate a conditional probability density of the spectral features given that state. The basic spectral features are reel-scaled cepstral coefficients (MFCC)\[4\] and the log of the normalized total energy. We derive the MFCC by warping the log power spectrum of each frame of speech before computing the cepstrum (by inverse Fourier transform). A portion of the training set of MFCC vectors is clustered to produce a codebook of spectral prototypes \[5\]. We typically use a codebook with 256 prototypes. Then for each frame we find the index of the nearest vector quantizer (VQ) prototype. The discrete probability density is therefore represented as a vector of 256 numbers indicating the probability of each VQ index given the state.</Paragraph>
    <Paragraph position="2"> The decoding algorithm used in the BYBLOS system has been described in \[2\]. The algorithm is a time-synchronous beam search for the most likely sequence of words, given the observed speech parameters. The algorithm is similar to the commonly used Viterhi algorithm with the exception that, when performing the state update within a word, the probability of being in a particular state is derived from the sum of the probabilities at each of the preceding states. This is contrasted with the standard Viterbi algorithm, in which we use the maximum over the preceding states. This algorithm more nearly computes the correct likelihood function for each sequence of words and was found to result in a small but consistent improvement over the standard Viterbi algorithm. As with the Viterbi algorithm, the search can be constrained by any f'mite-state grammar. It has also been used in a top-down search using context-free grammars.</Paragraph>
    <Section position="1" start_page="94" end_page="94" type="sub_section">
      <SectionTitle>
Multiple Codebooks
</SectionTitle>
      <Paragraph position="0"> As shown by Furui \[6\], even though the sequence of spectral parameter vectors may be sufficient to reproduce a reasonable facsimile of the original speech, it is beneficial to explicitly include the derivatives of the spectral parameters in the recognition algorithm. To avoid problems associated with trying to estimate probability densities of large dimensional spaces, we use a separate VQ codebook and probability distribution for the steady state and derivative parameter sets. We multiply the probabilities for the different parameter sets as if they were independent \[7l. During the past year we modified the BYBLOS system to use multiple sets of features.</Paragraph>
      <Paragraph position="1"> Currently, the BYBLOS system uses three sets of spectral features: 14 reel-scale cepstral coefficients, the 14 derivatives of these parameters (computed as the derivative of a linear fit to 5 successive frames), and a third set containing the normalized total energy and the derivative of the energy.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="94" end_page="94" type="metho">
    <SectionTitle>
3. Results
</SectionTitle>
    <Paragraph position="0"> In this section we present the recognition results for the BYBLOS system under several different conditions.</Paragraph>
    <Paragraph position="1"> But first, we describe the database and the testing procedure used for all the results in this paper.</Paragraph>
  </Section>
  <Section position="6" start_page="94" end_page="98" type="metho">
    <SectionTitle>
DARPA Resource Management Database
</SectionTitle>
    <Paragraph position="0"> Most of the recent research with the system has been performed using the standard 1000-word DARPA Resource Management Database \[8\]. Tests were performed on the speaker-dependent portion of the database which contains the speech of 12 speakers. The training set for each speaker consists of 600 sentences averaging eight words or three seconds in length, for a total of about 30 minutes of speech. There are two test sets of 100 sentences each. The first test set is designated as &amp;quot;development test&amp;quot; and was distributed by the National Bureau of Standards for formal tests. 25 sentences from 8 of the 12 speakers were distributed in October, 1987; 25 different sentences from all 12 speakers were distributed in May, 1988. After these two formal tests, all I00 of the development test sentences were released for development purposes. The remaining 100 test sentences, which were designated as &amp;quot;evaluation test&amp;quot;, were also divided into 25 sentence groups and are being distributed in a similar manner for further formal testing. The first set of 25 was distributed for February, 1989. In the remainder of this paper, we will refer to the different formal test sets by their date of distribution, (e.g.</Paragraph>
    <Paragraph position="1"> the Oct. '87 test set, etc.).</Paragraph>
    <Paragraph position="2"> In addition to the speech data itself, the database also contains a specification of two grammars to be used for testing and testing procedures to be used to assure that results from different research sites can be compared.</Paragraph>
    <Paragraph position="3"> Recognition runs typically are performed using an artificial &amp;quot;Word-Pair&amp;quot; grammar with perplexity 60 that allows all pairs of word classes that appear in the database, and with no grammar or perplexity 1000. The recognized strings ate automatically aligned to the tree word strings, and the number of word substitutions, insertions, and deletions are computed. The standard single-number measure of performance is the total word error, which is defined as %error = 100 substitutions + deletions + insertions total words When sentence error rate is quoted, (typically only when using a grammar) it is defined as the percentage of sentences with any error at all.</Paragraph>
    <Section position="1" start_page="95" end_page="98" type="sub_section">
      <SectionTitle>
Multiple Codebook Results
</SectionTitle>
      <Paragraph position="0"> We compared the recognition accuracy when the system used 14 steady state cepstral coefficients with that when it used three codebooks (including derivative and energy parameters). The comparison was made on the Oct.</Paragraph>
      <Paragraph position="1"> '87 test set of 8 speakers under the two grammar conditions. Table 1 below contains the results of this comparison. As can be seen, the use of derivative (and energy) information reduced the error rate by about a factor of two under both grammar conditions.</Paragraph>
      <Paragraph position="2">  with steady state parameters vs 3 codebooks with added derivative and energy parameters.</Paragraph>
      <Paragraph position="3"> The results given above for 3 codebooks were development results, in that the t~st set had been used several times. Therefore, we present below in Table 2 the results of testing the system on all 12 speakers on the May '88 and Feb '89 test sets for the first time, using the same phonetic word models as used above.</Paragraph>
      <Paragraph position="4"> The average word error rates were 3.4% and 2.9% when the Word-Pair grammar was used, and 16.2% and 15.3% when no grammar was used. The difference  Grammar for May '88 and Feb '89 Test Sets between the error rates of the two test sessions is only marginally significant especially given the variation between speakers. Table 3 below shows the detailed results for the February '89 test sets for each of the 12 speakers. The table gives the percent substitution, deletion, and insertion errors, in addition to the total word and sentence error rates.</Paragraph>
      <Paragraph position="5">  the Twelve Speakers on the Feb '89 Test Set.</Paragraph>
      <Paragraph position="6"> Results are given with and without the Word-Pair grammar. For each condition and speaker, the table shows the percent substitution, deletion, and insertion errors, and total word error. Percent sentence error is also given for the Word-Pair grammar.</Paragraph>
      <Paragraph position="7"> Test on Training It is frequently instructive to measure the recognition performance of a system when it is tested on data that was included m the training set. In Table 4 below we compare the word and sentence recognition error rate when the system is tested on the training set versus when it is tested on an independent test set. The same acoustic models were used in both cases. Results are given for the Word-Pair grammar only.</Paragraph>
      <Paragraph position="8"> As can be seen, when the system is tested on training data, the error rates are very small. This large difference in performance indicates that there is not enough training data for the number of free parameters we have in our phonetic models. Therefore, we might expect that recognition accuracy would improve considerably as we add more training data.</Paragraph>
      <Paragraph position="9">  Comparison of Methods Several other research groups have also reported their recognition results on this same database. Since, in many cases, the algorithms differ in only one or two aspects, it is possible to identify differences in performance with particular aspects of a system. In this section, we attempt to make two such comparisons: discrete vs continuous densities, and speaker-dependent vs speaker-independent models. The comparisons are made on the results provided for the May '88 test set because the different systems were most similar for this test set. We note that each of these systems has evolved since testing on this particular test set, and as a result their results have improved considerably as can be seen in the results presented for those systems elsewhere in this volume.</Paragraph>
      <Paragraph position="10"> The continuous speech recognition system developed at MIT Lincoln Labs by Doug Paul uses Gaussian probability densities to represent each of the states of the HMM instead of the discrete densities used in BYBLOS. In most other respects, the two systems are quite similar. The recognition accuracy for the speaker-dependent test on the May '88 test set was 5.5%, as compared with 3.4% for the BYBLOS system. It would appear, then, that continuous HMM densities do not necessarily provide improved results over discrete densities.</Paragraph>
      <Paragraph position="11"> Another comparison of interest is the relative performance of speaker-dependent models versus speaker-independent models. While it is clear that, for any given duration of training, a speaker-dependent model (trained for the particular speaker using the system) should always result in much higher recognition accuracy, the practical question remains, &amp;quot;How much more training does a speaker-independent system need to give the same accuracy as a speaker-dependent system?&amp;quot; Three systems that are almost identical to BYBLOS have been used on the speaker-independent portion of the Resource Management Database. Two different training sets have been used in the tests on the May '88 test set: one with 72 different speakers containing 2880 sentences, and a larger one with 105 speakers containing 4200 sentences. The test data used was the same as for the speaker-dependent test described above.</Paragraph>
      <Paragraph position="12"> When trained on 72 speakers the word error rate with the Word-Pair grammar was 10.1% for the Sphinx system of Carnegie Mellon University, I 1.4% for the Decipher system of Stanford Research Institute, and 13.1% for the Lincoln Labs system. The Sphinx system and the Decipher system both use discrete densities similar to those used in BYBLOS. When trained on 105 speakers, the error rates for Sphinx and the Lincoln system were 8.9% and 10.1% respectively. Thus, the BYBLOS system with speaker-dependent training with five to seven times less training data has roughly 1/2 to 1/3 the error rate of the speaker-independent trained systems. It would be interesting to fred out how much additional speech is needed for speaker-independent training to result in the same performance as  30-minute speaker-dependent training.</Paragraph>
      <Paragraph position="13"> 4. Robust Smoothing for  Much of the research in speech recognition is devoted to improving the structure of the statistical model of speech. Frequently, improving the model involves increasing the complexity or dimensionality of the model. For example, we use context-dependent phonetic models, which increases the number of models. We add features, such as spectral derivatives, which increases the dimensionality of the feature space. We use a non-parametric probability density function (pdf) to have flexibility in the model, but we lose the benefit of the compactness of a parametric model. Each of these improvements comes with an increase in the effective number of degrees of freedom in our model.</Paragraph>
      <Paragraph position="14"> Unfortunately, more training data is needed to estimate reliably the increased number of free parameters.</Paragraph>
      <Paragraph position="15"> Conversely, faced with a fixed amount of training data, we must limit the number of free parameters or else our &amp;quot;improvements&amp;quot; will not be realized.</Paragraph>
      <Paragraph position="16"> As described above the BBN BYBLOS Continuous Speech Recognition system uses discrete nonparametric pdfs of context-dependent phonetic models. Most of these pdfs are trained with only a few tokens of speech (typically between 1 and 1'0). These discrete distributions work surprisingly well, given the small amount of training. However, they are certainly prone to the problem of spectral types that do not appear in the training set for a given model, but are, in fact, likely to occur for that model. The results presented in Table 4 in Section 2 indicate that there is a large difference in recognition rate when the system is tested on the training data and on independent test data. Therefore, we tried to find a smoothing algorithm that would reduce the number of probabilities that are low purely due to a lack of training. Below we describe a general smoothing method based on using a probabilistic smoothing matrix \[9\].</Paragraph>
      <Paragraph position="17"> For each state of a discrete HMM, we have a discrete probability density function (pdf) defined over a fixed set, N, of spectral templates. For example, in the BYBLOS system we typically use a vector quantization (VQ) codebook of size N=256 \[5\]. The index of the closest template is referred to below as the VQ index or the spectral bin. We can view the discrete pdf for each state s as a probability row vector</Paragraph>
      <Paragraph position="19"> where P(kils) is the probability of spectral template k i at  state s. We can imagine that the probabilities of different spectra are related in that, for each spectrum that has a high probability for a given lxtf, there are several other spectra that are also likely to have high probabilities. These might be &amp;quot;nearby&amp;quot; spectra, or they might just be statistically related. We represent this relation by p(kj4ki), the probability that if spectrum k i occurs, the spectrum kj will occur also. The set of probabilities p(k~4k i) for all i and j form an NxN smoothing matrix, T, where Tij = p(k14ki).</Paragraph>
      <Paragraph position="20"> If we multiply the original pdf row vector p(s) by the smoothing matrix, we get a smoothed pdf row vector.</Paragraph>
      <Paragraph position="22"> In our experiments we use a separate smoothing matrix for each phoneme. This matrix is combined with the phoneme-independent matrix to ensure robustness.</Paragraph>
      <Paragraph position="23"> The amount of training available for different models varies considerably, from one or two tokens for the majority of the triphone-dependent models to hundreds of tokens for the more common models. Clearly, we don't want to smooth a model as much ff it was estimated from a large number of training tokens. Therefore we recombine the smoothed pdf above with the original pdf using a weight w(s) that depends on the number of training tokens of the model. Thus the final pdf used is given by</Paragraph>
      <Paragraph position="25"> The weight w is made proportional to the log of the number of training tokens, N T.</Paragraph>
      <Paragraph position="27"> This equation is illustrated in Figure 1.</Paragraph>
      <Paragraph position="28">  function of the number of training tokens, N T Estimating the Matrix We have tried three techniques for estimating the smoothing matrix: Parzen smoothing, self adaptation cooccurrence smoothing, and triphone cooccurrence smoothing. These methods were presented in a talk at Arden House in May 1988 and are described in detail in \[10\]. Since the third method worked best in our initial experiments, we will discuss only that method.</Paragraph>
      <Paragraph position="29"> After performing forward-backward training, we have a large number of context-dependent phonetic models. Most of these (about 2,500) are triphone-dependem models. Each model has three different pdfs. These models contain a record of all of the VQ-index spectra that occurred for one part (one state) of a particular triphone. Thus, according to the Markov model, these spectra freely cooccur. For each pdf of each triphone model we count all permutations of two VQ spectra in that pdf, weighted by their probabilities and by the number of training tokens of the model. Figure 2 illustrates this process for one pdf of one model  Estimation. pdf shown results in matrix increments shown.</Paragraph>
      <Paragraph position="30"> For example the pdf shown has VQ indices 27, 112, and 198 with probabilities 0.3, 0.5, 0.2 respectively. The model occurred 20 times in the training set. Therefore, we add 0.3 * 0.5 * 20 = 3.0 to entries (27,112) and (112.27) m the matrix. As with the second method, we keep a separate matrix for each phoneme and one phoneme-independent matrix. Each row is normalized to create probabilistic matrices. A method similar to this was developed independently by Lee \[11\]. However, in his method there was only one smoothing matrix, instead of one for each phoneme, and he estimated the matrix from context-independent models instead of triphone-dependent models. We believe that these differences result in too much smoothing.</Paragraph>
      <Paragraph position="31"> Recognition experiments using the word-pair grammar were performed with and without triphone cooccurrence smoothing on all three test sets. These results are shown below in Table 5.</Paragraph>
      <Paragraph position="32">  Recognition System. As expected, we found that adding the derivative and energy parameters in separate codebooks reduced the error rate by a factor of two, relative to using the steady state spectral parameters alone. The resulting 6. word error rate was 3.4% and 2.9% on two successive formal tests. We presented an algorithm for smoothing discrete probability densities when the training is insufficient. However the algorithm provided only a small gain in recognition accuracy when 30 minutes were 7. available for training. The HMM systems based on nonparametric discrete densities resulted in higher accuracy than the system that used continuous densities, leaving open the question of whether it is harmful to quantize the spectral parameters. The error rate of the speaker-dependent system when trained with 30 minutes of speech 8. was less than half that of similar speaker-independent systems trained on over 100 speakers with five to seven times the amount of speech.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>