<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1078">
  <Title>TECHNIQUES TO ACHIEVE AN ACCURATE REAL-TIME LARGE-VOCABULARY SPEECH RECOGNITION SYSTEM</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. MODELING TRADE-OFFS
</SectionTitle>
    <Paragraph position="0"> The speed/accuracy trade-off of our speech recognition systems can be adjusted in several ways. The standard approaches are to adjust the beam width of the Viterbi search and to change the</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="395" type="metho">
    <SectionTitle>
1. INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Our techniques for achieving real-time, high-accuracy large-vocabulary continuous speech recognition systems focus on the task of recognizing speech from ARPA's Wall Street Journal i. We define real-time systems as those that process 1 second of speech per second.  instance, that eliminating cross-word modeling can significantly improve the speed of our recognition systems at about a 20% cost in word error. In this table, lattice speed refers to recognition accuracy when decoding from precomputed word lattices \[8\]. That is, this is only performing a subset of the search. Actual full grammar recognition time could be from a factor of 3 to an order of magnitude higher. However, it is useful to compare relative lattice decoding speeds from the various techniques.</Paragraph>
    <Paragraph position="1"> A technique frequently used at SRI to achieve relatively fast speech recognition demonstrations is to downgrade our acoustic modeling by implementing a discrete density (VQ) HMM system without cross-word acoustic modeling. This system is then searched using a Viterbi search with a small beam width. Table 2 shows the fuU grammar speed accuracy trade-off when modifying the beam width if a Silicon Graphics Incorporated 2 (SGI) UNIX workstation with a 150-MHz MIPS R4400 CPU 3 is used to  Speed/accuracy trade-off for a beam search perform the computation.</Paragraph>
    <Paragraph position="2"> We have found that this technique gives an unsatisfactory speed/ accuracy trade-off for this task and we have investigated other techniques as described below.</Paragraph>
    <Paragraph position="3">  2. All product names mentioned in this paper are the trademark of their respective holders.</Paragraph>
    <Paragraph position="4"> a. This workstation scores 85 and 93 for the SPECint92 and SPECfp92 benchmarks. For our tests it is roughly 50% faster than an SGI R4000 Indigo, and 50% faster than a SPARC 10/51. It should be between 1/2 and 2/3 the speed of an HP735. SGI R4400 systems cost about  $12,000.</Paragraph>
    <Paragraph position="5"> We explored the use of lexicon trees as a technique for speeding up the decoding times for all modeling techniques. Lexicon trees represent the phonetics of the recognition vocabulary as a tree instead of as a list of pronunciations (lists of phones). With a tree representation, words starting with the same phonetic units share the computation of phonetic models. This technique has been used by others, including the IBM \[10\], Phillips \[7\], and CMU groups, and it is also currently used at LIMSI. Because of the large amount of sharing, trees can drastically reduce the amount of computation required by a speech recognition system. However, lexicon trees have several possible drawbacks: * Phonetic trees are not able to use triphone modeling in all positions since the right phonetic context of a node in a tree can be ambiguous.</Paragraph>
    <Paragraph position="6"> * One cannot implement an admissible Viterbi search for a single lexicon tree when using a bigram language model, because the word being decoded (w2 in the bigram equation P(w2/wl)) may not be known until a leaf in the tree is reached--long after certain Viterbi decisions are typically made.</Paragraph>
    <Paragraph position="7"> The first concern can be addressed by replicating nodes in the free to disambiguate triphone contexts. However, even this may not be necessary because the large majority of right contexts in the tree are unambiguous (that is, most nodes have only one child). This is shown in Table 3, where the concentrations of triphone and biphone models are compared for tree- and linear-lexicon schemes.</Paragraph>
    <Paragraph position="8">  and without lexicon trees The second concern, the ability to model bigram language models using an admissible search strategy, is a problem. As shown in Table 4, moving from a bigram to a unigrarn language model more than doubles our error rate. Ney \[7\] has proposed a scheme where lexicon trees are replicated for each bigram context. It is possible that this scheme would generalize to our application as well. For the three recognition systems in Table 2, on average 7, 13, and 26 different words end each frame. This is the minimum average number of copies of the lexicon tree that the system would need to maintain.</Paragraph>
    <Paragraph position="9">  We have decided to pursue a different approach, which is shown in the figure below. We refer to this technique as approximate bigram trees.</Paragraph>
    <Paragraph position="11"> In an approximate bigram tree, the aim is to model the salient portion of the backed-off bigrarn language model \[11\] in use. In an approximate bigram tree, a standard lexicon tree (incorporating unigram word probabilities) is combined with a bigram section that maintains a linear (non-tree) representation of the vocabulary.</Paragraph>
    <Paragraph position="12"> Bigram and backoff language model transitions are added to the leaves of the tree and to the word-final nodes of the bigram section. 4 When the entire set of bigram is represented, then this network implements a full backed-off bigram language model with an efficient tree-based backoff section. In fact, for VQHMM systems, this scheme halves our typical decoding time for little or no cost in accuracy. Typically, however, we need further reduction in the computational requirement. To achieve this we represent only a subset of the group of bigram transitions (and adjust the backoff probabilities appropriately). This degrades the accuracy of our original bigram language model, but reduces its computational requirements. The choice of which bigrarns to represent is the key design decision for approximate bigram trees. We have experimented with four techniques for choosing bigrarn subsets to see which make the best speed/accuracy trade-offs:  trees.</Paragraph>
    <Paragraph position="13"> The top two lines of the table show that the bigram language model improves perforrnanee from 42.3% word error to 21.5% as compared with a unigram language model. The rest of the table shows how approximate bigram trees can trade off the performance and speed of the bigram model. For instance, in several techniques shown--such as prob 2&amp;--that maintain more than half of the benefit of bigrarns for little computational cost, CPU usage goes from 0.6 to 0.8, when the error rate goes from 42.3% to 29.2%. The rest of the improvement, reducing the error rate from 29.2% to 21%, increases the required computation rate by a factor of four.</Paragraph>
    <Paragraph position="14"> Table 4 also shows that the number of bigrams represented does not predict the computation rate.</Paragraph>
    <Paragraph position="15"> 4. In the actual implementation, word-final nodes in the bigrarn section are merged with their counterparts in the tree so that the bigram transitions need be represented only once. For simplicity, however, we show the system with two sets of bigram probabilities.</Paragraph>
    <Paragraph position="16">  The square root of the perplexity of these language models seems to predict the recognition performance as shown in Table 5.</Paragraph>
  </Section>
  <Section position="5" start_page="395" end_page="396" type="metho">
    <SectionTitle>
4. REDUCING GAUSSIAN
COMPUTATIONS
</SectionTitle>
    <Paragraph position="0"> SRI's most accurate recognition systems, using genonie mixtures, require the evaluation of very large numbers of Gaussian distributions, and are therefore very slow to compute. The baseline system referenced here uses 589 genonic mixtures (genones), each with 48 Gaussian distributions, for a total of 28,272 39-dimensional Gaussians. On ARPA's November 1992 20,000-word Evaluation Test Set, this noncrossword, bigram system performs at 13.43% word error. Decoding time from word lattices is 12.2 times slower than real time on an R4400 processor. Full grammar decoding time would be much slower. Since the decoding time of a genonic recognition system such as this one is dominated by Gaussian evaluation, one major thrust of our effort to achieve real-time recognition has been to reduce the number of Gaussians requiring evaluation each frame. We have explored three methods of reducing Gaussian computation: Gaussian clustering, Gaussian shortlists, and genonic approximations.</Paragraph>
    <Section position="1" start_page="395" end_page="395" type="sub_section">
      <SectionTitle>
4.1. Gaussian Clustering
</SectionTitle>
      <Paragraph position="0"> The number of Gaussians per genone can be reduced using clustering. Specifically, we used an agglomerative procedure to cluster the component densities within each genone. The criteria that we considered were an entropy-based distance and a generalized likelihood-based distance \[6\]. We found that the entropy-based distance worked better. This criterion is the continuous-density analog of the weighted-by-counts entropy of the discrete HMM state distributions, often used for clustering HMM state distributions \[5\], \[3\].</Paragraph>
      <Paragraph position="1"> Our experiments showed that the number of Gaussians per genone can be reduced by a factor of three by first clustering and then performing one additional iteration of the Baum-Welch algorithm as shown in Table 6. The table also shows that clustering followed by additional training iterations gives better accuracy than directly training a system with a smaller number of Gaussians (Table 6, Baseline2). This is especially true as the number of Gaussians per genone decreases.</Paragraph>
    </Section>
    <Section position="2" start_page="395" end_page="396" type="sub_section">
      <SectionTitle>
4.2. Gaussian Shortlists
</SectionTitle>
      <Paragraph position="0"> We have developed a method for eliminating large numbers of Gaussians before they are computed. Our method is to build a &amp;quot;Gaussian shorflist&amp;quot; \[2\], \[4\], which uses vector quantization to subdivide the acoustic space into regions, and lists the Gaussians worth evaluating within each region. Applied to unclustered genonic recognition systems, this technique has allowed us to reduce by more than a factor of five the number of Gaussians considered each frame. Here we apply Gaussian shortlists to the clustered system described in Section 4.1. Several methods for generating improved, smaller Gaussian shorflists are discussed and applied to the same system.</Paragraph>
      <Paragraph position="1"> Table 7 shows the word error rates for shortlists generated by a variety of methods. Through a series of methods, we have reduced the average number of Gaussian distributions evaluated for each genone from 18 to 2.48 without compromising accuracy. The various shortlists tested were generated in the following ways: * None: No shortlist was used. This is the baseline case from the clustered system described above. All 18 Gaussians are evaluated whenever a genone is active.</Paragraph>
      <Paragraph position="2"> * 12D-256: Our original shortlist method was used. This method uses a cepstral vector quantization codebook (12dimensions, 256 codewords) to partition the acoustic space. With unclustered systems, this method generally achieves a 5:1 reduction in Gaussian computation. In this clustered system, only a 3:1 reduction was achieved, most likely because the savings from clustering and Gaussian shortlists overlap.</Paragraph>
      <Paragraph position="3"> * 39D-256: The cepstral codebook that partitions the acoustic space in the previous method ignores 27 of the 39 feature dimensions. By using a 39-dimensional, 256codeword VQ eodebook, we created better-differentiated acoustic regions, and reduced the average shortlist length to 4.93.</Paragraph>
      <Paragraph position="4"> * 39D-4096: We further decreased the number of Gaussians per region by shrinking the size of the regions. Here we used a single-feature VQ codebook with 4096 eodewords.</Paragraph>
      <Paragraph position="5">  For such a large codebook, vector quantizafion is accelerated using a binary tree VQ fastmateh.</Paragraph>
      <Paragraph position="6"> 39D-4096-minl: When generating a Gaussian shortlist, certain region/genone pairs with low probabilities are assigned very few or even no Ganssians densities. When we were using 48 Gaussians/genone, we found it important to ensure that each list contains a minimum of three Gaussian densities. With our current clustered systems we found that we can achieve similar recognition accuracy by ensuring only one Gaussian per list. As shown in Table 7, this technique results in lists with an average of 2.48 Gaussians per genone, without hurting accuracy.</Paragraph>
      <Paragraph position="7">  a variety of Gaussian shortlists Thus, with the methods in Sections 4.1 and 4.2, we have used clustering, retraining, and new Gaussian shortlist techniques to reduce computation from 48 to an average of 2.48 Gaussians per genone without affecting accuracy.</Paragraph>
    </Section>
    <Section position="3" start_page="396" end_page="396" type="sub_section">
      <SectionTitle>
4.3. Genonic Approximations
</SectionTitle>
      <Paragraph position="0"> We have successfuUy employed one other method for reducing Gaussian computation. For certain pairs of genones and acoustic regions, even the evaluation of one or two Gaussian distributions may be excessive. These are cases where the output probability is either very low or very uniform across an acoustic region. Here a uniform probability across the region (i.e., requiring no Gaussian evaluations) may be sufficient to model the output probability.</Paragraph>
      <Paragraph position="1"> To provide these regional flat probabilities, we implemented a discrete-density HMM, but one whose output probabilities were a region-by-region approximation of the probabilities of our genonic system. Since the two systems' outputs are calibrated, we can use them interchangeably, using a variety of criteria to decide which system's output to use for any given frame, state, acoustic region, or hypothesis. This technique, using variable resolution output models for HMMs is similar to what has been suggested by Alleva et al. \[1\].</Paragraph>
      <Paragraph position="2"> We train this genonic approximation by averaging, for each acoustic region, the output of each genone across a set of observations. The resulting system can be used either by itself or in combination with the continuous system from which it was trained.</Paragraph>
      <Paragraph position="3"> Table 8 shows the performance of the discrete approximate genone systems as a function of the number of regions used.</Paragraph>
      <Paragraph position="4">  Even with 16384 acoustic regions, the discrete genonic approximation has an error rate of 18.40%, compared with the baseline continuous system at 13.64%. However, when these discrete systems are used selectively in combination with a continuous genonie system, the results are more encouraging. Our most successful merger combines the 4096-region discrete approximation system (20.32% error) with the 39D-4096-minl genone system from Table 7 (13.50% error). In combining the two, instead of ensuring that a single Gaussian density was available for all shortlists, the genonic approximation was used for cases where no densities existed. In this way, we were able to eliminate another 25% of the Gaussian computations, reducing our lattice-based computation burden to 564 Gaussians per frame, with a word error of 13.36%.</Paragraph>
      <Paragraph position="5"> In summary, we started with a speech recognition system with 28,272 Gaussian distributions that computed 14,538 Gaussian distributions per frame and achieved a 13.43% word error rate running 12.2 times slower than real time on word lattices. Using the techniques described in Section 4, we have reduced the system's computational requirements to 564 Gaussians per frame, resulting in a system with word error of 13.36%, running at 2.0 times real time on our word lattices.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="396" end_page="397" type="metho">
    <SectionTitle>
5. MULTIPASS APPROACHES
</SectionTitle>
    <Paragraph position="0"> The techniques for improving the speed of single-pass speech recognition systems can be combined to achieve other speed/ accuracy trade-offs (e.g., trees using genone systems with reduced Gaussian computation rates). Furthermore, with multipass approaches \[8,9\] many of these techniques can be used independently as the different passes of the speech recognition system. For instance, a discrete density tree search may be used in a lattice building or a forward pass, and a Gaussian system may be used in the lattice and/or backward passes.</Paragraph>
    <Paragraph position="1"> We have performed preliminary evaluations of several of the tree-based systems presented in Section 3 to evaluate their performance as forward passes for a forward-backward search \[9\]. Preliminary results show that forward tree-based systems with 30% word error would add at most 3% to the word error rate of a fuU accuracy backward pass (i.e., at most increase the error rate from  approximately 10% to approximately 13%). More detail on this work wiU be presented at the HLT conference and will be included in the final version of this paper.</Paragraph>
  </Section>
class="xml-element"></Paper>