<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1078">
  <Title>TECHNIQUES TO ACHIEVE AN ACCURATE REAL-TIME LARGE-VOCABULARY SPEECH RECOGNITION SYSTEM</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
ABSTRACT
</SectionTitle>
    <Paragraph position="0"> In addressing the problem of achieving high-accuracy real-time speech recognition systems, we focus on recognizing speech from ARPA's 20,000-word Wall Street Journal (WSJ) task, using current UNIX workstations. We have found that our standard approach--using a narrow beam width in a Viterbi search for simple discrete-density hidden Markov models (HMMs)--works in real time with only very low accuracy. Our most accurate algorithms recognize speech many times slower than real time.</Paragraph>
    <Paragraph position="1"> Our (yet unattained) goal is to recognize speech in real time at or near full accuracy.</Paragraph>
    <Paragraph position="2"> We describe the speed/accuracy trade-offs associated with several techniques used in a one-pass speech recognition framework: * Trade-offs associated with reducing the acoustic modeling resolution of the HMMs (e.g., output-distribution type, number of parameters, cross-word modeling) * Trade-offs associated with using lexicon trees, and techniques for implementing full and partial bigram grammars with those trees * Computation of Gaussian probabilities, which is the most time-consuming aspect of our highest-accuracy system, and techniques that allow us to reduce the number of Gaussian probabilities computed with little or no impact on speech recognition accuracy.</Paragraph>
    <Paragraph position="3"> Our results show that tree-based modeling techniques used with appropriate acoustic modeling approaches achieve real-time performance on current UNIX workstations at about a 30% error rate for the WSJ task. The results also show that we can dramatically reduce the computational complexity of our more accurate but slower modeling alternatives so that they are near the speed necessary for real-time performance in a multipass search. Our near-future goal is to combine these two technologies so that real-time, high-accuracy large-vocabulary speech recognition can be achieved.</Paragraph>
    <Paragraph position="4"> (WSJ) speech corpus. All of the speed and performance data given in this paper are results of recognizing 169 sentences from the four male speakers that comprise ARPA's November 1992 20,000-word vocabulary evaluation set. Our best performance on these data is 8.9% (10.3% using bigram language models). Our standard implementation for this system would run approximately 100 times slower than real time. Both these systems use beam-search techniques for finding the highest-scoring recognition hypothesis.</Paragraph>
    <Paragraph position="5"> Our most accurate systems are those that use HMMs with genonic mixtures as observation distributions [3]. Genonic mixtures sample the continuum between fully continuous and tied-mixture HMMs at an arbitrary point and therefore can achieve optimum recognition performance given the available training data and computational resources. In brief, genonic systems are similar to fully continuous Gaussian-mixture HMMs, except that instead of each state having its own set of Gaussian densities, states are clustered into genones that share these Gaussian codebooks. Each state, however, can have its own set of mixture weights used with the Gaussian codebook to form its own unique observation distribution. All the genonic systems discussed in this paper use a single 39-dimensional observation composed of the speech cepstrum and its first and second derivatives, and the speech energy and its first and second derivatives. All Gaussians have diagonal covariance matrices.</Paragraph>
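    <Paragraph position="6"> As an illustrative sketch only, and not the implementation used in this paper, the genonic observation distribution described above can be written in plain Python. The function names, the toy two-dimensional codebook, and the mixture-weight values are all assumptions made for illustration; the systems discussed here use 39-dimensional observations.</Paragraph>

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    total = 0.0
    for xi, mi, vi in zip(x, mean, var):
        total += math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi
    return -0.5 * total


def codebook_log_densities(x, codebook):
    """Evaluate every Gaussian of one genone codebook at x, once per frame."""
    return [diag_gaussian_logpdf(x, mean, var) for mean, var in codebook]


def state_log_likelihood(log_densities, weights):
    """Log of sum_k weights[k] * exp(log_densities[k]), computed stably."""
    terms = [math.log(w) + d for w, d in zip(weights, log_densities) if w > 0.0]
    if not terms:
        return -math.inf
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))


# Hypothetical toy genone: a codebook of two Gaussians, each a (mean, variance) pair.
GENONE_CODEBOOK = [([0.0, 0.0], [1.0, 1.0]), ([3.0, 3.0], [1.0, 1.0])]

# Two states clustered into the same genone; each keeps its own mixture weights.
STATE_WEIGHTS = {"state_a": [0.9, 0.1], "state_b": [0.2, 0.8]}

frame = [0.0, 0.0]
shared = codebook_log_densities(frame, GENONE_CODEBOOK)  # computed once per genone
scores = {name: state_log_likelihood(shared, w) for name, w in STATE_WEIGHTS.items()}
```

    <Paragraph position="7"> In this sketch the Gaussian log densities are evaluated once per genone and frame and then reused by every state in that genone, so only the per-state mixture weights differ. That reuse is a design choice of the example, shown because the states share one codebook.</Paragraph>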
  </Section>
</Paper>