<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1017">
  <Title>The Dragon Continuous Speech Recognition System: A Real-Time Implementation</Title>
  <Section position="3" start_page="0" end_page="78" type="metho">
    <SectionTitle>
2. System Description
</SectionTitle>
    <Paragraph position="0"> The architecture of the continuous speech recognition system is shown in Figure 1. The various components of this system are described below.</Paragraph>
    <Section position="1" start_page="0" end_page="78" type="sub_section">
      <SectionTitle>
2.1 Signal Processing
</SectionTitle>
      <Paragraph position="0"> A TMS32010-based board that plugs into a slot on the AT-bus performs analog-to-digital conversion and digital signal processing of the input speech waveform, and extracts the spectral features used in recognition. Input speech is sampled at 12 kHz and lowpass filtered at 6 kHz. Eight spectral parameters are computed every 20 milliseconds and used as input to the HMM-based recognizer.</Paragraph>
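      <Paragraph> As an illustration of this front end, the sketch below frames a 12 kHz signal into 20 ms windows and pools a naive DFT into eight log band energies. The paper does not specify which eight spectral parameters are computed, so the filter-bank scheme here is purely hypothetical.

```python
import math

SAMPLE_RATE = 12_000   # 12 kHz sampling rate (from the paper)
FRAME_MS = 20          # one parameter vector every 20 ms
N_PARAMS = 8           # eight spectral parameters per frame

def frame_signal(samples, rate=SAMPLE_RATE, frame_ms=FRAME_MS):
    """Split a waveform into non-overlapping 20 ms frames (240 samples each)."""
    n = rate * frame_ms // 1000
    return [samples[i:i + n] for i in range(0, len(samples) - n + 1, n)]

def band_energies(frame, n_bands=N_PARAMS):
    """Hypothetical front end: pool a naive DFT power spectrum
    (0 Hz up to the 6 kHz Nyquist limit) into n_bands log band energies."""
    n = len(frame)
    half = n // 2
    power = []
    for k in range(half):
        re = sum(x * math.cos(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        im = sum(x * math.sin(2 * math.pi * k * t / n) for t, x in enumerate(frame))
        power.append(re * re + im * im)
    per_band = half // n_bands
    return [math.log(sum(power[b * per_band:(b + 1) * per_band]) + 1e-10)
            for b in range(n_bands)]

# A 1 kHz tone: its energy lands in band 1 (bins covering 750-1450 Hz).
tone = [math.sin(2 * math.pi * 1000 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE)]
features = [band_energies(f) for f in frame_signal(tone)[:3]]
```
</Paragraph>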
    </Section>
    <Section position="2" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
2.2 Recognition
</SectionTitle>
      <Paragraph position="0"> The recognition search for the most likely sentence hypothesis is based on the time-synchronous decoding algorithm used in almost all current CSR systems for this vocabulary size. In this algorithm, partial paths (representing incomplete sentential hypotheses) are extended synchronously using dynamic programming (DP); all paths span the same length of the input signal, so their path cost functions are directly comparable. To reduce the recognition search, a beam pruning technique eliminates all paths that score poorly relative to the best path and therefore have very low probability of being the globally best hypothesis spanning the entire utterance. We also explored another family of speech decoding algorithms, the stack decoder \[1\], in our recognizer. Our conclusion at this time is that, at least for a task of this complexity, time-synchronous algorithms are considerably more efficient for finding the single most likely answer.</Paragraph>
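      <Paragraph> The time-synchronous DP extension with beam pruning can be sketched as follows; the toy two-state model and scores are illustrative, not the system's actual HMMs.

```python
import math

def time_synchronous_decode(frames, states, trans, emit_logp, beam=10.0):
    """Time-synchronous beam search: every partial path spans the same
    number of frames, so path costs are directly comparable.  Paths whose
    log probability falls more than `beam` below the best are pruned."""
    active = {s: (0.0, [s]) for s in states}      # state -> (log prob, path)
    for f in frames:
        extended = {}
        for s, (lp, path) in active.items():
            for nxt, t_lp in trans[s]:            # DP extension of each path
                cand = lp + t_lp + emit_logp(nxt, f)
                if nxt not in extended or cand > extended[nxt][0]:
                    extended[nxt] = (cand, path + [nxt])
        best = max(lp for lp, _ in extended.values())
        active = {s: v for s, v in extended.items() if v[0] >= best - beam}
    return max(active.values())                   # best-scoring full path

# Toy two-state example (an illustration, not the paper's HMM topology).
trans = {'A': [('A', math.log(0.8)), ('B', math.log(0.2))],
         'B': [('B', math.log(0.9)), ('A', math.log(0.1))]}
emit = lambda s, f: math.log(0.9 if (s == 'A') == (f == 'a') else 0.1)
score, path = time_synchronous_decode(['a', 'a', 'b'], ['A', 'B'], trans, emit)
```
</Paragraph>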
    </Section>
    <Section position="3" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
2.3 Rapid Matcher
</SectionTitle>
      <Paragraph position="0"> An important component of the recognition search is the Rapid Matcher. In the time-synchronous decoding scheme, the Rapid Matcher helps reduce the search space dramatically by proposing to the HMM DP matcher at any given frame only a relatively small number of word candidates that are likely to start at that frame. Only the words on this rapid match list (rather than the entire vocabulary) are considered for seeding a DP match. Since the Rapid Matcher is designed to require considerably less computation than the DP Matcher, the combined rapid match/DP match recognition architecture yields an order-of-magnitude savings in computation, with minimal loss in recognition accuracy. The rapid match algorithm is described in detail in \[2\].</Paragraph>
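      <Paragraph> A minimal sketch of this two-pass architecture, with hypothetical stand-in scorers (the actual rapid match algorithm is given in \[2\]):

```python
def rapid_match(frame, vocabulary, cheap_score, top_n=50):
    """Cheap first pass: shortlist the words most likely to start at this
    frame, so that the expensive DP matcher seeds only these candidates."""
    return sorted(vocabulary, key=lambda w: cheap_score(w, frame),
                  reverse=True)[:top_n]

def recognize_frame(frame, vocabulary, cheap_score, dp_match, top_n=50):
    """Combined rapid match / DP match: DP work shrinks from the full
    vocabulary to the shortlist, roughly an order-of-magnitude saving."""
    shortlist = rapid_match(frame, vocabulary, cheap_score, top_n)
    return {w: dp_match(w, frame) for w in shortlist}

# Illustrative scorers over a dummy 842-word vocabulary.
vocab = [f"word{i}" for i in range(842)]
cheap = lambda w, f: -abs(hash(w) % 100 - f)   # stand-in acoustic score
dp = lambda w, f: cheap(w, f) * 2              # stand-in detailed DP score
scores = recognize_frame(42, vocab, cheap, dp)
```
</Paragraph>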
    </Section>
    <Section position="4" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
2.4 Training of Acoustic Models
</SectionTitle>
      <Paragraph position="0"> The research goal at Dragon is to build CSR systems for large vocabulary natural language tasks. As such, it is deemed impractical to use whole-word models to model the words in the vocabulary for recognition since in such a system, one must have training tokens (in different acoustic contexts) for every word in the vocabulary. Our solution then is to make extensive use of phonetic modeling for recognition.</Paragraph>
      <Paragraph position="1"> In general, the goal of acoustic modeling is to ensure that when acoustic units, whatever they may be, are strung together according to the transcription of an utterance to generate a sequence of spectra, that sequence fairly accurately represents the actual sequence of speech spectra for the utterance. Toward this goal, we have chosen as the fundamental unit to be trained the &amp;quot;phoneme-in-context&amp;quot; (PIC), proposed in \[3\]. In the present implementation, a PIC is completely specified by a phoneme accompanied by a preceding phoneme (or silence), a succeeding phoneme (or silence), a stress level, and a duration code that indicates the degree of prepausal lengthening. To restrict the proliferation of PICs, syllable boundaries, and even word boundaries, are currently ignored.</Paragraph>
      <Paragraph position="2"> During training, tokens are phonemically labeled by a semi-automatic procedure using hidden Markov models in which each phoneme is modeled as a sequence of one to six states. A model for a phoneme in a specific context is constructed by interpolating models involving the desired context and acoustically similar contexts.</Paragraph>
      <Paragraph position="3"> As each word in the vocabulary is spelled in terms of PICs, each PIC in turn is spelled in terms of allophonic acoustic segments, or clusters. An acoustic cluster consists of a mean vector and a variance vector. The construction of these clusters is done in a semi-supervised manner. Currently, the total number of acoustic clusters required to construct all PICs is only slightly more than 2000. As a result, the entire set of PICs can be adapted to a new user on the basis of a couple of thousand words of speech data.</Paragraph>
      <Paragraph position="4"> With this approach to acoustic modeling, we are able to model words reasonably well acoustically while maintaining to a large extent the desirable property of task-independence. By using different phonetic dictionaries (that make up words for each task), we have constructed models for a 30,000-word isolated word recognizer as well as for four different continuous speech tasks. Details of Dragon's acoustic modeling process can be found in \[4\].</Paragraph>
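      <Paragraph> The PIC representation described above can be sketched as follows; the phoneme symbols, stress values, and silence padding at the word edges are illustrative assumptions, not the system's actual inventory.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PIC:
    """Phoneme-in-context: a phoneme with its immediate neighbours,
    a stress level, and a prepausal-lengthening duration code."""
    left: str      # preceding phoneme, or '-' for silence
    phoneme: str
    right: str     # succeeding phoneme, or '-' for silence
    stress: int
    duration: int

def word_to_pics(phonemes, stress, duration=0):
    """Spell a word as PICs.  Word edges are padded with silence here;
    otherwise boundaries get no special treatment, mirroring the paper's
    choice to ignore syllable and word boundaries."""
    padded = ['-'] + list(phonemes) + ['-']
    return [PIC(padded[i - 1], padded[i], padded[i + 1], stress[i - 1], duration)
            for i in range(1, len(padded) - 1)]

# Hypothetical phoneme spelling of "cat"; the symbols are illustrative.
pics = word_to_pics(['k', 'ae', 't'], stress=[0, 1, 0])
```
</Paragraph>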
    </Section>
    <Section position="5" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
2.5 Task Description
</SectionTitle>
      <Paragraph position="0"> The Dragon application task consists of recognizing mammography reports. All the training and test material for this task has been extracted from a database of 1.3 million words of mammography text. This text corpus forms part of a 38.2 million word database of radiology text. Much of this text represents actual transcriptions of spoken reports.</Paragraph>
      <Paragraph position="1"> All of the tests described here were performed with an 842-word subvocabulary. Punctuation marks, digits, and letters of the alphabet were explicitly excluded. This vocabulary covers about 75% of the full mammography database, and 92% of the database without the excluded words. 6000 sentences (or sentence fragments) containing only these vocabulary words were extracted from the mammography database. Half of these sentences were used for training, and the other half was set aside for testing.</Paragraph>
    </Section>
    <Section position="6" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
2.6 Recognition Performance
</SectionTitle>
      <Paragraph position="0"> Using the system described above, we have obtained preliminary continuous speech recognition results for an 842-word mammography report task, a subset of a full radiology report task. A partial bigram language model was constructed from 40M words of radiology reports, 1M of which was specific to mammography. The bigram language model consisted of unigrams together with common bigrams and uncommon bigrams of common words. The perplexity of this task as measured on a set of 3000 sentences is 66.</Paragraph>
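      <Paragraph> The perplexity figure can be related to the language model as follows; the backoff scheme in this sketch is an assumption, since the paper does not detail how words outside the stored bigrams are scored.

```python
import math

def bigram_logprob(sentence, bigrams, unigrams, backoff=0.4):
    """Log2 probability under a partial bigram model: use a stored bigram
    when one exists, otherwise back off to a scaled unigram.  (This backoff
    scheme is illustrative; the paper does not spell out its own.)"""
    lp = 0.0
    for prev, word in zip(['<s>'] + sentence, sentence + ['</s>']):
        p = bigrams.get((prev, word)) or backoff * unigrams.get(word, 1e-6)
        lp += math.log2(p)
    return lp

def perplexity(sentences, bigrams, unigrams):
    """Perplexity = 2 ** (average negative log2 probability per word)."""
    total = sum(bigram_logprob(s, bigrams, unigrams) for s in sentences)
    n_words = sum(len(s) + 1 for s in sentences)   # +1 for the end marker
    return 2 ** (-total / n_words)

# Tiny sanity check: a one-word language with probability 1/2 per step
# has perplexity exactly 2.
ppl = perplexity([['a']], {('<s>', 'a'): 0.5, ('a', '</s>'): 0.5}, {'a': 0.5})
```
</Paragraph>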
      <Paragraph position="1"> The result was measured on a single speaker, using 1000 test utterances totaling 8571 words. The total number of word errors was 293 (3.4% word error rate), with 205 substitutions, 62 insertions, and 26 deletions. The sentence error rate was 19.5%. The average number of words returned from the Rapid Matcher (per frame) was 48.</Paragraph>
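      <Paragraph> The reported error counts are internally consistent, as the standard word-error-rate arithmetic shows:

```python
# Figures reported above for the single-speaker test set.
subs, ins, dels, n_words = 205, 62, 26, 8571

errors = subs + ins + dels        # substitutions + insertions + deletions
wer = 100.0 * errors / n_words    # word error rate in percent
```
</Paragraph>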
      <Paragraph position="2"> A sample of the test sentences and associated recognition errors made are shown below.</Paragraph>
      <Paragraph position="3"> Reference:  The patient returns for additional views for further evaluation
Recognized: The patient returns for additional view is for further evaluation
We will be evaluating the system on several speakers. In addition, we are working on improving recognition performance, and we have very specific ideas about how that can be done.</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="78" end_page="80" type="metho">
    <SectionTitle>
3. Real-time Implementation
</SectionTitle>
    <Paragraph position="0"> Our strategy in developing a prototype real-time continuous speech recognition system on the PC is to use a multitude of approaches to solve the computational problem. Since one of our primary concerns is software portability, extensive rewriting in assembly code is kept to a minimum. Instead, we keep almost all of the system written in C and rely mostly on algorithm and hardware improvements to achieve real-time performance. Software optimizations include the use of a rapid match algorithm to reduce the recognition search space, C code optimization, and assembly coding of a few compute-intensive routines.</Paragraph>
    <Paragraph position="1"> For hardware, we rely on both faster machines (e.g., 486-based PCs) and additional hardware (off-the-shelf boards) serving as compute engines for the PC.</Paragraph>
    <Section position="1" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
3.1 Algorithms/Software Implementations
Rapid match
</SectionTitle>
      <Paragraph position="0"> The single most important factor in achieving a real-time implementation is the use of rapid match to reduce computation during recognition. As described earlier, rapid match is used to compute a relatively short list of word candidates that are likely to start at a given frame of the speech input. Thus, instead of seeding the entire vocabulary (or close to it), only those words returned by the Rapid Matcher are seeded.</Paragraph>
      <Paragraph position="1"> Profile and optimize in C In addition, we invested in profiling the recognition program to obtain a report of the time spent in each routine, sorted in decreasing order, so that the first routine in the profiling report is the most time-consuming one. Then, where possible, this routine (or parts of it) is rewritten with efficiency as the objective. This is done for the top few routines on the list (which usually account for a significant percentage of the total computation). The entire procedure is then repeated.</Paragraph>
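      <Paragraph> The profile-and-optimize loop generalizes beyond C. As an illustration, the same procedure in Python using the standard cProfile module (the hot routine below is a stand-in for whatever the profile uncovers):

```python
import cProfile
import io
import pstats

def hot_routine(n):
    """Stand-in for the most expensive routine the profile uncovers."""
    return sum(i * i for i in range(n))

def run():
    for _ in range(50):
        hot_routine(10_000)

# Collect a profile of one recognition-like run.
profiler = cProfile.Profile()
profiler.enable()
run()
profiler.disable()

# Report routines sorted by cumulative time, most expensive first --
# the top few entries are the rewrite candidates.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats('cumulative').print_stats(5)
report = buf.getvalue()
```
</Paragraph>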
      <Paragraph position="2"> Assembly language code Once in a while, as deemed necessary and appropriate, an entire C routine is rewritten in assembly code. Currently, only a few routines have been rewritten this way, all of them routines of the Rapid Matcher.</Paragraph>
    </Section>
    <Section position="2" start_page="78" end_page="78" type="sub_section">
      <SectionTitle>
3.2 Hardware Implementations
</SectionTitle>
      <Paragraph position="0"> A second part of our strategy is to let advances in the technology of manufacturing PCs help in solving the computation problem in continuous speech recognition.</Paragraph>
      <Paragraph position="1"> Already, we have witnessed an order-of-magnitude increase in the computational power of the personal computer within the last decade (from the AT running at an 8 MHz clock rate to the 386 at 33 MHz). Starting off this decade, the Intel 486-based family of PCs that has just been introduced is a factor of two faster than its immediate predecessors (the 386-based machines) at a fixed clock speed of 33 MHz (see Table 1). This trend is certain to continue, at least for the foreseeable future. Our recognizer sped up by almost a factor of two just by going from a 386/33 to a 486/33, without any modification to the code (see Table 2). In fact, since the 486 instruction set is backward compatible, the exact same executable code that ran on the 386 also ran on the 486. At this rate, real-time very large vocabulary (&gt; 10,000 words) continuous speech recognition on the PC is within reach in the not too distant future.</Paragraph>
    </Section>
    <Section position="3" start_page="78" end_page="80" type="sub_section">
      <SectionTitle>
3.3 Parallel Architecture
</SectionTitle>
      <Paragraph position="0"> We also explored the use of a single (but expandable to multiple) off-the-shelf board (a 29K-based coprocessor board) serving as a compute engine for the PC and performing the computation in parallel (coarse-grain two-way parallelism).</Paragraph>
      <Paragraph position="1"> The board of our choice was an AMD 29000-based board (the AT-Super, made by YARC) that plugs directly into the AT-bus on the backplane of the PC. The board is quoted at 17 MIPS, although our benchmark running the recognizer on the board revealed a somewhat lower MIPS figure (see Table 1). The board also came with software for developing programs that perform parallel computation.</Paragraph>
      <Paragraph position="2"> In analyzing the computational requirements of the various components of our algorithm, it was immediately apparent that a natural way to divide up the algorithm is to run the DP Matcher and the Rapid Matcher on separate processors, for the following reasons. First, the two components are functionally and logically separable, making parallelization fairly straightforward. Second, it makes sense from the point of view of the hardware benchmarks (the two processors deliver an equivalent number of MIPS), since the two recognition components consume numbers of CPU cycles within a factor of two of each other. Lastly, the communication bandwidth is low (on the order of 5K bytes/sec), so little overhead is incurred. In the next section, we present results using two alternative ways of mapping the component algorithms onto the two processors.</Paragraph>
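      <Paragraph> The two-way split can be sketched as a producer/consumer pair exchanging only small per-frame shortlists, mirroring the low communication bandwidth noted above; the scorers and vocabulary below are stand-ins.

```python
import queue
import threading

def rapid_matcher(frames, out_q, vocabulary):
    """Producer (one processor): shortlist likely words for each frame.
    The length-based score is a placeholder for the real rapid match."""
    for i, frame in enumerate(frames):
        shortlist = sorted(vocabulary, key=lambda w: abs(len(w) - frame))[:3]
        out_q.put((i, shortlist))      # small message: low bandwidth
    out_q.put(None)                    # end-of-utterance marker

def dp_matcher(in_q, results):
    """Consumer (other processor): expensive DP match over each shortlist."""
    while (item := in_q.get()) is not None:
        i, shortlist = item
        results[i] = max(shortlist)    # placeholder for the real DP score

frames = [3, 5, 4]                     # stand-in per-frame features
vocab = ["cyst", "spine", "mass", "benign", "lesion"]
q, results = queue.Queue(), {}
t1 = threading.Thread(target=rapid_matcher, args=(frames, q, vocab))
t2 = threading.Thread(target=dp_matcher, args=(q, results))
t1.start(); t2.start(); t1.join(); t2.join()
```
</Paragraph>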
    </Section>
    <Section position="4" start_page="80" end_page="80" type="sub_section">
      <SectionTitle>
3.4 Recognition Benchmarks
</SectionTitle>
      <Paragraph position="0"> Table 2 shows the recognition benchmarks (measured in number of times real time) using the various hardware platforms. As can be seen, using a baseline 386 PC, we are at 2.8 X real time. Using a combined 386+29K architecture, and putting the Rapid Matcher on the host and DP Matcher on the 29K (RM/DM) gave us more than a factor of two improvement (to 1.3 X).</Paragraph>
      <Paragraph position="1"> Alternatively, going to a faster machine (a 486-based PC) immediately gave us almost a factor of two relative to the 386. However, the combined 486+29K architecture, though putting us very close to real time (1.1 X), did not provide a significant gain over the 386+29K platform, because the 29K board, performing the DP match, had become the computational bottleneck. Also, the alternative software architecture of performing the DP match on the host and the rapid match on the 29K board (DM/RM) resulted in worse computational performance. This is largely explained by the fact that moving the rapid match to the 29K forfeited the computational gain from the assembly coding (done for the 386) of some rapid match routines.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="80" end_page="80" type="metho">
    <SectionTitle>
3.5 Discussion
</SectionTitle>
    <Paragraph position="0"> Table 3 shows how real-time recognition on the PC was achieved. As noted previously, the use of rapid match to reduce the recognition search was the single most important factor in achieving real time: an order-of-magnitude reduction in computation was realized with this algorithm. Rewriting C code with runtime efficiency in mind and assembly coding some time-critical rapid match routines yielded speedups by factors of 2 and 1.5, respectively. Finally, making use of more MIPS (either a 486-based PC or a single coprocessor board) gave an additional factor of two to three, depending on the exact hardware platform. In short, by combining algorithm improvements, software optimizations, and enhanced hardware capabilities, a 3-second utterance that initially required nearly three minutes to decode (60 X real time) can now be decoded in real time.</Paragraph>
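    <Paragraph> Multiplying the reported factors confirms the claim (taking the lower end of the two-to-three hardware range):

```python
rapid_match_gain = 10   # order-of-magnitude search reduction
c_optimization = 2      # C rewrites of the hottest routines
assembly_gain = 1.5     # assembly coding of rapid match routines
hardware_gain = 2       # 486 or coprocessor board (2-3 in the paper)

# 10 * 2 * 1.5 * 2 = 60: exactly the 60 X real-time starting point,
# so the combined optimizations bring decoding to real time.
total = rapid_match_gain * c_optimization * assembly_gain * hardware_gain
```
</Paragraph>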
  </Section>
</Paper>