<?xml version="1.0" standalone="yes"?>
<Paper uid="H90-1016">
  <Title>Toward a Real-Time Spoken Language System Using Commercial Hardware</Title>
  <Section position="4" start_page="72" end_page="74" type="metho">
    <SectionTitle>
3. Time-synchronous Forward Search vs
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="72" end_page="74" type="sub_section">
      <SectionTitle>
Viterbi
</SectionTitle>
      <Paragraph position="0"> The search algorithm that is most commonly used is the Viterbi algorithm. This algorithm has nice properties in that it can proceed in real time in a time-synchronous manner, is quite amenable to the beam-search pruning algorithm \[4\], and is also relatively easy to implement on a parallel processor. Another advantage is that it only requires compares and adds (ff we use log probabilities). Unfortunately, the Viterbi algorithm finds the most likely sequence of states rather than the most likely sequence of words.</Paragraph>
      <Paragraph position="1"> To correctly compute the probability of any particular sequence of words requires that we add the probabilities of all possible state sequences for those words. This can be done with the &amp;quot;forward pass&amp;quot; of the forward-backward training algorithm. The only difference between the Viterbi scoring and the Forward-pass computation is that we add the probabilities of different theories coming to a state rather than taking the maximum.</Paragraph>
      <Paragraph position="2"> We presented a search algorithm in 1985 \[7\] that embodied most of this effect. Basically, within words we add probabilities, while between words we take the maximum. It was not proven at that time how much better, if any, this algorithm was than the simpler Viterbi algorithm, and whether it was as good as the strictly correct algorithm that computes  mar with zero-state requires many fewer null arcs.</Paragraph>
      <Paragraph position="3"> the score of each hypothesis independently.</Paragraph>
      <Paragraph position="4"> When we compared these two algorithms under several conditions, we found that there was a consistent advantage for adding the probabilities within the word. For example, when we use the class grammar, we find that the word error rate decreases from 8% to 6%.</Paragraph>
      <Paragraph position="5"> To be sure that the time-synchronous forward search gives us the same performance as the ideal forward score is somewhat more complicated. We must guarantee that we have found the highest scoring sentence with the true forward probability score. One way to find this is to use the exact N-Best algorithm \[2\]. Since the exact N-Best algorithm separates the computation for any two different hypotheses, the scores that result are, in fact, the correct forward probabilities, as long as we set N to a large enough value. A second, much simpler way to verify the time-synchronous algorithm is to see if it ever gets a wrong answer that scores worse than the correct answer. We ran a test in which all incorrect answers were rescored individually using the forward probability. We compared these scores to the forward probability for the correct answer. In no case (out of 300 sentences) did the time-synchronous forward search ever produce a wrong answer that, in fact, scored worse than the correct answer.</Paragraph>
      <Paragraph position="6"> The reason that this whole discussion about the Viterbi algorithm is relevant here is that the Viterbi algorithm is faster than the forward search. Therefore, we use the integer Viterbi algorithm in the forward-pass of the Forward-Backward Search. Since the function of the forward-pass is primarily to say which words are likely, it is not essential that we get the best possible answer. The backward N-Best search is then done using the better-performing algorithm that adds different state-sequence probabilities for the same  word sequence.</Paragraph>
      <Paragraph position="7"> 4. Speed and Accuracy  When we started this effort in January, 1990, our unoptimized time-synchronous forward search algorithm took about 30 times real time for recognition with the WP grammar and a beamwidth set to avoid pruning errors. The class grammar required 10 times more computation. The exact N-Best algorithm required about 3,000 times real time to find the best 20 answers. When we required the best 100 answers, the program required about 10,000 times real time. Since January we have implemented several algoritiams, optimized the code, and used the Intel 860 board to speed up the processing. The N-Best pass now runs in about 1/2 real time. Below we give each of these methods along with the factor of speed gained.</Paragraph>
      <Paragraph position="8">  As can be seen, the three algorithmic changes accounted for a factor of 1,000, while the code optimization and faster processor accounted for a factor of 20. We expect any additional large factors in speed to come from algorithmic changes. When the VLSI HMM processor becomes available, the speed of the HMM part of the problem will increase considerably, and the bottleneck will be in the language model processor. We estimate that the language model computation accounts for about one third of the total computation. null Our current plan is to increase the speed as necessary and complete the integration with the natural language understanding and application backend by September, 1990.</Paragraph>
      <Paragraph position="9"> Accuracy It is relatively easy to achieve real time if we relax our goals for accuracy. For example, we could simply reduce the pruning beamwidth in the beam search and we know that the program speeds up tremendously. However, if we reduce the beamwidth too much, we begin to incur search errors. That is, the answer that we find is not, in fact, the highest scoring answer. There are also several algorithms that we could use that require less computation but increase the error rate. While some tradeoffs are reasonable, it is important that any discussion of real-time computation be accompanied by a statement of the accuracy relative to the best possible conditions.</Paragraph>
      <Paragraph position="10"> In Table I below we show the recognition accuracy results under several different conditions. All results use speaker-dependent models and are tested on the 300 sentences in the June '88 test set. For each condition we state whether the forward pass would mn in less than real time on the SkyBolt for more than 80% of the sentences -- which is basically a function of the pruning beamwidth. The backward pass currently runs in less than 1/2 real time, and we expect it will get faster. We don't yet have a good feeling for how much delay will be tolerable, but our goal is for the delay in computing the N Best sentences to be shorter than the time needed for natural language to process the first sentence, or about 1/2 second. The accuracy runs were done on the SUN 4/280. Based on our speed measurements, we assume that anything that runs in under five times real-time on the SUN  Best algorithm compared with the best non-real-lime conditions. null For each condition we give the word error, I-Best sentence error, and N-Best sentence error for N of 20 and 100. &amp;quot;N-Best sentence error&amp;quot; Results are given for the Word-Pair (WP) grammar and for the Class (CG) Grammar. The conditions WP-XW and CG-XW were done using cross-word triphone models that span across words and have been smoothed with the triphone cooccurence smoothing. These conditions were only decoded with the 1-Best forward-search algorithm, and so produced only word error statistics for reference. The models that do not use cross-word triphones also do not use triphone cooccurence smoothing. Since the forward pass is done using the Viterbi algorithm, this affects the word error rate and the 1-Best sentence error rate, which are measured from the forward pass only.</Paragraph>
      <Paragraph position="11"> Currently we have not run the cross-word models with the N-Best algorithm. These models require more memory than is available on the board, and the computation required in the forward pass is too large. We intend to solve this by using the cross-word models only in the backward direction. Another alternative would be to use the cross-word models to rescore all of the N-Best hypotheses, which could be done relatively quickly. In any case, we decided to make the system work with cross-word models only after we had achieved real time with simpler non-cross-word models.</Paragraph>
      <Paragraph position="12"> As we can see, the results using the WP grammar are quite good. Even without the cross-word models, we find the correct sentence 97.6% of the time within the first 20 choices and 98% of the time within the first 100 choices.</Paragraph>
      <Paragraph position="13"> When we use a beamwidth that gives us real time, we see only a very slight degradation in accuracy. However, as we stated earlier in this paper, the WP grammar is unrealistically easy, both in terms of recognition accuracy and computation. We show these results only for comparison with other real-time recognition results on the RM corpus.</Paragraph>
      <Paragraph position="14"> Recognition with the class grammar is much harder due to higher perplexity and the fact that all words are possible at any time. The word error with cross-word models is 4.7%.</Paragraph>
      <Paragraph position="15"> For the N-Best conditions with the CG grammar we note a larger difference between the sentence errors at 20 and 100 choices. In contrast to the WP grammar in which there are a limited number of possibilities that can match well, here more sequences are plausible. We give the N-Best results for three different speed conditions. The first has a very conservative beamwidth. The second runs at 1.2 times realtime, and the third runs faster than real time. We can see that there is a significant degradation due to pruning errors when we force the system to run in real time.</Paragraph>
      <Paragraph position="16"> There are several approaches that are available to speed up the forward pass considerably. Since the forward pass is used for pruning, it is not essential that we achieve the highest accuracy. In those rare cases where the N-Best finds a different top choice sentence than the forward pass, and this new top choice also is accepted by natural language, we will simply have a delay equal to the time taken for the N-Best backward search. The most promising method for speeding up the forward search is to use a phonetic tree in which the common word beginnings are shared. Since most of the words are pruned out after one or two phonemes, much of the computation is eliminated.</Paragraph>
      <Paragraph position="17"> Conclusion We have achieved real-time recognition of the N-Best sentences on a commercially available board. When we use a WP grammar, there is no loss in accuracy due to real-time limitations. However, currently, when using a class grammar there is a degradation. We expect this degradation to be reduced as planned algorithm impruverr~mts are implemented. null Most of the increase in speed came from algorithm modifications rather than from fast hardware or low-level coding enhancements, although the latter improvements were substantial and necessary. All the code is written in C so there is no machine dependence. All told we sped up the N-Best computations by a factor of 20,000 with a combination of algorithms, code optimization, and faster hardware.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>