<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1082">
  <Title>An Efficient A* Stack Decoder Algorithm for Continuous Speech Recognition with a Stochastic Language Model*</Title>
  <Section position="3" start_page="0" end_page="405" type="metho">
    <SectionTitle>
THE BASIC STACK DECODER
</SectionTitle>
    <Paragraph position="0"> The stack decoder \[8\], as used in speech, is an implementation of a best-first tree search. The basic operation of a sentence decoder is as follows \[2,5\]:  1. Initialize the stack with a null theory.</Paragraph>
    <Paragraph position="1"> 2. Pop the best (highest scoring) theory off the stack. 3. if(end-of-sentence) output the sentence and terminate. null 4. Perform acoustic and language-model fast matches to obtain a short list of candidate word extensions of the theory.</Paragraph>
    <Paragraph position="2"> 5. For each word on the candidate list: (a) Perform acoustic and language-model detailed matches to compute the new theory output loglikelihood. null i. if(not end-of-sentence) insert into the stack.</Paragraph>
    <Paragraph position="3"> ii. if(end-of-sentence) insert into the stack with end-of-sentence flag = TRUE.</Paragraph>
    <Paragraph position="4"> 6. Go to 2.</Paragraph>
    <Paragraph position="5">  The fast matches \[4,5,7\] are computationally cheap methods for reducing the number of word extensions which must be checked by the more accurate, but computationally expensive detailed matches. 1 (The fast matches may also be considered a predictive component for the detailed matches.) Top-N (N-best) mode is achieved by delaying termination until N sentences have been output.</Paragraph>
    <Paragraph position="6"> 1 The following discussion concerns the basic stack decoder and therefore it will be assumed that the correct word will always be on the fast match list. This can be guaranteed by the scheme outlined in reference \[5\].</Paragraph>
    <Paragraph position="7">  The stack itself is just a sorted list which supports the following operations: pop the best entry and insert new entries according to their scores. The following items must be contained in the ith stack entry:  1. a stack score: StSci 2. a reference time: t_refl 3. a word history i: (path or theory identification) 4. an output log-likelihood distribution: Li(t) 5. an end-of-sentence flag</Paragraph>
  </Section>
  <Section position="4" start_page="405" end_page="406" type="metho">
    <SectionTitle>
THE A* STACK CRITERION
</SectionTitle>
    <Paragraph position="0"> A key issue in the stack decoder is deciding which theory should be popped from the stack to be extended. This is decided by the stack score and the reference time. (All scores used here are log-likelihoods or log-probabilities.) The near-optimal A* criterion \[11\] used here is the difference between the actual log-likelihood of reaching a point in time on a path and a least upper bound on the log-likelihood of any path reaching that point in time:</Paragraph>
    <Paragraph position="2"> where Ai(t) is the A* scoring function, Li(t) is the output log-likelihood, t denotes time, i denotes the path (tree branch or left sentence fragment) and lubL(f) is the least upper bound on Li(f). (This criterion is derived in the appendix.) In order to sort the stack entries, it is necessary to reduce the Ai(t) to a single number (the stack score): StSci =max Ai(f). (3) , It is also convenient at this point to define the minimum time which satisfies equation 3:</Paragraph>
    <Paragraph position="4"> for an appropriately chosen value for a.</Paragraph>
    <Paragraph position="5"> A STACK DECODER FOR CSR WITH A UNIGRAM LANGUAGE MODEL It is not possible to compute the exact least upper bound on the theory likelihoods without first performing the recognition. It is, however, possible to compute the least-upper-bound-so-far (lubsf) on the likelihoods that have already been computed, which requires negligible computation and is sufficient to perform the near-optimal A* search. This creates two difficulties:  1. Since lubL(f) = lubsfL(t) can change as the theories are evaluated, the stack order can also change.</Paragraph>
    <Paragraph position="6"> 2. A degeneracy in determining the best path by SfSc  alone can occur since lubsfL(t) can equal Li(t) for more than one i (path) at different times.</Paragraph>
    <Paragraph position="7"> Problem 1 is easily cured by reevaluating the stack scores StSc every time lubsfL(t) is updated and reorganizing the stack. This is easily accomplished if the stack is stored as a heap \[10\].</Paragraph>
    <Paragraph position="8"> Problem 2 occurs because different theories may dominate different parts of the current upper bound. Thus all of these theories will have a score of zero. The cure is to extend the shortest theory (minimum t_min) which has a stack score equal to the best. If f_refi = f-mini, this can be accomplished by performing a major sort on the stack score StSc and a minor sort on the reference time f_re f .</Paragraph>
    <Paragraph position="9"> This guarantees that lubsfL(t) = lubL(f) for t &lt; t_refp (where p denotes the theory which is about to be popped) and therefore the relevant part of the least-upper-bound has been computed by the time that it is needed. Since the bound, at the time that it is needed, is the least-upper-bound, the search is admissible and near-optimal. Furthermore, when the first sentence is output, the least-upper-bound-so-far will be the exact least-upper-bound.</Paragraph>
    <Paragraph position="10"> A stack pruning threshold can be used to limit the stack size \[16\]. Any theory whose SfSc falls below the threshold can be deleted from the stack. This can be applied on stack insertions and any time the stack is reorganized. This stack pruning threshold has little effect on the computational requirements and can therefore be set very conservatively to essentially eliminate any chance that the correct theory will be pruned.</Paragraph>
    <Paragraph position="11"> In a time-synchronous (TS) no-grammar/unigram language model Viterbi decoder, all word output likelihoods are compared and only the maximum is passed on as input to the word models. Thus by comparison, only theories that dominate the lubsf need be retained on the stack and the stack pruning threshold can be set to zero for top-1 recognition. Since all stack scores, StSc, of all theories popped from the stack will be zero until the first sentence is output, all theories popped from the stack will be in reference time t_min order. (Of course, the stack pruning threshold must be non-zero if a top-N list of sentences is desired.) For top-N recognition, this algorithm adaptively raises the effective computational pruning threshold (which equals the current best StSc) by the minimum required to produce N output sentences,  subject to the limit placed by the stack pruning threshold. null This algorithm is near-optimal and admissible only for a Viterbi decode using non-cross word acoustic models and a no-grammar or unigram language model.</Paragraph>
    <Paragraph position="12"> recognition. (While this algorithm can also perform top-N recognition with or without a language model, it cannot be made equivalent to the no-grammar/unigram language model version for top-N. Its pruning threshold is fixed and it will only output theories whose relative likelihoods do not fall below the threshold.) A STACK DECODER FOR CSR WITH A LONG-SPAN LANGUAGE MODEL The above algorithm fails with a long span language model because the overall best theory can have a less-than-best intermediate score. This less-than-best intermediate score can be locally &amp;quot;shadowed&amp;quot; by the best score and thus will not be popped from the stack \[6\]. An efficient stack decoder algorithm which can be used with cross-word acoustic models, the full (forward) decoder, and longer-span (&gt; 2) language models can be produced by two simple changes: 1. change the stack ordering to be a major sort on the reference time t_ref (favoring the lesser times) and a minor sort on the stack score StSe and 2. use a non-zero stack pruning threshold.</Paragraph>
    <Paragraph position="13"> The reference time t_ref may also be changed from the minimum time which satisfies equation 3 used in the nogrammar/unigram language-model version to t_exit as defined in equation 5. (Either will work and both required similar amounts of computation in tests.) This algorithm appears to be a simplification of one developed at IBM \[3\].</Paragraph>
    <Paragraph position="14"> This algorithm is not admissible because the correct theory can be pruned from the stack. The stackpruning threshold now becomes the computational pruning threshold which controls the trade-off between the amount of computation and the probability of pruning the the correct theory by controlling the likelihood &amp;quot;depth&amp;quot; that will be searched. Unlike the previous algorithm, an (unpruned) theory cannot be shadowed because it will be extended when its reference time is reached. This algorithm is quasi-time-synchronous because it, in effect, moves a time bound forward and whenever this time bound becomes equal to the reference time of a theory, the theory is expanded.</Paragraph>
    <Paragraph position="15"> Note that the stack pruning threshold can also be set to zero for no-grammar/unigram language model top-1 recognition with this algorithm. With a zero stack pruning threshold and t_refl = t_minl, it becomes equivalent to the near-optimal, admissible no-grammar/unigram language model algorithm described above for top-1</Paragraph>
  </Section>
  <Section position="5" start_page="406" end_page="407" type="metho">
    <SectionTitle>
DISCUSSION AND CONCLUSIONS
</SectionTitle>
    <Paragraph position="0"> The above stack-search algorithms have been implemented in a prototype implementation which uses real speech input, but does not yet have all of the features of the Lincoln TS CSR \[13,14,15\]. (The primary missing feature is cross-word phonetic modeling.) The prototype runs faster than does the TS system on the corresponding recognition task, frequently by a significant factor. (In fairness, the TS system does not include a fast match.) Current experience using the DARPA Resource Management Database \[17\] shows the required number of stack pops and the stack size to be surprisingly small.</Paragraph>
    <Paragraph position="1"> In addition, the prototype includes a proposed CSR-NL interface \[12\] and has been run with unigram, word-pair, bigram, and trigram language models accessed through the interface without difficulty. (It has also been run using a no-grammar language model, which, of course, does not require the interface.) This prototype implementation has also been tested with vocabulary sizes up to 64K words. The CSR computation, which is dominated by the fast match, scales approximately as the square root of the vocabulary size.</Paragraph>
    <Paragraph position="2"> Methods for joining the acoustic matching of separate theories and caching of acoustic computations to reduce the acoustic match computation were described in reference \[16\]. These algorithms were tested in a stackdecoder simulator (real stack decoder with simulated input data). The path join accelerator is used in the prototype stack decoder to remove copies of theories which are identical except for non-grammatical items such as optional intermediate silences.</Paragraph>
    <Paragraph position="3"> A* search using the scoring function described by Nilsson \[11\] (equation 6) requires computing the likelihood of the future data (h*(t) in equation 7). The optimal A* decoder requires exact evaluation of h*(t) which requires solving the top-1 recognition problem by some other means, such as a reverse direction TS decoder \[19\], before the A* search can begin. The alternative described here substitutes a near-optimal scoring function which is derived from the A* search and requires negligible additional computation over that required by the search itselfl Since, as noted above, the Lincoln top-1 TS decoder takes more CPU time than does the near-optimal stack decoder, the near-optimal stack decoder algorithm appears to be the most efficient of the  three approaches for top-1 recognition. In addition, the long-span language model version of the stack decoder can very easily integrate long-span language models into the search. However, if top-N recognition is the goal, the optimal A* search may be preferred because, once the price is paid for computing h*(t), the A* search can find the additional N-1 sentences very efficiently for nogrammar/unigram language models \[19\].</Paragraph>
    <Paragraph position="4"> Recently, several other algorithms have been proposed for top-N recognition using A* search \[9,19,22\] which use the Nilsson formulation of the scoring function. All of these approaches use a reverse direction TS decoder to compute h*(t). (A reverse direction top-1 stack decoder could also be used to compute h*(t).) (There are also some proposed non-A* methods for recognizing the top-N sentences \[1,18,21\]. In general, the bidirectional approaches appear to be more efficient than the unidirectional approaches.) These bidirectional A* methods must wait for the end of data (or a pseudo-end-of-data \[9\]) to begin the A* (or the reverse direction) pass. In contrast, because they do not need data beyond that necessary to extend the current theory (this includes data up to t_ref required to choose the current theory), the two stack decoder formulations proposed here can proceed totally left-to-right as the input data becomes available from the front end. The long-span language-model version of the stack search will output all top-N theories with minimal delay following the end-of-data because all theories are pursued in quasi-parallel or, in top-1 mode, it can output the partial sentence as soon as all unpruned theories have a common partial history (initial word sequence). (A similar technique for continuous output after a short delay from continuous input exists for TS decoders \[20\].) One of the motivations for some of these other A* (and top-N) algorithms is as a method for using weaker and cheaper initial acoustic and language models to produce a top-N sentence list for later refinement by more detailed and expensive acoustic and/or language models, which now need only consider a few theories. In contrast the algorithm proposed here integrates both the detailed acoustic and language models directly in the stack search and therefore need only produce a top-1 output. It attempts to minimize the computation by applying all available information to constrain the search. (The stack decoder as described here can, of course, also be used with weak and cheap acoustic and/or language models to produce a top-N list for later processing.) The ultimate choice between the two methods may be determined by the number of sentences required by the top-N approaches and the relative computational costs of the various modules in each system. The architectural simplicity of each system may also have some bearing.</Paragraph>
    <Paragraph position="5"> The stack decoder has long shown promise for integrating long-span language models and acoustic models into a single effective search which applies information from both sources into controlling the search. It has not been used at many sites, primarily due to the difficulty of making the search efficient. The algorithms described above will hopefully remove this barrier.</Paragraph>
  </Section>
  <Section position="6" start_page="407" end_page="408" type="metho">
    <SectionTitle>
APPENDIX: DERIVATION OF
THE A* CRITERION USED IN
EQUATION 2
</SectionTitle>
    <Paragraph position="0"> Nilsson \[11\] states the optimal A* criterion (slightly rewritten to match the speech recognition problem) as</Paragraph>
    <Paragraph position="2"> where fi(t) is the log-likelihood of a sentence with the partial theory i ending at time t, gi(t) is the log-likelihood of partial theory i, and h*(t) is the log-likelihood of the best extension of any theory from time t to the end of the data. (Nilsson uses costs which are interpreted here as negative log-likelihoods. All descriptions here will use sign conventions appropriate for log-likelihoods to be consistent with the rest of the paper.) The theory argmax (mtax fi(t)) is chosen as the next to i be popped from the stack and expanded.</Paragraph>
    <Paragraph position="3"> Equation 6 requires that the computation of the total likelihood of a sentence must be separable into a beginning part and an end part separated by a single time, which disallows this derivation for the full (forward) decoder because the full decoder does not have a unique transition time between two words. Thus, the derivation is limited to a decoder which is Viterbi between words. It also limits the derivation to non-cross-word acoustic models and no-grammar or unigram language model recognition tasks.</Paragraph>
    <Paragraph position="5"> for the best theory with a word transition at time t.</Paragraph>
    <Paragraph position="6"> The function f* (t) is slowly varying with global maxima at the word transition points of the correct theory, at which points it equals the likelihood of the correct theory.</Paragraph>
    <Paragraph position="7"> Specifically, it is maximum at t = 0 and t = T. (T is the end of data.) Since gi(t) is an exact value (rather than a bound or estimate) for a tree search, g*(t) = lubgi(t) and since h*(t) is not a function of i, f*(t) = lubfi(t).</Paragraph>
    <Paragraph position="8"> Subtract equation 7 from equation 6 and define \]i(t)</Paragraph>
    <Paragraph position="10"> This is just equation 2 in a different notation: gi(t) = Li (t) and g* (t) = ubL(t) (specifically lubL(t)) and therefore \]i(t) = Ai(t). Thus, if f*(t) were a constant, \]i(t) would just be an offset from fi(t) and the search would be optimum because argmax (n~ax \]i(t)) would always be i equal to argTax (n ax fi(t)) As noted earlier, f*(t) has maxima at word transition times of the correct theory.</Paragraph>
    <Paragraph position="11"> Thus \]i(t) is zero at word transition times on the correct theory and &lt; 0 for all other i and t. Thus the search is admissible because it can never block the correct theory by giving a better score to an incorrect theory, but sub-optimal because it can cause incorrect theories to be popped from the stack and be evaluated. The evaluation function &amp;quot;error&amp;quot; f* (t) - f* (0) is slowly varying and small, therefore the search is near-optimal.</Paragraph>
    <Paragraph position="12"> Since the stack decoder treats each theory and all points on the likelihood distribution Li(t)) as a unit, each theory is evaluated at its optimum point: the max Ai(t) as t defined in equation 3, to give it its &amp;quot;best&amp;quot; chance and then, for efficiency, the likelihood of all points on the distribution Li(t) are extended in one operation.</Paragraph>
    <Paragraph position="13"> The fact that all StSci are zero until the first sentence is output and the tie is broken by choosing the theory with the minimum reference time t_min, insures that all candidate theories which might alter lubsfLi(t &lt;_ t_minpop) have already been computed. Thus the lubsfL(t) = lubL(t) for t _&lt; t_minpop.</Paragraph>
    <Paragraph position="14"> This derivation shows the stack criterion max StSci with a minimum t_minl tie-breaker to be adequate to perform a near-optimal admissible A*-search Viterbirecognition with non-cross word acoustic models and a no-grammar/unigram language-model using the stack decoder algorithm.</Paragraph>
  </Section>
class="xml-element"></Paper>