<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1079">
  <Title>The Lincoln Large-Vocabulary Stack-Decoder Based HMM CSR*</Title>
  <Section position="2" start_page="0" end_page="399" type="metho">
    <SectionTitle>
2. The Stack Decoder
</SectionTitle>
    <Paragraph position="0"> The stack decoder is organized as described in reference \[14\]. The basic paradigm used by the stack decoder is:  I. Pop the best theory (partial sentence) from the stack. 2. Apply acoustic and LM fast mntches\[3, 5\] to produce a short list of candidate next words.</Paragraph>
    <Paragraph position="1"> 3. Apply acoustic and LM detailed matches to the candidate words.</Paragraph>
    <Paragraph position="2"> 4. Insert surviving new theories into the stack.  This paradigm requires that theories of different lengths be compared. Therefore, the system maintains a least-upper-bound or envelope of all previously computed theory output log-likelihoods (LLi). (The acoustic log-likelihoods and the envelope are functions of time.)</Paragraph>
    <Paragraph position="4"> Theories whose stack score, StSc, is less than a threshold are pruned from the stack. The stack entries are sorted by an major sort on most likely exit time, t_ezit, and a minor sort on StSc.</Paragraph>
    <Paragraph position="5"> Thus the shortest theories are expanded first which has the net effect of working on a one to two second active region of the input and moving this active region left-to-right through the data.</Paragraph>
    <Paragraph position="6"> The &amp;quot;extend each partial theory with one word at a time&amp;quot; approach allows the use of a particularly simple interface to the LM.</Paragraph>
    <Paragraph position="7"> All requests to the LM are of the form: &amp;quot;Give me the probability of this one-word extension to this theory.&amp;quot; This has been exploited in order to place the LM in an external module connected via sockets (pipes) and specified on the on the command line\[10\]. Since the N-gram LMs currently in use are so trivial to compute, the LM fast match probability is currently just the LM detailed match probability This stack decoder, since all information is entered into the search as soon as possible, need only pursue a &amp;quot;find the best path and output it&amp;quot; strategy. It is also quite possible to output a list of the several best sentences with minor modificatious\[13, 14\].</Paragraph>
    <Paragraph position="8"> Given this search strategy, it is very easy to produce output &amp;quot;on the fly&amp;quot; as the decoder continues to operate on the incoming data. Any time the first N words in all entries on the stack are the same they may be output. (This is the analog of the &amp;quot;confluent node&amp;quot; or &amp;quot;partial tracebac\]d' algorithm \[21\] in a time synchronotm decoder.) No future data will alter this partial output.</Paragraph>
    <Paragraph position="9"> Similarly, since the active zone moves left-to-right though the data, the stack decoder can easily be adapted to unbounded length input since the various envelopes and the like need only cover the active region. In practice this involves an occasional stop and pass over the internal data shifting it in buffers, altering pointers,  and renormalizing it to prevent underflow, but these are simply prob\]elns of implementation, not theory or basic operation.</Paragraph>
  </Section>
  <Section position="3" start_page="399" end_page="399" type="metho">
    <SectionTitle>
3. The Fast Match
</SectionTitle>
    <Paragraph position="0"> The aconstic fast match (FM) uses a two pass strategy. Both passes search a left-dlphone tree generated from the recognition vocabulary. The first pass takes all theories for which t.ezit i =rain J t.ezitj and combines their log-likelihoods (take their vector r~ximum tbr a Viterhi decoder) to create the input log-likelihood for the decoder. This decode produces two outputs: it sets pruning thresholds for the second passes and marks all dlphone nodes for words whose FM output log-likelihood exceeds a FM output threshold envelope.</Paragraph>
    <Paragraph position="1"> The second pass is applied for every theory which was included in the above combination. It applies the exact log-likelihood from the detailed match as input to the left-diphone tree using the pruning thresholds from the first pass and searching only the marked allphone nodes. The word output log-llkelihoods are added to the LM log-probabilities to produce the net word output log-likelihoods. The cumulative maximum of these net log-liks\]Jhoods pins a (negative) threshold now produces the FM output threshold envelope. Any word whose output log-likelihood exceeds this threshold envelope is placed on the candidate word list for the detailed match. Both passes of the fast match use a beam pruned depth-first (DF) search of the dlphone tree. (The beam pruning requires a cumulative envelope of the the state-wise log-likeUhoods.) The DF search is faster than a tlme-synchronous (TS) search due to its localized data. At any one time, it only needs an input array (which was used very recently), and output array (which will be used very soon), and the parameters of one phone whereas the TS search must touch every active state before moving to the next time. This allows the DF search to stay in cache (~1 MB on many current workstations) and to page very efficiently. The TS search, in comparison, uses cache very inefficiently and will virtually halt if it begins to page. (A stack search was also tested. Becanse the operational unit--the diphone--is so small, its overhead canceled any advantages. Its computational locality is also not as good as that of the DF search.) A goal of recognition system design is to rn|n|m|ze the over~ nm time without loss of accuracy. In the current system, this minimum occurs (so far) with the relatively expensive fast match described above. It is the largest time consumer in the recognizer. Using generous pr.n;ng thresholds that reduce the number of fast match proning errors to below a few tenths of a percent, this fast match allows only an averalp of about 20 words of a 20K word vocabtdary to he passed to the detailed match.</Paragraph>
  </Section>
  <Section position="4" start_page="399" end_page="399" type="metho">
    <SectionTitle>
4. The Detailed Match
</SectionTitle>
    <Paragraph position="0"> The detailed match (DM) is currently implemented as a beam-pruned depth-fast searched triphone tree. The tree is precompiled for the whole vocabulary&amp;quot; m|nu8 the silence phones, but only triphone nodes corresponding to the FM candidate words are searched. The LM log-probabilities are integrated into the triphone tree to apply the information as soon as possible into the search. The beam pruning again requires a state.wlse log-likelihood cumulative envelope. Because the right context is not available for cross-word triphones, the final phone is dropped from each word and prepended to the next word.</Paragraph>
    <Paragraph position="1"> The silence phones, because they may have very long duration are &amp;quot;contirnmble'--that is they run for a limited duration and then are placed on the stack for later continuation. They are computed using very small time synchronous decoders so that their state can be placed on the stack to allow the continuation. This allows a finite fixed-slze likelihood buffer in each stack entry and reduces decoding delays.</Paragraph>
    <Paragraph position="2"> &amp;quot;Covered&amp;quot; theories are pruned from the search\[13\]. One theory covers another if all entries in its output log-likelihood arras, are greater than those of the second theory at the corresponding times and its LM probabilities will be the same for all possible extensions. A covered theory can never have a higher likelihood than its covering theory and is therefore pruned from the search. (Thk k analogous to a path join in a TS decoder.) For any limited leftcontext-span LM, such as an N-gram LM, this mecha-;em prevents the exponential theory growth that can otherwise occur in a tree search.</Paragraph>
  </Section>
  <Section position="5" start_page="399" end_page="402" type="metho">
    <SectionTitle>
5. Component Algorithms
</SectionTitle>
    <Paragraph position="0"> This recognition system includes a variety of algoritluns which are used as components supporting the major parts described above.</Paragraph>
    <Section position="1" start_page="399" end_page="399" type="sub_section">
      <SectionTitle>
5.1 DF Search Path Termination
</SectionTitle>
      <Paragraph position="0"> It is not always possible to determine when to terminate a search path in a non-TS search because the first path to reach a point in time will not be able to compare its likelihood to the likelihood of any other path. Thus a heavily pruned TS left-diphons tree no-grammar decoder is used to produce a rough estimate of the state-wise envelope for all theories up to the current time. This envelope is used primarily to alter the beam-pruning thresholds of the FM and DM such that the search paths terminate at appropriate times.</Paragraph>
      <Paragraph position="1"> This decoder requires only a very small amount of computation.</Paragraph>
    </Section>
    <Section position="2" start_page="399" end_page="400" type="sub_section">
      <SectionTitle>
5.2 Bayesian Smoothing
</SectionTitle>
      <Paragraph position="0"> In a number of situations it is necessary to smooth a sparse-data estimate of a parameter with a more robust but less appropriate estimate of the parameter. For instance, the mixture weights for sparse.data triphone pdfs might be smoothed with coro responding mixture weights from the corresponding diphones and monophones\[2~ or, in an adaptive system, the new estimate based upon a small amount of new data might be smoothed with the old estimate based upon the past data and/or training data. The following smoothing weight estimation algorithm applies to parameters which are estimated as a \[weighted\] average of the traini,~ data.</Paragraph>
      <Paragraph position="1"> A standard Bayesian method for comblnln~, new and old estinmtes of the same parameter is N. No z = N. + No x. + N~Xo where x is the parameter in question and N is the number of counts that went into each estimate and the subscripts n and o denote new and old. Similarly if one assumes the variance v of each estimate is inversely proportional to N (i.e. v c&lt; ~),</Paragraph>
      <Paragraph position="3"> The above asstunes z. and ~o to he est;nmtes of the same parameter. However, in the case of smoothing, the purpose is to use data from a different but related old parameter to improve the estimate of the new parameter. For the above examples, z n might * be an estimate from a triphone and z o from a diphone or monophone, or x. might he an estimate from the current speaker (i.e.</Paragraph>
      <Paragraph position="4"> be speaker dependent) and x o from speaker-independent training data. Thus</Paragraph>
      <Paragraph position="6"> If one assumes that the expected values of z and z o differ by a zero mean Gaussian representing the unknown bias,</Paragraph>
      <Paragraph position="8"> then a corrected estin~te for the old variance is ~/o= ~o+~d.</Paragraph>
      <Paragraph position="9"> If we now substitute the new value for v o and return to the initial form of the estim~,t, or,</Paragraph>
      <Paragraph position="11"> Note that No ~ _~ Nd and thus the smoothing equation discounts the value of the old data accord;nf to N d which, for the above examples of emoothmg, may be determined empirically. This equation can he trivially extended to include multiple old estimates for emooth;r~, a trlphone with the left diphone, right diphone, and monophone. In this recognition system, symmetries and linear interpolation across states have been used to reduce the number of Nd's for triphone smoothing from twelve to three. This smoothing scheme has also been tested for spe.~er adaptation (results below) and might also be used in language modeling.</Paragraph>
    </Section>
    <Section position="3" start_page="400" end_page="400" type="sub_section">
      <SectionTitle>
5.3 Channel Compensation
</SectionTitle>
      <Paragraph position="0"> Blind channel compensation during both training and recognition is performed by first sccAnnlng the sentence and averaging the mel-cepstra of the speech frames. This average is then subtracted from all frames (commonly known as meLcepstral DC removal.) This does not affect either of the differential observation streams.</Paragraph>
    </Section>
    <Section position="4" start_page="400" end_page="400" type="sub_section">
      <SectionTitle>
5.4 Speaker Adaptation
</SectionTitle>
      <Paragraph position="0"> One cannot always anticipate the identity of the user when training a recognizer. The &amp;quot;standard&amp;quot; SI approach is to pool a number of speakers to attempt to provide a set of models which gives adequate performance on a new user. This approach, however, creates models which are rather broad because they attempt to cover any way in which any speaker will pronounce each word rather than the way in which the current speaker will pronounce the words.</Paragraph>
      <Paragraph position="1"> (This is consistent with the fact that SD models outperform SI models given the same amount of training datao) Speakers may also not be willing to prerece~d training or rapid enrollment data and wait for the system to process this data.</Paragraph>
      <Paragraph position="2"> One solution for this problem is recognition-time adaptation, in which the recognizer adapts to the current user during normal operation. This solution also has the advantage that the recognizer can track changes in the user's voice and changes in the acoustic environment. The paradigm used here is to initialize the system with some set of models such as SI models or SD models from another speaker, recognize each utterance with the current set of models, and finally use the utterance to adapt the current models to create a new set of models to be used to recognize the next utterance\[9, 16\]. If the user supplies any inforraation to correct or verify the recognized output, the adaptation can be supervised, otherwise the adaptation wKI be unsupervised.</Paragraph>
      <Paragraph position="3"> The adaptation algorithm used here is a simple smoothed  maximum-likelihood scheme: 1. Start with some set of acoustic models, M which have had their DC removed as in channel compensation.</Paragraph>
      <Paragraph position="4"> 2. Perform channel compensation (mel-cepstral DC removal).</Paragraph>
      <Paragraph position="5"> 3. Recognize the utterance using the current model, M.</Paragraph>
      <Paragraph position="6"> 4. Compute the state sequence and alignment using either the corrected text (supervised) or the recognized text (unsupervised). null 5. Compute new estimates of the model parameters M new using 1 iteration of Viterbi training.</Paragraph>
      <Paragraph position="7"> 6. Update the model by smoothing the new esthnates of the</Paragraph>
      <Paragraph position="9"> 7. Go to 2.</Paragraph>
      <Paragraph position="10"> The adaptation rate parameter', A, trades o~ tl~e adaptation speed and the limiting recognition performance. A need not be constant--but was held constant in these experiments. Only the Ganssian means of a TM system with a tied variance were adapted in these experiments. (Adapting other parameters wm be explored at a later date.) A number of experhnents using simplified phonetic models were performed to evaluate SI starts, cross-sex SD starts, and same-sex SD starts\[16\] using the RM database\[17\]. The adaptation helped in all cases, even for unsupervised adaptation of the cross-sex starts which started with a word error rate of 94~. A system which trained SI mixture weights with SD Gausslam and then fixed the weights while training a set of SI Gaussiaus was also tested in the hope that, once adapted, it wonld look more like an SD system than a system started with normal SI models. Its unadapted performance was somewhat worse than the normal SI system, but after adaptation, its performance was better than the normal SI system.</Paragraph>
      <Paragraph position="11"> The results for our best SI-109 trained system (TM-3, cross- null the error rates for both supervised and unsupervised adaptation. In no case did any system diverge. A Bayesian adaptation scheme based upon the above algorithm was also tested, but was no better than the simple ML algorithm. Unfortunately, the improvement was much less when tested upon the WSJ database (see below).</Paragraph>
    </Section>
    <Section position="5" start_page="400" end_page="400" type="sub_section">
      <SectionTitle>
5.5 Pdf cache
</SectionTitle>
      <Paragraph position="0"> Tied mixture pelfs are relatively expensive to compute and a pdf cache is necessary to m~nlmize the computationalload. Each c~,~he location is a function of the state s and the time t. The cache must also be able to grow efficiently upon demand and discard outdated entries efficiently. Algorithms such as hash tables do not grow efficiently and have terrible memory cache and paging hehavier.</Paragraph>
      <Paragraph position="1"> Instead, the pdf cache is stored as a dynamically allocated three dimensional array: prig\[tiT\]is\]it%T\] where % is the modulo operator. Only the first level pointer array (it/T\]) is static, both the is\] pointer arrays and the actual storage locations it%T\] are allocated dynamically. Outdating is simple: remove all pointer arrays and storage locations for tiT &lt; t '/T (integer arithmetic), allocatic~l occurs whenever a null pointer is traversed, and access is just two pointers to a one dimensional array. It is also a very good match to a depth-first search since such a search accesses the states of a phone sequentially in time for a number of time steps which gives very good memory cache and paging performance. This caching algorithm is used in both the trainer and the recognizer.</Paragraph>
    </Section>
    <Section position="6" start_page="400" end_page="401" type="sub_section">
      <SectionTitle>
5.6 Initialization of the Gaussians
</SectionTitle>
      <Paragraph position="0"> Previously, the Gaussiaus were initialized as a single mixture by a binary splitting EM procedure. However, it was discovered that these sets of Gaussians tended to be degenerate (i.e. a number  of the Gauaslans were identical to at least one other Gaussian). Chemgixlg the initialization procedure to a top-1 EM (in effect a K-mes.ns that also trains the mixture weights) removed the degeneracy. This did not alter the recognition accuracy, but significantly reduced the mixture summation computation since these sums are observation pruned (only compute the summation terms for the Gausslmrs within a threshold of the best Gausslan).</Paragraph>
    </Section>
    <Section position="7" start_page="401" end_page="401" type="sub_section">
      <SectionTitle>
5.7 Trainer Quantization Error Reduction
</SectionTitle>
      <Paragraph position="0"> The SI-284 training condition of WSJ1 uses 30M frames of training data and, in the Bmlm-Welch training procedure, significant fractions, of these frames are snmrned into ~Ilgle numbers. The number of mixture weights (167M, see below) for the largest set of models was so large that only two byte integer log-probs could be allocated to each accumulator. (Quantization in the estimation of the mixture weights flattens the distribution.) Multi-layer sums were used to reduce the quantization error without unduly increasing the the dataspace requirements.</Paragraph>
      <Paragraph position="1"> Since there were only a relatively few Gauasians in this system (771), qnAnt~zatlen in estimating them was reduced by the use of double-precisien accumulators and a change of variable to additionally reduce the error in estimating the variance: If one substitutes z~ for zi where z~ -- ~vi - ~ where ~ is an estimate of ~ will reduce the second term and thus the quantizatlon error. 2 from the previous iteration can be used as ~ in the current iteration.</Paragraph>
    </Section>
    <Section position="8" start_page="401" end_page="401" type="sub_section">
      <SectionTitle>
5.8 Data-Driven Allophonic Tree Clustering
</SectionTitle>
      <Paragraph position="0"> ing Previous techniques for allophonlc tree clustering have generally used a single phonetic rule (simple question) to make the best binary split (according to an information theoretic metric) in the data at each node and some of these techniques alternate between splitting and combining several nodes to minimize the data reduction by forming &amp;quot;compound questions&amp;quot; at the nodes\[2\]. Another approach is to ask the &amp;quot;most.compound question&amp;quot; from the start. In this approach, if one is searching for the best split based upon the, for instance, right phonetic context and there are N right phonetic contexts in the triphones assigned to the current node, then there are 2 (N-l) possible splits. (N can easily be greater than one hundred in some of the nodes near the root of the tree.) Such a search problem can be solved by simulated annealing, genetic search, or multiple quenches from random starts. All three were tried and multiple quenches from random starts appeared to give the highest probability of obtaining the optimum split for a given amount of CPU time. Finally, the pdf weights at each node are smoothed with those of its parent using the Bayesian smoothing descried above. This smoothlnf is carried out from the root down toward each leaf so that, in effect, each node is smoothed by all of the data. The software for this technique has been developed and debugged on the RM database, but we have not yet had sufficient time to test this algorithm on a large vocabulary task. (This algorithm is not currently in use.)</Paragraph>
    </Section>
    <Section position="9" start_page="401" end_page="402" type="sub_section">
      <SectionTitle>
5.9 Parallel Path Search of a Network
</SectionTitle>
      <Paragraph position="0"> In a simple single pass fast match, the fast match network (phonetic tree in this system) must be searched once per theory. This is very expensive because the same network must be searched over the same input data many times. One method for reducing this computation is searching the network once with a technique which computes many inputs in paral/el. This search technique represeats the data as two data structures: a &amp;quot;max structure&amp;quot; which contains maximum (for a Viterbl search) with a pointer to a &amp;quot;delta structure&amp;quot; which contains a link count and a list of individual deltas such that the sum of the maximum and the deltas gives the individual Iog-probabilitles. A pass over the input data will create one max structtwe per input time step and fewer delta structures since delta structures can be shared by any number of max structures. Many operations (60-80~ in these experiments) of the network decode will share the same delta structure and thus the log-probabillties corresponding to all of the inputs can be computed with just operations on the max structures. When paths represented by max structures with different delta structures join, then operations of linking, upllnkln~, and/or creation must be performed on the delta structm-es. The link count is used to garbage collect unlinked delta structures. This algorithm was used for a while in the fast match, but has been replaced by the two pass algorithm described above which is faster and uses less space.</Paragraph>
      <Paragraph position="1"> 5.10 Gaussian Variance Bound Addition A wen-known problem in ML estimation of Gauasian-dependent variances in Gauasian mixtures is variance values that go to zero.</Paragraph>
      <Paragraph position="2"> Two common methods for preventing this singldarity are lower bounding or using a grand variance. Simple addition of a constant to each variance has been found to be a superior alternative to lower bounding: it is equally trivial to apply and has yielded superior recognition performance on several recognition tasks. For instance, for several tasks using single observation stream Gaussian- null In both tasks, the performance was improved by over two standard deviations by the use of variance addition. While not needed to insure non-singnlarity, variance addition was also found to improve recognRion in a grand variance system:  In spite of the robustness of the estimate of the grand variance, the performance is improved significantly by variance limiting and even more by the variance addition.</Paragraph>
      <Paragraph position="3"> Clearly the variance addition is doing something more than just preventing singular variances. One possible viewpoint is that variance addition is a soft limiting function. A simple bound throws away all information about the original variance while the addition retains some of the original information. Another possible view is that the variance addition is providing signal-to-noise (S/N) ratio compensation. Each component of the observation vector contains both useful signal and noise. Variance addition might act like a Wiener filter in adjusting the gain on each component appropriately: null</Paragraph>
      <Paragraph position="5"> where the second term on the left is the normal suram~tion term in the exponent of a diagonal covariance Gaussian and the first term on the left is analogous to a Wiener filter if llm represents the noise power. (In the above systems one would expect the measurement and quantizatien noise power to be the same in all observation components.) This technique was discovered too late to be included in any of the following recognition results.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>