<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1081"> <Title>The Lincoln Large-Vocabulary HMM CSR*</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> THE BASIC HMM SYSTEM </SectionTitle> <Paragraph position="0"> The basic system, with the exception of the decoder, is very similar to the earlier Lincoln tied-mixture (TM) systems. The system used here has two observation streams (TM-2): mel-cepstra and time-differential mel-cepstra. (Due to time limitations, the second differential mel-cepstral observation stream used in the TS decoder for SI tasks was not tested.) The system uses Gaussian tied-mixture \[4,6\] observation pdfs and treats each observation stream as if it were statistically independent of all others. Triphone models \[20\] are used to model phonetic coarticulation. (Cross-word triphones, which are a feature of the old TS decoder, will be implemented later.) These models are smoothed with reduced-context phone models \[20\]. Each phone model is a three-state &quot;linear&quot; (no skip transitions) HMM. The phone models are trained by the forward-backward algorithm using an unsupervised monophone bootstrapping procedure. The recognizer extrapolates (estimates) untrained phone models, contains an adaptive background model, allows optional intermediate silences, and can use any left-to-right stochastic LM. The LM module is interfaced via a proposed CSR-NL interface \[11\].</Paragraph> </Section> <Section position="4" start_page="0" end_page="399" type="metho"> <SectionTitle> THE STACK DECODER </SectionTitle> <Paragraph position="0"> The stack decoder is organized according to the description in reference \[16\] and uses the long-span LM search control strategy.
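The tied-mixture observation pdfs and the stream-independence assumption described for the basic system can be sketched as follows. This is an illustrative fragment under our own naming (the real system's codebook sizes, mixture weights, and smoothing are not shown): each state has its own mixture weights over a shared Gaussian codebook per stream, and the per-stream log-likelihoods simply add under the independence assumption.

```python
import math

def log_gaussian(x, mean, var):
    """Log pdf of a diagonal-covariance Gaussian, summed over dimensions."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))

def tied_mixture_loglik(obs, weights, codebook):
    """log b_s(o) = log sum_k w_{s,k} N(o; mu_k, var_k); Gaussians shared across states."""
    logs = [math.log(w) + log_gaussian(obs, mu, var)
            for w, (mu, var) in zip(weights, codebook) if w > 0.0]
    m = max(logs)
    return m + math.log(sum(math.exp(l - m) for l in logs))  # log-sum-exp

def state_loglik(streams_obs, streams_weights, streams_codebooks):
    # Streams (e.g. mel-cepstra and time-differential mel-cepstra) are treated
    # as statistically independent, so their log-likelihoods add.
    return sum(tied_mixture_loglik(o, w, cb)
               for o, w, cb in zip(streams_obs, streams_weights, streams_codebooks))
```

With a single unit-weight Gaussian this reduces to an ordinary Gaussian log pdf, and a second identical stream doubles the log-likelihood, as expected under independence.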
The basic paradigm used by the stack decoder is: remove the best theory from the stack, apply the fast matches (FM) to find a small number of potential successor words, evaluate the log-likelihood of these successors with the detailed matches (DM), and insert the most promising new theories back onto the stack\[2,3,5\]. This paradigm requires that theories of different lengths be compared. Therefore, this system maintains a &quot;least-upper-bound-so-far&quot; (lubsf) of all previously computed theory output log-likelihoods. (Acoustic log-likelihoods and the lubsf are functions of time.) The maximum of the difference between this lubsf and each theory output log-likelihood (StSc < 0) is used to determine the best theory\[15,16\]. Theories whose StSc is less than a threshold are pruned from the stack. Reference \[16\] defines a method for estimating the most likely theory output time, t_exit. The stack entries are sorted by a major sort on t_exit and a minor sort on StSc. Thus the theories are extended primarily on a time basis.</Paragraph> </Section> <Section position="5" start_page="399" end_page="399" type="metho"> <SectionTitle> THE FAST MATCH </SectionTitle> <Paragraph position="0"> The acoustic fast match (AFM) algorithm used here is an HMM phonetic tree generated from the vocabulary\[3\].</Paragraph> <Paragraph position="1"> The output log-likelihood of the current theory is input to the root of the tree and the paths are evaluated using a TS beam search. If an output state's log-likelihood exceeds a threshold, the corresponding word is activated and the best score is recorded. (All references to scores in this paper refer to log-likelihoods.) This tree search needs to be terminated to limit its computation.
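The pop/fast-match/detailed-match/push loop of the stack decoder can be sketched in a few lines. This is a schematic of stack decoding in general, not the Lincoln implementation: the fast_match and detailed_match callables are stand-ins, and StSc is simplified to a scalar (in the real system the likelihoods and the lubsf are functions of time).

```python
import heapq

def stack_decode(fast_match, detailed_match, is_complete, init_score=0.0,
                 stsc_threshold=-200.0):
    """Schematic best-first (stack) decoder.

    A theory is the tuple (t_exit, -StSc, words, score).  heapq pops the
    smallest tuple, so theories are extended with a major sort on t_exit
    and a minor sort on StSc, mirroring the ordering described in the text.
    """
    stack = [(0, 0.0, (), init_score)]
    lubsf = init_score            # least-upper-bound-so-far of output log-likelihoods
    while stack:
        t_exit, _, words, score = heapq.heappop(stack)   # remove the best theory
        if is_complete(words, t_exit):
            return words, score
        for word in fast_match(words, t_exit):           # small set of successor words
            delta, new_t_exit = detailed_match(words, word, t_exit)
            new_score = score + delta
            lubsf = max(lubsf, new_score)
            stsc = new_score - lubsf                     # StSc <= 0 by construction
            if stsc >= stsc_threshold:                   # prune theories far below the bound
                heapq.heappush(stack, (new_t_exit, -stsc, words + (word,), new_score))
    return None
```

A toy run with a two-word vocabulary recovers the intended word sequence; in the real decoder the scores come from the HMM likelihood arrays and the LM.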
The beam pruning threshold used in the AFM search is computed from an estimate of the upper-bound of the AFM state log-likelihoods (AFM-bound) and, when all states are pruned, the AFM terminates.</Paragraph> <Paragraph position="2"> This AFM-bound is computed by a reentrant phonetic tree. (Unlike the AFM tree, the leaves of this tree connect back to the root to provide a path for a word exit to enter the next word. Thus the scores in the FM tree drop off after the word ends while the upper bound of the scores in the FM-bounding tree does not.) This reentrant tree is, in effect, an efficient implementation of a no-grammar recognizer whose only output of interest is the AFM-bound.</Paragraph> <Paragraph position="3"> Any of a number of phonetic units can be used in these trees: the goal is to minimize the total time required to compute the FMs and the DMs without increasing the error rate over that of the DMs alone. An elaborate (and expensive) FM will minimize the DM computation while a very cheap FM will result in a large amount of DM computation. Any of a large number of phonetic units can be used: triphones, left-diphones, right-diphones, monophones, upper-bounding context phones, and simplified network phones. (An upper-bounding context phone is a diphone or monophone whose scores are an upper bound of all scores which would be produced by the triphones covered by the context phone. A simplified network phone might collapse its states into fewer states.) The two trees need not use the same phonetic units and each tree can also use a mix of phonetic units. One extreme would be triphone trees (maximally complex for a triphone-based recognizer) and the other extreme would be one-state monophone trees. It is also possible to use simplified observation pdfs to reduce the computation. Each of these variations must be tested to evaluate the trade-offs. The Lincoln system currently uses TM left-diphones in both trees. Since TM pdfs are relatively expensive to compute, they are cached to prevent recomputation.</Paragraph> <Paragraph position="4"> Because the theories are searched in dominantly t_exit order, it is possible to further reduce the total AFM computation time by grouping all of the theories on the stack whose t_exit's fall within a small time zone, adding their output likelihoods (for a full decode), and applying this sum as input to a single execution of the AFM tree search.</Paragraph> <Paragraph position="5"> (Substitute maximum for sum to perform a Viterbi decode.) This single AFM computation may be somewhat more expensive than the AFM computation for a single theory, but it reduces the number of AFM executions.</Paragraph> <Paragraph position="6"> Once the AFM has completed, the LM fast match (LMFM) log-likelihoods are added to the AFM scores and the result is compared to another threshold. The set of words which survives the second threshold is passed to the DMs. (If an expensive LM algorithm is used, inexpensive estimates of the log-likelihoods may be used in the LMFM. Since the N-gram LMs used in this effort are very cheap to compute, the exact LM DM log-likelihood was used.) If the FM-likelihood is guaranteed to be greater than or equal to the DM-likelihood and the FM decision threshold is the DM lubsf, the FM will be admissible. (An admissible FM is an FM which is guaranteed not to cause any search errors\[3\]. This statement also assumes the beam pruning is generous enough that no FM-tree search errors occur.)</Paragraph> </Section> <Section position="6" start_page="399" end_page="400" type="metho"> <SectionTitle> THE DETAILED MATCH </SectionTitle> <Paragraph position="0"> The DM is implemented as a one-word-at-a-time beam-pruned TS HMM applied to each word which survives the FMs. The input likelihood for each word decode comes from the output likelihood array in each stack entry.
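The pdf caching used by both the fast match and the detailed match can be illustrated with a minimal memoization sketch. The PdfCache class and its (frame, Gaussian) keying are our illustration of the idea, not the Lincoln code; the point is that a tied-mixture Gaussian value depends only on the frame and the shared codebook entry, so any number of phone states can reuse one cached value.

```python
class PdfCache:
    """Memoize tied-mixture pdf values so the FM and DM never recompute
    the likelihood of the same (frame, Gaussian) pair.  Illustrative only."""

    def __init__(self, compute_pdf):
        self._compute = compute_pdf    # compute_pdf(frame, gaussian) -> log-likelihood
        self._cache = {}
        self.misses = 0                # number of actual pdf evaluations

    def __call__(self, frame, gaussian):
        key = (frame, gaussian)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._compute(frame, gaussian)
        return self._cache[key]
```

Repeated lookups for the same key cost one evaluation; the cache would be cleared per utterance in practice.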
(This theory output log-likelihood must be time-truncated so that the important portion fits into this finite array before any new theory is inserted onto the stack.) There is rarely any difficulty fitting the output of a word into this array, but it may not be possible for a continuing sound such as a zone of background (silence).</Paragraph> <Paragraph position="1"> This is handled by using &quot;continuable&quot; background models. The state of the background HMM is also stored on the stack and a long background is modeled as a succession of theories ending in background. (Of course, normal input is possible only for the first of this series of background theories. The later theories rely on the state information.) This also enables a theory to decide that a transition to background has occurred without waiting for the next word to begin.</Paragraph> <Paragraph position="2"> In reference \[15\], a technique for eliminating theories from the stack which are &quot;covered&quot; by an &quot;LM-future-equivalent&quot; (LMF-equivalent) theory is proposed. (One theory covers another if all entries in its output log-likelihood array are greater than those of the second theory at the corresponding times.) Two theories are LMF-equivalent if the probabilities of all future word sequences are the same for both theories. Thus, for an N-gram LM, any theories which share the same N-1 final words are LMF-equivalent. Any LMF-equivalent covered theory can never beat its covering theory and therefore can be eliminated. This is analogous to a path join in a TS decoder. The mechanism also serves to eliminate the poorer of two theories which differ only in optional inter-word backgrounds. (Since optional inter-word backgrounds are not considered by the LM, they may be eliminated before determining LMF-equivalence.)
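The covering test just described translates almost directly into code. This is a simplified fragment under our own naming (the theory representation is ours, the output log-likelihood arrays are assumed time-aligned, and optional inter-word backgrounds are assumed already stripped):

```python
def lmf_equivalent(words_a, words_b, n):
    """Under an N-gram LM, two theories are LM-future-equivalent iff they
    share the same N-1 final words."""
    k = n - 1
    return tuple(words_a[-k:]) == tuple(words_b[-k:])

def covers(out_a, out_b):
    """Theory A covers theory B if A's output log-likelihood exceeds B's
    at every (aligned) time index."""
    return all(a > b for a, b in zip(out_a, out_b))

def prune_covered(theories, n):
    """Drop every theory covered by an LMF-equivalent theory.
    Each theory is a (words, output_loglik_array) pair; names are ours."""
    return [(w_i, out_i) for i, (w_i, out_i) in enumerate(theories)
            if not any(j != i and lmf_equivalent(w_i, w_j, n) and covers(out_j, out_i)
                       for j, (w_j, out_j) in enumerate(theories))]
```

For a bigram LM (n=2) only the final word matters, so a dominated theory ending in the same word is removed, exactly like a path join in a time-synchronous decoder.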
For any limited left-context-span LM, this mechanism prevents the exponential theory growth that can occur in a tree search.</Paragraph> <Paragraph position="3"> The words passed to the DM by the FM are generally acoustically similar and thus frequently share many of the triphones. Therefore the same observation pdfs are likely to be needed more than once. As in the FM, the TM likelihoods are cached to minimize the cost of reuse.</Paragraph> <Paragraph position="4"> This stack decoder does not yet include cross-word phonetic models. It will be possible to add them to the system, but they will certainly increase the complexity of the acoustic DM and perhaps also of the AFM (depending on the type of phonetic unit used in the AFM). Since the system still has some known difficulties/bugs, the implementation of the cross-word phonetic models will be delayed until these problems are under control. Since the 5K word WSJ vocabulary already contains over 6K word-internal triphones and cross-word triphone models will greatly increase this number, practical machine size limits dictate that clustered triphones \[9,10\] or lower context phonetic units, such as semiphones \[14\], be used to reduce the memory required to implement cross-word phonetic models.</Paragraph> </Section> <Section position="7" start_page="400" end_page="401" type="metho"> <SectionTitle> RECOGNITION RESULTS </SectionTitle> <Paragraph position="0"> The initial work developing and implementing the above described stack decoder was performed using the Resource Management (RM) database\[18\]. The WSJ-pilot database training and development-test data has only been fully available for about 5 weeks (as of this writing) and therefore the number of experiments that have been performed on it is limited. Where possible, results will be reported on the WSJ-pilot database, but some results will be quoted from work performed on the RM databases. 
All results must be considered preliminary, particularly since, as noted above, only non-cross-word triphones are being used and the recognizer has known but as yet unfixed algorithmic/implementation bugs.</Paragraph> <Paragraph position="1"> One result that became obvious very quickly after transitioning to the WSJ data was that algorithmic decisions made on the RM data could be very inappropriate for the WSJ task (and presumably any similar large vocabulary task). For instance, work on the RM task suggested that a triphone FM tree with a monophone FM-bounding tree was a good choice for the AFM. This worked very well for RM but was rather slow for WSJ. The triphone FM dominated the computation and was so slow that it slowed down the entire system. The diphone trees mentioned above were significantly faster for WSJ and still worked very well for RM. Similarly, the run-times are much longer and the recognition error rates are much higher for the WSJ experiments, indicating that it is a significantly harder task than RM. The stack decoder is also more than an order-of-magnitude faster than the TS decoder on the RM task with a (full-branching) bigram LM.</Paragraph> <Paragraph position="2"> A series of no-LM tests using RM training and test data was performed to demonstrate the large vocabulary capability of the stack decoder. Since a dictionary was not available at the time this test was performed, a &quot;triletter&quot; dictionary was used (i.e., each three-letter sequence is used in the same fashion as one would use a triphone). The recognizer used RM words augmented with WSJ words to achieve the desired vocabulary. Over a vocabulary size range of 1K to 64K words, the system ran effectively with computation time proportional to the square root of the vocabulary size. The stack decoder used in this test contained a triphone-based FM and thus this result is mostly indicative of the FM computational requirements.
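The square-root scaling reported above implies, for example, that moving from the 1K-word to the 64K-word vocabulary should cost only about an 8-fold increase in computation, rather than the 64-fold increase a linear model would predict. A quick check of that arithmetic:

```python
import math

# Reported scaling: FM-dominated computation time ~ c * sqrt(V).
# Relative cost of a 64K-word vocabulary versus a 1K-word one:
sqrt_ratio = math.sqrt(64_000 / 1_000)
linear_ratio = 64_000 / 1_000
print(sqrt_ratio, linear_ratio)  # 8.0 64.0
```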
This decoder was also demonstrated on the 64K-word task using a perplexity 79 bigram LM.</Paragraph> <Paragraph position="3"> The stack decoder was tested on a variety of the conditions provided by the WSJ-pilot database (Table 1).</Paragraph> <Paragraph position="4"> Due to the limited time available and the immature state of the decoder, only a subset of the available conditions could be tested. Since we were primarily interested in the performance of the decoder, only closed vocabulary tests were performed. (In a closed vocabulary test, all words in the test set are in the recognizer's vocabulary.) The language models are N-gram back-off LMs\[8,12\]. The bigram models are &quot;baseline&quot; models and the dictionary is a function-word-dependent triphone dictionary derived from the &quot;baseline&quot; dictionary supplied by Dragon. (The baseline components are standardized components supplied with the database\[17\].) Inspection of the actual output of the system reveals a non-trivial number of malfunctions. (The total effect of these problems on the results is probably less than 10% of the numbers in Table 1.) In some cases, the likelihood of the output sentence is less than the likelihood of the correct sentence. This could be caused by a pruning error (either FM or DM) or a bug in one (or more) of the routines. Another problem which shows up is an incorrect likelihood for the output theory, probably due to occasional errors in locating the most likely output time for a theory (t_exit).</Paragraph> <Paragraph position="5"> Inspection of these results (Table 1) suggests several observations. Comparison of lines 2 and 3 shows a significant improvement (8.0% v. 10.1% word error) when 2400 rather than 600 SD training sentences are used. Thus, the &quot;knee&quot; in the function of performance vs. amount of training data is not reached by 600 SD training sentences.
Comparison of the LSD-trained systems shows the error rate to increase less than linearly with the perplexity: V=5K, p=44: 6.0%; V=5K, p=80: 8.0%; V=5K, p=118: 10.5%; V=20K, p=158: 13.6%; and V=20K, p=236: 18.0%.</Paragraph> </Section> <Section position="8" start_page="401" end_page="402" type="metho"> <SectionTitle> RAPID SPEAKER ENROLLMENT </SectionTitle> <Paragraph position="0"> There are four basic methods of producing acoustic models for speech recognition: static SI training, static SD training, rapid speaker enrollment, and recognition-time adaptation. The two static methods train the models using prerecorded data and do not change the models thereafter. Rapid speaker enrollment records a small amount of data from a speaker and uses the data to adapt an existing set of models. Recognition-time adaptation adapts the models to the speaker during the recognition process and may be supervised or unsupervised depending on whether or not the speaker corrects the recognition output. We have added a rapid enrollment mode to our TM trainer.</Paragraph> <Paragraph position="1"> The rapid enrollment algorithm used is: read an existing set of TM models into the trainer and adapt (train) only the Gaussians based upon the new data\[19\]. To date, only a few pilot experiments using one test speaker have been performed, shown in Table 2. (The recognition experiments were performed using an obsolete version of the recognizer with a higher error rate than the one used to produce the database results, so the two tables should not be compared.) These results suggest that the adaptation algorithm is operational, but are too statistically weak to draw any firm conclusions. They suggest that another speaker's SD models may give poor initial performance, but are improved significantly by the rapid enrollment process. Both SI models perform better initially, but are only improved a small amount by the enrollment.
All three sets of rapid-enrolled models gave similar performance. As usual, SD models, given enough training data, yield the best performance.</Paragraph> </Section> </Paper>