<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1055">
  <Title>Training Vocab. Unknown</Title>
  <Section position="3" start_page="285" end_page="285" type="metho">
    <SectionTitle>
THE STACK DECODER
</SectionTitle>
    <Paragraph position="0"> A copy of the Lincoln time synchronous (TS) decoder HMM CSR\[18, 20, 21\] has been converted to a stack decoder\[2, 8, 24\] using the optimal A* search\[19\]. The current prototype supports most of the features of the current TS recognizer\[20, 21\]. It uses multiple observation streams, has adaptive background model estimation, optional interword silences, and context dependent phone modeling. The current implementation can be used with or without a language model. The language models are integrated using the CSR-NL interface\[17\]. Language model modules have been built for word-pair and bigram/trigram back-off language models. Unlike the TS decoder implementation which is limited to outputting the single best sentence theory, the stack decoder implementation can output a top-N sentence theory list.</Paragraph>
    <Paragraph position="1"> The current prototype does not yet use tied mixtures\[4, 7\] (TM). For simplicity, it uses Top-1 observation pruning mode, which is equivalent to a discrete observation recognizer using observation pdfs generated by a TM trainer.</Paragraph>
    <Paragraph position="2"> (This was done to delay dealing with the issue of caching the mixture sums. The changes required to convert to an inefficient TM system are trivial.) It also does not yet use cross-word phone modeling pending solution of several implementation issues.</Paragraph>
    <Paragraph position="3"> The system includes a tree-structured fast match\[l, 3\] to reduce the computation required by the detailed acoustic match. This fast match uses a beam-pruned TS search of a phonetic tree built from HMM phone models to locate the the possible one word extensions to the current theory (partial sentence). To reduce computation, only the observation pdfs are used--the transition probabilities are ignored. The current fast match reduces the number of possible next words to about 15% of the vocabulary for the SD (RM) task using triphone models.</Paragraph>
    <Paragraph position="4"> The system has been tested in no-grammar (NG) mode using SD triphone models on the RM task. In some respects, the stack decoder does a better job than does the TS decoder. It sometimes locates a higher probability path through the recognition network than does the TS decoder with a reasonable pruning threshold. (Unfortunately, these paths usually contain a recognition error.) The fixed pruning threshold of the TS decoder terminates these paths, while the stack decoder continues to extend them. If the pruning threshold of the TS system is increased to a value that would ordinarily be considered excessive, the TS system will also find these paths. The stack decoder automatically finds these paths because it has, in effect, adaptive pruning which does not require any fixed thresholds.</Paragraph>
    <Paragraph position="5"> The system has also been tested using WPG and N-gram back-off language models through the CSR-NL interface.</Paragraph>
    <Paragraph position="6"> The potential search errors due to an interaction between the acoustic and language models described and verified by simulation in \[19\] have been observed on real speech data. In fact, this search error can be caused by the word insertion penalty alone. (This appears to be rather infrequent--it was forced experimentally by using a relatively large insertion penalty.) Initial checks for the search error, which used the WPG without probabilities, indicated that the interaction was a minor problem. (The WPG without probabilities gave better recognition results than the WPG with 1 probabilities in the TS recognizer.) branchino-\]actor In contrast to NG and the WPG, the interaction becomes a major problem when one of the stochastic (bigram or trigram back-off) language models is used. A simple, but inefficient solution is to increase the tie-breaking factor in the equation for the stack ordering criterion\[19\]: StSc, =mtax L,(t) - l~bL(t) - et where Li(t) is the likelihood of theory i and lubL(t) is the least-upper-bound-so-far on the theory likelihoods and e is the tie-breaking factor which favors shorter over longer theories. (For the NG case, * need only be a very small number greater than zero.) This is a poor solution because a large value of * will also greatly increase the computation. When a sufficiently large value of * is used, the stack decoder functions as expected with either the trigram or bigram back-off language models. Another possible solution is to run the decoder in top-N mode and select the best sentence after some number of sentences has been output. This approach also significantly increases the computation.</Paragraph>
    <Paragraph position="7"> It appears likely that the interaction problem is due to combining two fundamentally different types of score (logprobabilities) together. The HMM observation and transition probabilities (i.e. the acoustic scores) are accumulated as a function of time and the language model probabilities are accumulated as a function of the number of states traversed. The mixture of these dissimilar scores appears to be damaging the estimation of the least upper bound used to perform the A* search\[16\] in the stack decoder.</Paragraph>
    <Paragraph position="8"> On the average, the current prototype stack decoder runs significantly faster than does the TS decoder on a SD no-grammar task. This is probably due to the adaptivepruning-threshold like behavior of the stack decoder which allows it to pursue the minimum number of theories required to decode the sentences--a small number on the &amp;quot;easy&amp;quot; sentences and a larger number on the &amp;quot;harder&amp;quot; sentences, whereas the TS decoder must always use a worst-case pruning threshold. With the word-pair grammar, the two systems run at similar speeds. The strategies proposed above for combating the interaction problem with the stochastic language models slow the stack decoder sufficiently to cause it to run to significantly slower than the TS decoder. The strategies also increase the number of theories which must be held on the stack.</Paragraph>
  </Section>
class="xml-element"></Paper>