<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1081">
  <Title>Is N-Best Dead?</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2. 3-PASS N-BEST SEARCH STRATEGY
</SectionTitle>
    <Paragraph position="0"> The BYBLOS system has been described previously (e.g., \[3\]). We reiterate here the use of the N-Best Paradigm in that system.</Paragraph>
    <Paragraph position="1"> The decoder used a 3-pass search strategy: a forward pass followed by a backward Word-Dependent N-Best search algorithm \[4\] using a bigram language model, within-word triphone models, and top-1 (discrete VQ) densities. The N-Best hypotheses were then rescored using cross-word triphone context models, top-5 mixture densities, and a trigram language model.</Paragraph>
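    <Paragraph> At its core, the rescoring step of the N-Best paradigm is a re-ranking of a fixed hypothesis list under stronger knowledge sources. A minimal sketch in Python (the words, scores, and scoring functions below are illustrative, not BYBLOS internals):

```python
def rescore_nbest(hypotheses, acoustic_score, lm_score, lm_weight=1.0):
    """Re-rank an N-Best list with stronger knowledge sources.

    hypotheses     : list of word-sequence tuples from the first pass
    acoustic_score : maps a hypothesis to its log acoustic score
    lm_score       : maps a hypothesis to its log language-model score
    """
    scored = [
        (acoustic_score(h) + lm_weight * lm_score(h), h) for h in hypotheses
    ]
    scored.sort(key=lambda pair: -pair[0])  # highest combined score first
    return [h for _, h in scored]

# Toy example: the weaker first pass ranked the second hypothesis higher,
# but the stronger language model prefers the first one.
nbest = [("show", "me", "flights"), ("show", "me", "lights")]
acoustic = {nbest[0]: -10.0, nbest[1]: -9.5}.get
lm = {nbest[0]: -2.0, nbest[1]: -6.0}.get
reranked = rescore_nbest(nbest, acoustic, lm)
```

Because each hypothesis is a plain word sequence, any scoring function can be plugged in without touching the recognition search itself.</Paragraph>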
    <Paragraph position="2"> Typically, the backward Word-Dependent N-Best pass requires about half the time required by the forward pass. Rescoring each alternative sentence hypothesis individually with cross-word triphone models only requires about 0.2 seconds per hypothesis. And rescoring the text of the hypotheses with a high-order n-gram language model \[5\] requires essentially no time.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="412" type="metho">
    <SectionTitle>
3. ADMISSIBILITY
</SectionTitle>
    <Paragraph position="0"> It has often been asserted that the N-Best paradigm is inadmissible because when the initial N-Best list is created using weaker knowledge sources, then the answer that would have had the highest score using the stronger knowledge sources might not be within the list of alternatives, and therefore never have a chance to be rescored. This would be especially likely when the error rate is high and the utterances  are long, since the number of alternative sentences needed to include the correct answer would grow exponentially with the length of the utterance.</Paragraph>
    <Paragraph position="1"> The knowledge sources (e.g., cross-word triphones and tri-gram language models) used for rescoring in the 3-Pass N-Best strategy described above were much more powerful than the original knowledge sources (e.g., within-word triphones and bigram language models) in that they frequently reduced the error rate by half. However, we had assured ourselves that, at least for moderate-size problems (like ATIS with 2,000 words or WSJ with 5,000 words), there were few if any additional errors caused by the correct answer not being included in the N-Best list.</Paragraph>
    <Paragraph position="2"> However, after the November 1992 DARPA Continuous Speech Recognition (CSR) evaluations, we were concerned that we might be losing some performance as a result of our use of the 3-Pass N-Best strategy (rescoring with cross-word triphones, top-5 mixture densities, and trigram language models) on the 20,000-word WSJ test. This was because there were many sentences for which the correct answer was not in the N-Best hypotheses although it had a higher total score (when including the trigram language model and cross-word triphones) than any sentence hypothesis in the N-Best list. We felt that this was due to the higher word error rate that resulted from recognition with a large vocabulary of 20,000 words, and the long utterances found in the Wall Street Journal (WSJ) corpus.</Paragraph>
    <Paragraph position="3"> Therefore, this year we implemented a more complicated search strategy similar to the Progressive-Search strategy suggested by Murveit \[6\] in which we use the initial passes to create a lattice of alternative hypotheses, which can then be rescored. The advantage of this approach is that a lattice with a small number of alternatives at each point can represent a very large number of alternative sentence hypotheses. In addition, rescoring the lattice of alternatives is computationally less expensive than rescoring a large explicit list of sentence alternatives. This also avoids the rather large intermediate storage required to store the N-Best hypotheses.</Paragraph>
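    <Paragraph> The compactness claim can be made concrete: the number of sentence hypotheses a lattice encodes grows multiplicatively with its branching, so a small lattice stands in for a very large explicit N-Best list. A toy illustration (the lattice below is invented), counting paths through a DAG by memoized recursion:

```python
def count_paths(lattice, start, end):
    """Count distinct start-to-end paths through a word lattice (a DAG).

    lattice maps each node to the list of its successor nodes.
    """
    memo = {}
    def paths_from(node):
        if node == end:
            return 1
        if node not in memo:
            memo[node] = sum(paths_from(nxt) for nxt in lattice.get(node, []))
        return memo[node]
    return paths_from(start)

# A lattice with two alternatives at each of three word slots encodes
# 2**3 = 8 sentence hypotheses while storing only six word nodes.
lattice = {
    "start": ["a1", "a2"],
    "a1": ["b1", "b2"], "a2": ["b1", "b2"],
    "b1": ["c1", "c2"], "b2": ["c1", "c2"],
    "c1": ["end"], "c2": ["end"],
}
n = count_paths(lattice, "start", "end")
```

With ten alternatives at each of twenty word slots, the same storage economy yields the 10^20 hypotheses mentioned in the experimental results below.</Paragraph>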
    <Section position="1" start_page="411" end_page="412" type="sub_section">
      <SectionTitle>
3.1. 4-Pass Lattice Search Algorithm
</SectionTitle>
      <Paragraph position="0"> In this section we describe a 4-Pass Lattice Search algorithm that avoids the early use of the N-Best.</Paragraph>
      <Paragraph position="1">  1. The time-synchronous beam search algorithm with a vocabulary of 20,000 words and a bigram language model typically requires substantial computation on today's workstations. Therefore, we make extensive use of the Normalized Forward Backward Search algorithm \[8\] to reduce computation. We perform a first pass using a fast match technique whose sole purpose is to find and save high scoring word ends. Because this model is approximate, it can run considerably faster than the usual beam search. And because the later passes will be more accurate, the first pass need not be as accurate.</Paragraph>
      <Paragraph position="2">  2. A second pass, time-synchronous beam search, using a bigram language model, within-word triphones, and (top-1 VQ) discrete models runs backward. This pass is sped up by several orders of magnitude by using the Normalized Forward Backward pruning on the word-ending scores produced by the first pass. We save the beginning times and scores (β_t(w)) of all words found. This pass requires much less time than the first pass.</Paragraph>
      <Paragraph position="3"> 3. A third pass identical to the second pass runs forward, using the Normalized Forward Backward pruning on the word-beginning scores produced by the second pass. Similar to the second pass, we save the ending times and scores (α_t(w)) of all words found (constrained by the second pass).</Paragraph>
      <Paragraph position="4"> 4. We use the beginning (β) and ending (α) scores from passes 2 and 3 to determine possible word-juncture times. Specifically, if the forward-backward score for a particular pair of words is within a threshold of the total score for the utterance, then this word-pair is used. That is, the pair is used if α_t(w_j) · Pr(w_i|w_j) · β_t(w_i) > A · α(utterance), where Pr(w_i|w_j) is the probability of w_j followed by w_i, and A is the threshold (which can be a function of either α or β).</Paragraph>
      <Paragraph position="5"> Adjacent word-junctures are merged. Having found a word-pair, we look for the next word-juncture where this second word is the first word of the next pair. The result is a lattice of word hypotheses. If the range of beginning and ending times for a single word overlap, then we create a loop for that word.</Paragraph>
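      <Paragraph> The word-pair test in step 4 can be sketched in log scores: a pair survives when the forward score of the first word ending at time t, the bigram probability, and the backward score of the second word beginning at t come within a threshold of the total utterance score. A simplified sketch (the data structures and example scores are hypothetical, not the BYBLOS implementation):

```python
def prune_word_pairs(alpha, beta, bigram, total_log_score, log_threshold):
    """Select the word pairs allowed at each juncture time.

    alpha  : {(time, word): forward log score of word ENDING at time}
    beta   : {(time, word): backward log score of word BEGINNING at time}
    bigram : {(prev_word, word): log Pr(word | prev_word)}
    A pair (w_j, w_i) at time t survives when
        alpha[t, w_j] + log Pr(w_i | w_j) + beta[t, w_i]
    is within log_threshold of the total utterance log score.
    """
    kept = []
    for (t, w_j), a in alpha.items():
        for (t2, w_i), b in beta.items():
            if t2 != t:
                continue
            lp = bigram.get((w_j, w_i))
            if lp is None:
                continue
            if a + lp + b + log_threshold >= total_log_score:
                kept.append((t, w_j, w_i))
    return kept

# Toy scores: "show me" scores near the utterance total and is kept;
# "show be" falls outside the threshold and is pruned.
alpha = {(5, "show"): -10.0}
beta = {(5, "me"): -4.0, (5, "be"): -12.0}
bigram = {("show", "me"): -1.0, ("show", "be"): -3.0}
kept = prune_word_pairs(alpha, beta, bigram,
                        total_log_score=-15.0, log_threshold=2.0)
```
</Paragraph>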
      <Paragraph position="6"> The word lattice (which is really just a small finite-state language model) is then expanded to allow for maintaining separate scores for trigram language models and cross-word triphones. This entails copying each word in the context of each preceding word, and replacing the triphones on either side of the word junctures with the appropriate cross-word triphones. Thus, each word in the lattice represents a particular instance of that word in the context of some particular other word. The transition probabilities in the lattice are the probability of the next word given the previous two words - trigram probabilities.</Paragraph>
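      <Paragraph> The expansion can be sketched as copying each word once per predecessor, so that a node in the expanded lattice is a (previous word, word) pair and every arc carries two words of left context for a trigram probability. A minimal illustration (the arcs and trigram table are invented; "BOS" marks the sentence start):

```python
def expand_for_trigrams(arcs, trigram):
    """Expand a bigram word lattice so arcs can carry trigram probabilities.

    arcs    : list of (prev_word, word) pairs in the original lattice
    trigram : {(w1, w2, w3): log Pr(w3 | w1, w2)}
    A node in the expanded lattice is the pair (prev_word, word); an arc
    from (w1, w2) to (w2, w3) sees both left-context words, so it can be
    weighted by the trigram probability Pr(w3 | w1, w2).
    """
    expanded = {}
    for (w1, w2) in arcs:
        for (p, w3) in arcs:
            if p != w2:
                continue
            lp = trigram.get((w1, w2, w3))
            if lp is not None:
                expanded[((w1, w2), (w2, w3))] = lp
    return expanded

arcs = [("BOS", "show"), ("show", "me"), ("me", "flights")]
trigram = {("BOS", "show", "me"): -0.5, ("show", "me", "flights"): -0.7}
expanded = expand_for_trigrams(arcs, trigram)
```

Each original word is replicated once per predecessor, which is exactly why the expansion is affordable for trigrams but would blow up for a four-gram language model, as noted later in Section 4.</Paragraph>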
      <Paragraph position="7">  We perform a fourth pass in the backward direction using this expanded language model. The result is the most likely hypothesis including cross-word and trigram knowledge sources.</Paragraph>
      <Paragraph position="8"> However, we are not done at this point, because we may want to apply more powerful, but more expensive, knowledge sources. We generate the N-Best alternative hypotheses out of the search on this lattice. The Word-Dependent N-Best algorithm \[4\] requires that we keep separate scores  at each state depending on the previous word, because the boundary between two words clearly depends on those two words. But the words in the lattice are only defined in the context of the neighboring word. Thus, by keeping the scores of all of the ending word hypotheses, we can recover the N-Best alternatives. However, in contrast to its previous use, these N-Best answers have been computed including the more powerful knowledge sources of cross-word coarticulation models and trigram language models.</Paragraph>
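      <Paragraph> Recovering N-Best answers from the expanded lattice amounts to enumerating its highest-scoring paths. The sketch below uses a generic best-first path enumeration rather than the Word-Dependent N-Best algorithm of \[4\], which additionally merges hypotheses that share a final word and boundary time; the lattice and scores are illustrative:

```python
import heapq

def n_best_paths(succ, start, end, n):
    """Return up to n highest-scoring start-to-end paths in a word lattice.

    succ maps a node to a list of (next_node, log_score) arcs.
    """
    # heap entries: (negated log score, path) so the best path pops first
    heap = [(0.0, (start,))]
    results = []
    while heap and len(results) != n:
        neg, path = heapq.heappop(heap)
        node = path[-1]
        if node == end:
            results.append((-neg, path))
            continue
        for nxt, lp in succ.get(node, []):
            heapq.heappush(heap, (neg - lp, path + (nxt,)))
    return results

# Toy expanded lattice with two competing first words.
succ = {
    "BOS": [("show", -1.0), ("so", -2.0)],
    "show": [("me", -1.0)],
    "so": [("me", -1.5)],
    "me": [("END", 0.0)],
}
best_two = n_best_paths(succ, "BOS", "END", 2)
```

Because the lattice was already expanded, any N-Best list read off it has cross-word and trigram scores built in, which is the key contrast with the original 3-pass use of N-Best.</Paragraph>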
    </Section>
    <Section position="2" start_page="412" end_page="412" type="sub_section">
      <SectionTitle>
3.2. Experimental Results
</SectionTitle>
      <Paragraph position="0"> We performed an experiment in which we compared the recognition accuracy of this 4-Pass Lattice approach with the previous 3-Pass N-Best approach. In both cases, the initial search (to create the lattice or to find the N-Best sentence hypotheses) used only a bigram language model and within-word coarticulation models with top-1 VQ discrete densities, while the final search (on the lattice) or rescoring (of the N-Best) used a trigram language model and between-word coarticulation models with top-5 mixture densities. Initially, we were surprised to find that the accuracy using the lattice was actually slightly worse than that of the original N-Best method. Then, we realized that this was due to the larger number of alternatives. A lattice with an average depth (the average number of branches out of a word-end node) of 10 for a sentence of 20 words can be thought of as an N-Best list with 10^20 hypotheses. We had previously found that, in the 3-Pass N-Best approach, the correct utterance might have a higher score than the answers in the top 100 best hypotheses; but in the 4-Pass Lattice approach there were also countless other incorrect hypotheses that had higher scores than the answers in the original N-Best.</Paragraph>
      <Paragraph position="1"> The search on the lattice often found one of these other incorrect answers.</Paragraph>
      <Paragraph position="2"> We alleviated this problem by optimizing (automatically) the weights (for trigram language model, word insertion penalty and phone insertion penalty) using the N-Best alternative hypotheses found after the lattice search. These new weights were then used to search the lattice again. Finally, we were able to obtain 5% fewer word errors using the 4-Pass Lattice strategy than when using the 3-Pass N-Best approach. This was a much smaller reduction in error than we had hoped for. Apparently the reduced search errors were largely offset by the larger search space on the lattice.</Paragraph>
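      <Paragraph> The weight optimization described above can be approximated by a simple grid search: re-rank each N-Best list under candidate language-model weights and insertion penalties, and keep the setting with the fewest word errors against the references. A toy sketch (a simplified stand-in for the automatic optimizer used here; the scoring model and data are invented):

```python
def word_errors(hyp, ref):
    """Levenshtein word-error count between two word sequences."""
    prev_row = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, 1):
        row = [i]
        for j, r in enumerate(ref, 1):
            row.append(min(prev_row[j] + 1,              # deletion
                           row[j - 1] + 1,               # insertion
                           prev_row[j - 1] + (h != r)))  # substitution
        prev_row = row
    return prev_row[-1]

def optimize_weights(nbest_lists, references, lm_weights, ins_penalties):
    """Grid-search the LM weight and word-insertion penalty that minimize
    total word errors when re-ranking each N-Best list.

    nbest_lists[k] is a list of (words, acoustic_logp, lm_logp) tuples.
    """
    best = None
    for lw in lm_weights:
        for ip in ins_penalties:
            errors = 0
            for hyps, ref in zip(nbest_lists, references):
                top = max(hyps,
                          key=lambda h: h[1] + lw * h[2] - ip * len(h[0]))
                errors += word_errors(top[0], ref)
            if best is None or best[0] > errors:
                best = (errors, lw, ip)
    return best

# One toy utterance: with no LM weight the noisier hypothesis wins.
nbest_lists = [[(("show", "me"), -10.0, -2.0),
                (("show", "me", "me"), -9.0, -5.0)]]
references = [("show", "me")]
best = optimize_weights(nbest_lists, references,
                        lm_weights=[0.0, 1.0], ins_penalties=[0.0])
```

Because only the N-Best lists are re-ranked, each candidate weight setting costs a handful of additions per hypothesis, which is why such tuning takes seconds rather than days of re-recognition.</Paragraph>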
      <Paragraph position="3"> It would appear, therefore, that the doom and gloom predictions for N-Best are unfounded so far, at least for the 20,000-word WSJ task. In fact, the N-Best paradigm continues to offer advantages not available otherwise, as mentioned below.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="412" end_page="413" type="metho">
    <SectionTitle>
4. CURRENT USES FOR N-BEST
</SectionTitle>
    <Paragraph position="0"> While it is possible to expand a lattice of alternatives for rescoring with trigram language models, there are still many knowledge sources that are too expensive to use this way.</Paragraph>
    <Paragraph position="1"> For example, for the November 1993 evaluations, we included a model of whole segments (Segmental Neural Network \[10\]). And Boston University also rescored our N-Best hypotheses with a similarly motivated Stochastic Segment Model \[9\]. Both of these models are much more expensive than HMM models due to their constrained slope features and global dependence. Either of these models reduces the word error rate by about 10% in combination with the HMM scores. We also experimented with a more complex HMM topology for a phoneme that includes thirteen states instead of the usual three or five states. While this model could have been integrated directly, it was much easier and faster to simply rescore the N-Best hypotheses with this larger model. The resulting small reduction in error rate would not have been worth the larger computation and storage associated with using it in the original search, not to mention the implementation time needed to integrate these models into the search.</Paragraph>
    <Paragraph position="2"> Also for the 1993 evaluations on the ATIS domain, we found that we could reduce the word error rate by 8% by rescoring the N-Best hypotheses with a four-gram class language model. Again, expanding the word lattice for a four-gram language model would have been possible, but would have resulted in a huge lattice with the same word replicated many times. But rescoring the N-Best hypotheses with four-grams required almost no computation and did not require rerunning the recognition.</Paragraph>
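    <Paragraph> Rescoring the text of N-Best hypotheses with a higher-order n-gram requires only table lookups over each word string, which is why it costs almost nothing compared with rerunning recognition. A toy sketch with an invented 4-gram table (the class mapping is omitted; the "BOS" padding and flat backoff value are assumptions of this sketch):

```python
def ngram_logprob(words, ngram_table, order=4, backoff=-10.0):
    """Log probability of a word sequence under a toy n-gram table.

    ngram_table maps a tuple of order-1 context words plus the next
    word to a log probability; unseen events get a flat backoff.
    """
    padded = ("BOS",) * (order - 1) + tuple(words)
    total = 0.0
    for i in range(order - 1, len(padded)):
        context = padded[i - order + 1:i]
        total += ngram_table.get(context + (padded[i],), backoff)
    return total

def rerank_with_ngram(nbest, acoustic_scores, ngram_table, lm_weight=1.0):
    """Re-rank N-Best text with the higher-order LM; no re-recognition."""
    return max(nbest,
               key=lambda h: acoustic_scores[h] +
                             lm_weight * ngram_logprob(h, ngram_table))

# Toy example: the 4-gram strongly prefers "flight" over "fight".
nbest = [("book", "a", "flight"), ("book", "a", "fight")]
acoustic_scores = {nbest[0]: -10.0, nbest[1]: -9.5}
fourgram = {("BOS", "BOS", "BOS", "book"): -1.0,
            ("BOS", "BOS", "book", "a"): -0.5,
            ("BOS", "book", "a", "flight"): -0.5,
            ("BOS", "book", "a", "fight"): -6.0}
best = rerank_with_ngram(nbest, acoustic_scores, fourgram)
```
</Paragraph>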
    <Paragraph position="3"> There is a tremendous advantage in being able to define any scoring function without having to get involved with the details of a general search strategy since only one linear sequence of words need be scored at one time.</Paragraph>
    <Paragraph position="4"> In combining these various experimental knowledge sources, it is important that they be weighted appropriately, or else there may be no gain, or even a loss. Optimizing the weights for several knowledge sources on a development test set of several hundred sentences can be accomplished in seconds or minutes on the N-Best hypotheses rather than days by explicit experimentation.</Paragraph>
    <Paragraph position="5"> And of course, we still use the N-Best paradigm to combine the speech recognition with the language understanding component. It would be infeasible to use the entire constrained space defined by the understanding model in the speech recognition search. But it is a trivial matter to provide several (5 to 10) alternatives to the understanding component for its choice. Again, in this year as in the past, we also provided the N-Best alternatives output from our speech recognition system to the language understanding group at Paramax. This simple text-based interface makes arbitrary integration simple. The integration between two sites across the ARPA network was quite straightforward.</Paragraph>
  </Section>
</Paper>