<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1035"> <Title>Improving State-of-the-Art Continuous Speech Recognition Systems Using the N-Best Paradigm with Neural Networks</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> DARPA Resource Management Corpus \[1\]. </SectionTitle> <Paragraph position="0"> Also, the training of the neural net was performed only on the correct transcription of the utterances. In this paper, we describe the performance of the hybrid system on the speaker-independent portion of the Resource Management corpus, using discriminative training on the whole N-best list. Below, we give a description of the SNN, the integration of the SNN with the HMM models using the N-best paradigm, the training of the hybrid SNN/HMM system using the whole N-best list, and the results on a development set.</Paragraph> </Section> <Section position="5" start_page="0" end_page="180" type="metho"> <SectionTitle> SEGMENTAL NEURAL NET STRUCTURE </SectionTitle> <Paragraph position="0"> The SNN differs from other approaches to the use of neural networks in speech recognition in that it attempts to recognize each phoneme by using all the frames in a phonetic segment simultaneously to perform the recognition. The SNN is a neural network that takes the frames of a phonetic segment as input and produces as output an estimate of the probability of a phoneme given the input segment.</Paragraph> <Paragraph position="1"> But the SNN requires the availability of some form of phonetic segmentation of the speech. To consider all possible segmentations of the input speech would be computationally prohibitive. We describe in Section 3 how we use the HMM to obtain likely candidate segmentations. Here, we shall assume that a phonetic segmentation has been made available.</Paragraph> <Paragraph position="2"> The structure of a typical SNN is shown in Figure 1. The input to the net is a fixed number of frames of speech features (5 frames in our system). The features in each 10-ms frame consist of 16 scalar values: 14 mel-warped cepstral coefficients, power, and power difference. Thus, the input to the SNN consists of a total of 80 features. But the actual number of frames in a phonetic segment is variable. Therefore, we convert the variable number of frames in each segment to a fixed number of frames (in this case, five frames). In this way, the SNN is able to deal effectively with variable-length segments in continuous speech. The requisite time warping is performed by a quasi-linear sampling of the feature vectors comprising the segment. For example, in a 17-frame phonetic segment, we would use frames 1, 5, 9, 13, and 17 as input to the SNN. In a 3-frame segment, the five frames used are 1, 1, 2, 3, 3, with a repetition of the first and third frames. (Figure 1: the SNN takes the frames in a segment and produces a single segment score.)</Paragraph> <Paragraph position="3">
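As a rough illustration of the quasi-linear sampling described above, the following sketch (not from the paper; the function name and the use of NumPy are assumptions) maps a variable-length segment of 16-dimensional frames onto the fixed five-frame, 80-feature SNN input, reproducing the 17-frame and 3-frame examples:

```python
import numpy as np

def sample_frames(segment_frames: np.ndarray, n_out: int = 5) -> np.ndarray:
    """Quasi-linearly sample a variable-length segment down (or up) to a
    fixed number of frames, repeating frames when the segment is short.

    segment_frames: array of shape (n_frames, 16), one 16-feature vector per
    10-ms frame (14 mel-warped cepstra, power, power difference).
    Returns an array of shape (n_out, 16), which can be flattened to the
    80 SNN inputs.
    """
    n_in = len(segment_frames)
    # Pick n_out indices spread evenly over the naturally occurring frames;
    # rounding (rather than interpolating) keeps the original feature vectors.
    idx = np.round(np.linspace(0, n_in - 1, n_out)).astype(int)
    return segment_frames[idx]

# Example: a 17-frame segment yields frames 1, 5, 9, 13, 17 (1-based),
# and a 3-frame segment yields frames 1, 1, 2, 3, 3.
print(np.round(np.linspace(0, 16, 5)).astype(int) + 1)  # [ 1  5  9 13 17]
print(np.round(np.linspace(0, 2, 5)).astype(int) + 1)   # [1 1 2 3 3]
```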
In this sampling, we are using a result from Stochastic Segment Models (SSM) \[5\] in which it was found that sampling of naturally-occurring frames gives better results than strict linear interpolation.</Paragraph> <Paragraph position="4"> Since there are 53 phonemes defined in our system, we used SNNs with 53 outputs, each representing one of the phonemes in the system.</Paragraph> </Section> <Section position="6" start_page="180" end_page="181" type="metho"> <SectionTitle> THE N-BEST RESCORING PARADIGM </SectionTitle> <Paragraph position="0"> Without an algorithm that can efficiently search all word-sequence and segmentation possibilities in a large-vocabulary CSR system, the amount of computation required to incorporate the SNN into such a system would be prohibitive.</Paragraph> <Paragraph position="1"> However, it is possible to use the N-best paradigm to make such an incorporation feasible.</Paragraph> <Paragraph position="2"> The N-best paradigm \[7,6\] was originally developed at BBN as a simple way to ameliorate the effects of errors in speech recognition when integrating speech with natural language processing. Instead of producing a single sequence of words, the recognition component produces a list of the N best-scoring sequences. The list of N sentences is ordered by overall score in matching the input utterance. For integration with natural language, we send the list of N sentences to the natural language component, which processes the sentences in the order given and chooses the highest-scoring sentence that can be understood by the system. However, we found that the N-best paradigm can also be very useful for improving speech recognition performance when more expensive sources of knowledge (such as cross-word effects and higher-order statistical grammars) cannot be computed efficiently during the recognition. All one does is rescore the N-best list with the new sources of knowledge and re-order the list. The SNN is a good example of an expensive knowledge source whose use would benefit greatly from N-best rescoring, thus comprising a hybrid SNN/HMM system using the N-best rescoring paradigm. (Figure 2: block diagram of the hybrid SNN/HMM system, in which HMM rescoring provides the segmentation, labels, and HMM scores, and SNN rescoring provides the SNN scores.)</Paragraph> <Paragraph position="3"> Figure 2 shows a block diagram of the hybrid SNN/HMM system. A spoken utterance is processed by the HMM recognizer to produce a list of the N best-scoring sentence hypotheses. The length of this list is chosen to be long enough to almost always include the correct answer (from experience, N=20 is usually sufficient). Thereafter, the recognition task is reduced to selecting the best hypothesis from the N-best list. Because these N-best lists are quite short (e.g., N=20), each hypothesis can be examined and scored using algorithms which would have been computationally impossible with a search through the entire vocabulary. In addition, it is possible to generate several types of scoring for each hypothesis.
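The following sketch outlines the rescoring loop just described. It is a hypothetical illustration, not the paper's implementation: the data structures, helper names (hmm.align, snn), and weight keys are assumptions, and the composite linear combination anticipates the scoring detailed in the next paragraphs.

```python
import math

def rescore_nbest(nbest, snn, hmm, weights):
    """Rescore an N-best list (e.g. N=20) with the SNN and reorder it.

    nbest   : list of hypotheses, each with .words, .hmm_score, .grammar_score
    snn     : callable mapping a fixed 5x16 segment input to 53 phoneme outputs
    hmm     : provides a forced alignment (segments with labels) per hypothesis
    weights : linear-combination weights tuned on a development test set
    """
    rescored = []
    for hyp in nbest:
        segments = hmm.align(hyp.words)  # most likely HMM state sequence -> segmentation
        # SNN score: logarithm of the product of SNN outputs over all segments
        snn_score = sum(math.log(snn(seg.frames)[seg.phoneme]) for seg in segments)
        composite = (weights["hmm"] * hyp.hmm_score
                     + weights["snn"] * snn_score
                     + weights["grammar"] * hyp.grammar_score
                     + weights["n_words"] * len(hyp.words)     # acts like a word insertion penalty
                     + weights["n_phones"] * len(segments))    # acts like a phoneme insertion penalty
        rescored.append((composite, hyp))
    rescored.sort(key=lambda x: x[0], reverse=True)  # reorder by the new composite score
    return rescored[0][1]                            # single best hypothesis, if one is required
```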
This not only provides a very effective means of comparing the effectiveness of different speech models (e.g., SNN versus HMM), but it also provides an easy way to combine several radically different models.</Paragraph> <Paragraph position="4"> The most obvious way in which the SNN could use the N-best list would be to use the HMM system to generate a segmentation for each N-best hypothesis (by finding the most likely HMM state sequence according to that hypothesis) and to use the SNN to generate a score for the hypothesis using this segmentation. This SNN score for a hypothesis is the logarithm of the product of the appropriate SNN outputs for all the segments in a segmentation according to that hypothesis. The chosen answer would be the hypothesis with the best SNN score. However, it is also possible to generate several scores for each hypothesis, such as the SNN score, the HMM score (which is the logarithm of the HMM likelihood), the grammar score, and the hypothesized number of words and phonemes. We can then generate a composite score by, for example, taking a linear combination of the individual scores. After we have rescored the N-best list, we can reorder it according to the new scores. If the CSR system is required to output just a single hypothesis, the highest-scoring hypothesis is chosen. We call this whole process the N-best rescoring paradigm.</Paragraph> <Paragraph position="5"> The linear combination that comprises the composite score is determined by selecting the weights that give the best performance over a development test set. These weights can be chosen automatically \[4\]. The numbers of words and phonemes are included in the composite score because they serve the same purpose as word and phoneme insertion penalties in an HMM CSR system.</Paragraph> </Section> <Section position="7" start_page="181" end_page="182" type="metho"> <SectionTitle> SEGMENTAL NEURAL NET TRAINING 1-Best Training </SectionTitle> <Paragraph position="0"> In our original training algorithm, we first segmented all of the training utterances into phonetic segments using the HMM models and the utterance transcriptions. Each segment then serves as a positive example of the SNN output corresponding to the phonetic label of the segment and as a negative example for all the other 52 phonetic SNN outputs.</Paragraph> <Paragraph position="1"> We call this training method 1-best training.</Paragraph> <Paragraph position="2"> The SNN was originally trained using a mean-square error (MSE) criterion - i.e., the SNN was trained to minimize</Paragraph> <Paragraph position="4"> where yc(n) is the network output for phoneme class c for the n-th training vector and dc(n) is the desired output for that vector (1 if the segment belongs to class c and 0 otherwise).</Paragraph> <Paragraph position="5"> This measure can lead to gross errors at low values of yc(n) when segment scores are multiplied together. Accordingly, we adopted the log-error training criterion \[3\], which is of the form</Paragraph> <Paragraph position="7"> This can be shown to have several advantages over the MSE criterion. When the neural net non-linearity is the usual sigmoid function, this error measure has only one minimum for single-layer nets.
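The two error criteria referred to above were dropped in the extraction of this text. As a hedged reconstruction consistent with the surrounding definitions (a standard mean-square error and a log-error, cross-entropy-style form; the exact formulas in the original paper may differ), they can be written as:

```latex
% Plausible reconstruction of the missing equations (assumption, not the
% paper's exact notation): the MSE criterion
\[
  E_{\mathrm{MSE}} \;=\; \sum_{n}\sum_{c}\bigl(y_c(n)-d_c(n)\bigr)^2 ,
\]
% and a log-error criterion of the kind described in \[3\]
\[
  E_{\mathrm{log}} \;=\; -\sum_{n}\sum_{c}\Bigl[\, d_c(n)\log y_c(n) \;+\; \bigl(1-d_c(n)\bigr)\log\bigl(1-y_c(n)\bigr)\Bigr].
\]
```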
In addition, the gradient is simple and avoids the problem of &quot;weight locking&quot; (where large errors do not change because of small gradients in the sigmoid).</Paragraph> <Paragraph position="8"> Duration. Because of the time-warping function (which transforms phonetic segments of any length into a fixed-length representation), the SNN score for a segment is independent of the duration of the segment. In order to provide information about the duration to the SNN, we constructed a simple durational model. For each phoneme, a histogram was made of segment durations in the training data. This histogram was then smoothed by convolving with a triangular window, and probabilities falling below a floor level were reset to that level. The duration score was multiplied by the neural net score to give an overall segment score.</Paragraph> <Paragraph position="9"> N-best Training. In our latest version of the training algorithm, we take the N-best paradigm a step further and perform what we call N-best training, which is a form of discriminative training. First, we take the HMM-based segmentations of the training utterances according to the correct word sequence. These segments are used only as positive examples (i.e., trained to output 1) for the appropriate SNN outputs.</Paragraph> <Paragraph position="10"> We then produce the N-best lists for all of the training sentences. For each of the incorrect hypotheses in the N-best list, we obtain the HMM-based segmentation and isolate those segments that differ from the segmentation according to the correct transcription and use them as negative training for the SNN outputs (i.e., trained to output 0). Thus we train negatively on the &quot;misrecognized&quot; parts of the incorrect hypothesis.</Paragraph> <Paragraph position="11"> In effect, the SNN is trained to reject those segments from an incorrect hypothesis that the HMM considers likely. This new method has the advantage that the SNN is specifically trained to discriminate among the choices that the HMM system considers difficult. This is better than the 1-best training algorithm, which only uses the segmentation of the correct utterance transcription, because N-best training directly optimizes the performance of the SNN in the N-best rescoring paradigm.</Paragraph> <Paragraph position="12"> If, for example, the transcription of part of an utterance was &quot;... carriers were in Atlantic...&quot; and a likely N-best hypothesis was &quot;... carriers one Atlantic...&quot;, the segments corresponding to the word &quot;one&quot; (as generated by a constrained HMM alignment) would be presented to the SNN as negative training. To determine if a segment should be presented to the SNN as negative input, the label and position of each segment in a hypothesis are compared to the segments seen in the correct segmentation of the utterance. If either the label or the position (subject to a tolerance threshold) of the segment does not match a segment in the correct segmentation, it is presented as negative training.</Paragraph> </Section> </Paper>
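As an illustrative sketch of the negative-example selection rule for N-best training described above (the data structures, field names, and the frame tolerance are assumptions, not the paper's implementation):

```python
def negative_segments(hyp_segments, correct_segments, tolerance=2):
    """Select segments of an incorrect N-best hypothesis to use as negative
    training (target output 0) for the SNN: a segment is negative training
    unless both its label and its position (within a tolerance, here in
    frames) match some segment of the correct segmentation.
    """
    negatives = []
    for seg in hyp_segments:
        matched = any(seg.phoneme == ref.phoneme
                      and abs(seg.start - ref.start) <= tolerance
                      and abs(seg.end - ref.end) <= tolerance
                      for ref in correct_segments)
        if not matched:
            # e.g. the segments of "one" in "... carriers one Atlantic ..."
            negatives.append(seg)
    return negatives
```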