<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2037">
  <Title>TOWARDS SPEECH RECOGNITION WITHOUT VOCABULARY-SPECIFIC TRAINING</Title>
  <Section position="7" start_page="272" end_page="273" type="evalu">
    <SectionTitle>
5. Experiments and Results
</SectionTitle>
    <Paragraph position="0"> We used a version of SPHINX for the experiments on our VI models. Since SPHINX is described elsewhere in these proceedings \[8\], we will not be repetitive here. We note, however, that between-word triphone models \[9\] and corrective training \[10\] were not used in this study. More detailed descriptions of SPHINX can be found in \[ 1, 2\].</Paragraph>
    <Paragraph position="1"> We used 90 sentences from 10 speakers from the voice calculator task for evaluation. The following training sets were used: VI-5000 Approximately 5000 sentences from resource management. The triphone coverage on the voice calculator task is 63.7%, and word coverage is 44.3%.</Paragraph>
    <Paragraph position="2"> HARV-TIMIT Approximately 5000 sentences from Harvard and TIMIT database. Triphone coverage is 91.9% and word coverage is 53.3% GENENG Approximately 5000 sentences from general English database. Triphone coverage is 96.9% and word coverage is 65.6% VI-10000 Approximately 10,000 sentences from resource management, TIM1T, and Harvard. Triphone coverage is 95.3%, and word coverage is 63.9%.</Paragraph>
    <Paragraph position="3"> VI-15000 Approximately 15,000 sentences from resource management, TIM1T, Harvard, and general English. Triphone coverage is 99.2%, and word coverage is 70.5%.</Paragraph>
    <Paragraph position="4"> VD-1000 Approximately 1000 sentences from voice calculator training. Triphone coverage is 100%, and word coverage is 100%.</Paragraph>
    <Paragraph position="5"> Our first experiment used 48 phonetic models, trained from each of the above four training sets, and tested them on the voice calculator task. Table 1 shows the accuracy of phone models. Although phones are well-covered in each of the three VI databases, the VD results are still much better than the VI results. This is due to the fact that the voice calculator has a small vocabulary, and the VD phone models were able to model the few contexts in this vocabulary well.  Next, we trained generalized triphone models on the four training databases. For each VI training set, we chose an appropriate number of generalized triphones to train from the training corpus. Then, for each phone in the voice calculator task, if the triphone context was covered, we mapped it to a generalized triphone. Otherwise, we used the corresponding context-independent phone. For vocabulary-dependent training, we felt that sufficient training was available for all triphones, so no generalization was performed, and we used VD triphone models. In all four cases, the trained model parameters were interpolated with context-independent phone models to avoid insufficient training. The results of these models are shown in Table 2. Also shown in Table 2 are the triphone and word coverage statistics using the above four training databases. Note that as training data is increased, triphone coverage improves more rapidly than word coverage. With 15,000 training sentences, almost all triphones are covered and the result is close to that from VD training with 1000 training sentences. Moreover, the result of GENENG which only conatins 5000 sentences is almost the same as that of VI-10000 which contains 10000 sentences, because the triphone coverage of GENENG is better. Therefore, to cover as many as triphone contexts is crucial for Vocabulary-Independent training.</Paragraph>
    <Paragraph position="6">  and the VD models. Assuming that we have a set of VI models trained from a large training database, and a vocabulary-dependent training set, we use the following algorithm to utilize both training sets:  1. Initialization - Use the VI models to initialize VD training. As before, for each phone in the voice calculator task, if the triphone is covered, then it is used to initialize that triphone. Otherwise, the corresponding context-independent phone is used.</Paragraph>
    <Paragraph position="7"> 2. Training - Run the forward-backward algorithm on the VD sentences to train a set of VD models.</Paragraph>
    <Paragraph position="8"> 3. Interpolation - Use deleted interpolation \[11, 1\]  to combine the appropriate task-specific VD models with the robust task-independent VI models.</Paragraph>
    <Paragraph position="9"> Table 3 shows results using the above interpolation algorithm. We found that the combination of the VI models from 15,000 sentences and the VD models from 1000 sentences can reduce the error rate by 18% over VD training alone. This algorithm can be used to improve any task-dependent recognizer given a set of VI models. Also, these results show that vocabulary-adaptation is promising.</Paragraph>
    <Paragraph position="10">  dependent models interpolated with vocabulary-independent models.</Paragraph>
    <Paragraph position="11"> We have begun to experiment without grammar; however, at the time of this writing, the results with VI models are not as good relative to the VD models.</Paragraph>
  </Section>
class="xml-element"></Paper>