TOWARDS SPEECH RECOGNITION WITHOUT VOCABULARY-SPECIFIC TRAINING

6. Conclusion and Future Work

This paper addressed the issue of vocabulary-independent subword modeling. Vocabulary independence is important because the overhead of vocabulary-dependent training is very high. Yet vocabulary-independent subword models must be consistent, trainable, and generalizable. We believe this requires a large training database and a set of flexible subword units. To this end, we have collected a large multi-speaker database, from which we trained generalized triphone models. We found that, with sufficient training, over 99% triphone coverage of the testing vocabulary can be attained.

We report a vocabulary-dependent word accuracy of 88.6%, while the best vocabulary-independent models achieved 86.7%. In another experiment, we found that it is possible to reduce the vocabulary-dependent error rate by 18% (to 90.7% word accuracy) by interpolating the vocabulary-dependent models with the vocabulary-independent ones.

These results are very encouraging. In the future, we hope to further enlarge our multi-speaker database. As this database grows, we hope to model other acoustic-phonetic detail, such as stress, syllable position, between-word phenomena, and units larger than triphones. To reduce the large number of resulting models, we will first use phonetic knowledge to identify the relevant ones, and then apply the clustering technique used in generalized triphones to reduce these detailed phonetic units to a set of generalized allophones. We will also experiment with top-down clustering of allophones. While the top-down approach may lead to less "optimal" clusters, it has more potential for generalization even when coverage is poor.

The choice of speaker independence gives us the luxury of plentiful training. We believe that the combination of phonetic knowledge and subword clustering will lead to subword models that are consistent, trainable, and generalizable. We hope that plentiful training, careful selection of contexts, and automatic clustering can compensate for the lack of vocabulary-specific training.
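To make the interpolation experiment above concrete, the sketch below linearly combines a vocabulary-dependent discrete output distribution with its vocabulary-independent counterpart. This is a minimal illustration under stated assumptions: the paper does not specify the weighting scheme, so the fixed weight `lam` and the function name `interpolate` are hypothetical; in practice such weights would typically be estimated from held-out data, e.g. by deleted interpolation.

```python
# Minimal sketch: interpolating a vocabulary-dependent (VD) discrete HMM
# output distribution with its vocabulary-independent (VI) counterpart.
# The fixed weight `lam` is a hypothetical placeholder; the paper does not
# give the weighting scheme.

def interpolate(vd_dist, vi_dist, lam=0.7):
    """Return lam * VD + (1 - lam) * VI, renormalized for safety."""
    assert len(vd_dist) == len(vi_dist)
    mixed = [lam * p + (1.0 - lam) * q for p, q in zip(vd_dist, vi_dist)]
    total = sum(mixed)
    return [p / total for p in mixed]

# Example: a 4-symbol codebook distribution for one HMM state.
vd = [0.70, 0.20, 0.05, 0.05]   # sharp, tuned to the task vocabulary
vi = [0.40, 0.30, 0.20, 0.10]   # smoother, trained across vocabularies
print(interpolate(vd, vi))
```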
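The bottom-up clustering of detailed phonetic units into generalized allophones can be sketched as follows. This is an assumption-laden illustration of agglomerative clustering in the style of generalized triphones: each allophone model is reduced to a (count, discrete distribution) pair, and the merge cost is taken to be the weighted entropy increase from pooling two models, a proxy for the likelihood lost by tying them. The exact distance measure, the stopping threshold `max_cost`, and the toy data are illustrative, not the paper's.

```python
# Minimal sketch: greedy bottom-up clustering of allophone models.
# Each model is (count, discrete output distribution); the cheapest pair
# is merged until no merge costs less than `max_cost` (an assumed knob).

import math

def entropy(dist):
    return -sum(p * math.log(p) for p in dist if p > 0.0)

def merge(a, b):
    (na, pa), (nb, pb) = a, b
    n = na + nb
    pooled = [(na * x + nb * y) / n for x, y in zip(pa, pb)]
    return (n, pooled)

def merge_cost(a, b):
    # Weighted entropy increase caused by pooling a and b.
    (na, pa), (nb, pb) = a, b
    n, pooled = merge(a, b)
    return n * entropy(pooled) - na * entropy(pa) - nb * entropy(pb)

def cluster(models, max_cost=5.0):
    """Greedily merge the closest pair until no merge is cheap enough."""
    models = list(models)
    while len(models) > 1:
        pairs = [(merge_cost(models[i], models[j]), i, j)
                 for i in range(len(models))
                 for j in range(i + 1, len(models))]
        cost, i, j = min(pairs)
        if cost > max_cost:
            break
        merged = merge(models[i], models[j])
        models = [m for k, m in enumerate(models) if k not in (i, j)]
        models.append(merged)
    return models

# Three hypothetical allophones of one phone, as (count, distribution).
allophones = [(100, [0.60, 0.30, 0.10]),
              (80,  [0.55, 0.35, 0.10]),
              (50,  [0.10, 0.20, 0.70])]
print(len(cluster(allophones)))  # the two similar allophones merge first
```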