<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2037">
  <Title>TOWARDS SPEECH RECOGNITION WITHOUT VOCABULARY-SPECIFIC TRAINING</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> One of the most exciting and promising areas of speech research is large-vocabulary continuous speech recognition.</Paragraph>
    <Paragraph position="1"> A myriad of applications await a good speech recognizer.</Paragraph>
    <Paragraph position="2"> However, while many reasonable recognizers exist today, they are impractical and inflexible due to the tedious process of configuring a recognizer. This tedium is typically embodied in one of the following forms: * Speaker-specific training: each speaker must speak for about an hour to train the system.</Paragraph>
    <Paragraph position="3"> * Vocabulary-specific training: with each new vocabulary comes the dilemma of tedious retraining for optimal performance, or tolerating substantially higher error rate.</Paragraph>
    <Paragraph position="4"> * Training time: with each new speaker or vocabulary, many hours are needed to process the speech and train the system.</Paragraph>
    <Paragraph position="5"> Recent work at Carnegie Mellon \[1, 2\] and several other laboratories has shown that highly accurate speaker-independent speech recognition is possible, thus alleviating the need for speaker-dependent training. However, these speaker-independent systems still need vocabulary-dependent training on a large population of speakers for each vocabulary, which requires a very large amount of time for data collection (weeks to months), dictionary generation (days to weeks), and processing (hours to days).</Paragraph>
    <Paragraph position="6"> As speech recognition flourishes and new applications emerge, the demand for vocabulary-specific training will become the bottleneck in building speech recognizers. In this paper, we will discuss our initial work to alleviate the tedious vocabulary-specific training process.</Paragraph>
    <Paragraph position="7"> Our work thus far has involved collecting and processing a large general English database, and evaluating the generalized triphone \[3, 2\] as a vocabulary independent unit. We collected and trained generalized triphone models on up to 15,000 training sentences, and compared our results to that from vocabulary-dependent models. We found that as we increased VI training data, VI generalized triphones improved from 109% more errors than vocabulary-dependent training to only 16% more errors. In another vocabulary-adaptation experiment, we found that interpolating vocabulary-dependent models with vocabulary-independent models reduces the error rate by 18%.</Paragraph>
    <Paragraph position="8"> Based on the enouraging results of this preliminary study, we conjecture that generalized triphones are a reasonable starting point in our search for a more vocabulary-independent subword unit. In the future, we hope to further increase our training database. With increased training data will come the ability to train more detailed subword models. We expect that this combination will enable us to further improve our results.</Paragraph>
    <Paragraph position="9"> In this paper, we will first discuss the issue of VI models. Next, we will briefly describe generalized triphones. Then, we will describe our databases and experimental results.</Paragraph>
    <Paragraph position="10"> Finally, we will close with some concluding remarks about this and future work.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="271" type="metho">
    <SectionTitle>
2. Vocabulary-Independent Subword Modeling
</SectionTitle>
    <Paragraph position="0"> Subword modeling has become an increasingly more important issue because as the vocabulary capacity of recognizers increases, it becomes difficult, if not impossible, to train whole-word models. Many subword modeling techniques have been proposed (see \[4\] for a survey on these techniques). However, most subword models were evaluated using the same vocabulary for training and testing. An important question that has often been ignored is: how well will these subword models perform under vocabulary-independent conditions? In other words, if we train on one vocabulary and test on another, will the performance degrade considerably? If so, it will then be necessary to retrain for each new vocabulary, which is time-consuming, tedious, and costly.</Paragraph>
    <Paragraph position="1"> Why should performance degrade across vocabularies? There are two main causes: the lack of coverage and the inconsistency of the models. The coverage problem is caused  by the fact that the phonetic events in the testing vocabulary are not covered by the training vocabulary. This lack of coverage makes it impossible to train the models needed for the testing vocabulary. Instead, we must improvise with a more general model. For example, if the phone/t/ in the triphone context of/s t r/ occurs in testing but not in training, it will be necessary to use a more general model, such as/t/ in the context of/s t/, or/t r/, or even a context-independent/t/.</Paragraph>
    <Paragraph position="2"> The problem of improvising with general models is that they may become inconsistent. That is, the same model may generate many different realizations. For example, if context-independent phone models are used, the same model for/t/ must capture various events, such as flapping, unreleased stop, and realizations in / t s/and/t r/. Then, if/t s/ is the only context in which/t/occurs in the training, while /t r/is the only context in the testing, the model used will be highly inappropriate.</Paragraph>
    <Paragraph position="3"> To ensure that the models are consistent and that new contexts are covered, it is necessary to account for all causes of phonetic variability. However, the enumeration of all the causes* will lead to an astronomical number of models. This makes the models untrainable, which renders them powerless.</Paragraph>
    <Paragraph position="4"> In view of the above analysis, we believe that a successful approach to vecabulary-independent subword modeling must use models that are consistent, trainable, and generalizable.</Paragraph>
    <Paragraph position="5"> Consistency means the variabilities within a model should be minimized; trainability means there should be sufficient training data for each model; and generalizability means reasonable models for the testing vocabulary can be used in spite of the lack of precise coverage in the training.</Paragraph>
  </Section>
  <Section position="5" start_page="271" end_page="271" type="metho">
    <SectionTitle>
3. Generalized Triphones
</SectionTitle>
    <Paragraph position="0"> In this section, we describe the basis of our current work generalized triphone models, which are based on triphone models \[5\]. Triphones account for the left and right phonetic contexts by creating a different model for each possible context pair. Since the left and right phonetic contexts are the most important factors that affect the realization of a phone, triphone models are a powerful technique and have led to good results. However, there are a very large number of triphones, which can only be sparsely trained. Moreover, they do not take into account the similarity of certain phones in their effect on other phones (such as /b/ and /p/ on vowels).</Paragraph>
    <Paragraph position="1"> In view of this, we introduce generalized triphone models \[3\]. Generalized triphones are created from triphone models using a clustering procedure that combines triphone HMMs according to a maximum likelihood criterion. In other words, we want to cluster triphones into a set of generalized *A paxtial list might include: ph(metic contexts, articulator position, stress, semantics, prosody, intonation, dialect, accent, loudness, speaking-rate, speaker anat(mly, ete.</Paragraph>
    <Paragraph position="2"> triphones that will have as high as possible a probability of generating the training data. This is consistent with the maximum-likelihood criterion used in the forward-backward algorithm.</Paragraph>
    <Paragraph position="3"> Context generalization provides the ideal means for finding the equilibrium between trainability and consistency. Given a fixed amount of training data, it is possible to find the largest number of trainable models that are consistent. Moreover, it is easy to incorporate other causes of variability such as stress, syllable position, and word position.</Paragraph>
    <Paragraph position="4"> One flaw with bottom-up clustering approaches to context generalization is that there is no easy way of generalizing to contexts that have not been observed before. Indeed, in a pilot experiment, we found that generalized triphones trained on the resource management task performed poorly on a new voice calculator vocabulary. We believe this was mainly due to the fact that 36.3% of the triphones in the testing vocabulary were not covered, and context-independent phones had to be used.</Paragraph>
    <Paragraph position="5"> In order to overcome these problems, we need a much larger database that has a better coverage of the triphones that are more vocabulary-independent. To that end, we are currently collecting a general English database. Our first step is to use this database to train triphone and generalized triphone models, and then evaluate them on new vocabularies. As this database grows, more triphone-based models can be adequately trained. Eventually, we will be able to model other acoustic-phonetic detail such as stress, syllable position, between-word phenomena, and units larger than triphones.</Paragraph>
  </Section>
  <Section position="6" start_page="271" end_page="272" type="metho">
    <SectionTitle>
4. Databases
</SectionTitle>
    <Paragraph position="0"> Training : The General English Database In order to train VI models, we need a very large training database that covers all English phonetic variations. Fortunately, because our focus is speaker-independent recognition, this database can be collected incrementally without creating an unreasonable burden on any speaker. Initially, this database is a combination of four sub-databases, which we will describe below. Two of the databases were recorded at Texas Instruments in a soundproof booth, and the other two were collected at Carnegie Mellon in an office environment.</Paragraph>
    <Paragraph position="1"> The same microphone and processing were used for all four sub-databases. The ratio of male to female speakers is about two to one in all four sub-databases.</Paragraph>
    <Paragraph position="2"> The first database is the 991-word resource management database \[6\], which was designed for inquiry of naval resources. For this study, we used a total of 4690 sentences from the 80 training and the 40 development test speakers.</Paragraph>
    <Paragraph position="3"> The TIMIT database \[7\] consists of 630 speakers, each saying 10 sentences. We used a subset of this database, including a total of 420 speakers and 3300 sentences. There are total of 4900 different words.</Paragraph>
    <Paragraph position="4"> The Harvard database consists of 108 speakers each say- null ing 20 sentences for a total of 2160 sentences. There are 1900 different words.</Paragraph>
    <Paragraph position="5"> The General English database consists of 250 speakers each saying 20 sentences for a total 5000 sentences. It covers about 10000 different words.</Paragraph>
    <Paragraph position="6"> Testing: The Voice Calculator Database Art independent task and vocabulary was created to test the VI models. This task deals with the operation of a calculator by voice. There are 122 words, including the alphabet and numbers, which are highly confusable. We used 1000 sentences from 100 speakers to train vocabulary-dependent models and 90 sentences from 10 speakers to test various systems under a word-pair grammar with perplexity 53.</Paragraph>
  </Section>
class="xml-element"></Paper>
Download Original XML