<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1009"> <Title>DRAGON SYSTEMS RESOURCE MANAGEMENT BENCHMARK RESULTS FEBRUARY 1991</Title> <Section position="4" start_page="0" end_page="59" type="metho"> <SectionTitle> 2. OVERVIEW OF THE DRAGON CSR SYSTEM </SectionTitle>
<Paragraph position="0"> Dragon Systems' continuous speech recognition system was presented at the June 1990 DARPA meeting \[1,2,3\]. The system is speaker-dependent and was demonstrated to be capable of near real-time performance on an 844-word task (mammography reports) when running on a 486-based PC.</Paragraph>
<Paragraph position="1"> The signal processing is performed by an additional TMS32010-based board. The speech is sampled at 12 kHz and the signal representation is quite simple: there are only eight parameters -- 7 spectral components covering the region up to 3 kHz and an overall energy parameter -- a complete set of which is computed every 20 ms and used as input to the HMM-based recognizer.</Paragraph>
<Paragraph position="2"> The fundamental conceptual unit used in the system is the &quot;phoneme-in-context&quot; or PIC, where the word &quot;context&quot; in principle refers to as much information about the surrounding phonetic environment as is necessary to determine the acoustic character of the phoneme in question. Several related alternative approaches have appeared in the literature \[5,6,7\]. Currently, context for our models includes the identity of the preceding and succeeding phonemes as well as whether the phoneme is in a prepausally lengthened segment. PICs are modeled as a sequence of PELs (phonetic elements), each of which represents a &quot;state&quot; in an HMM. PELs may be shared among PIC models representing the same phoneme. A detailed description of models for PICs and how they are trained may be found in \[2\]. Modifications made to the PIC training procedure are presented in Section 4.</Paragraph>
<Paragraph position="3"> Recognition uses frame-synchronous dynamic programming to extend the sentence hypotheses, subject to the beam pruning used to eliminate poor paths. Another important component of the system is the rapid matcher, described in \[3\], which limits the number of word candidates that can be hypothesized to start at any given frame. Some alternative approaches to the rapid match problem have also been outlined by others \[8,9,10\].</Paragraph>
<Paragraph position="4"> A lexicon for the RM task had to be specified before models could be built. Pronunciations were supplied for each entry in the SNOR lexicon by extracting them from our standard lexicon. Any entries not found in Dragon's current general English lexicon were added by hand. The set of phonemes used for English contains 24 consonants, 17 vowels (each of which may have 3 degrees of stress), and 3 syllabic consonants. Approximately 22% of the entries in the SNOR lexicon have been given multiple pronunciations. These pronunciations may reflect stress differences, such as stressed and unstressed versions of function words, and expected pronunciation alternatives.</Paragraph>
<Paragraph position="5"> Roughly 39,000 PICs are used in modeling the vocabulary for this task. The set of PICs was determined by finding all of the PICs that can occur given the constraint that sentences must conform to the word-pair grammar. The training data used to build PIC models for the reference speaker comes primarily from general English isolated words and phrases, supplemented by a few hundred phrases from the RM1 training sentences. The generation and training of PICs is discussed in more detail in the next section.</Paragraph>
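The PIC inventory can be thought of as an enumeration: every phonetic context that the pronunciation lexicon and the word-pair grammar permit, within words and across allowed word boundaries, gives rise to a PIC. The sketch below illustrates that enumeration; it is a simplified reconstruction (one pronunciation per word, no stress marking, a silence placeholder at utterance edges), and the function and variable names are ours rather than Dragon's.

```python
# Illustrative sketch of enumerating the PICs (triphones) permitted by a
# pronunciation lexicon and a word-pair grammar.  Simplified assumptions:
# single pronunciations, no stress marks, "sil" as an utterance-edge
# placeholder; not Dragon's implementation.

def legal_pics(lexicon, word_pairs, edge="sil"):
    """lexicon: dict mapping word -> list of phonemes.
    word_pairs: set of (w1, w2) pairs allowed by the grammar."""
    pics = set()

    # Word-internal contexts: each phoneme with its neighbours inside the word.
    for phones in lexicon.values():
        padded = [edge] + phones + [edge]
        for i in range(1, len(padded) - 1):
            pics.add((padded[i - 1], padded[i], padded[i + 1]))

    # Cross-word contexts: the boundary phonemes of every legal word pair.
    for w1, w2 in word_pairs:
        p1, p2 = lexicon[w1], lexicon[w2]
        left = p1[-2] if len(p1) > 1 else edge
        right = p2[1] if len(p2) > 1 else edge
        pics.add((left, p1[-1], p2[0]))   # final phoneme of w1 with its successor
        pics.add((p1[-1], p2[0], right))  # initial phoneme of w2 with its predecessor
    return pics
```

An enumeration of this kind over the RM lexicon and word-pair grammar is what yields the roughly 39,000 PICs mentioned above.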
</Section> <Section position="5" start_page="59" end_page="59" type="metho"> <SectionTitle> 3. MODIFICATIONS TO THE SYSTEM FOR USE WITH THE RM TASK </SectionTitle>
<Paragraph position="0"> In order to run the RM benchmark task on the Dragon speaker-dependent continuous speech recognition system, several modifications were necessary. These modifications primarily concerned the signal acquisition and preprocessing stages. Prior to this evaluation, the system had only been evaluated on data obtained from Dragon's own acquisition hardware.</Paragraph>
<Paragraph position="1"> The signal processing, as described above, has always been performed by the signal acquisition board. Thus it was thought possible that the performance of the system would be highly tuned to the hardware. In order to run the RM data through the system, software was written to emulate the hardware. One question to be addressed is how well the signal processing software does in fact emulate the hardware. To assess this, a small test was performed using new data from Dragon's reference speaker. The speaker recorded, using the Dragon hardware, three sets of 100 sentences selected from the development test texts (those of BEF, CMR, and DAS).</Paragraph>
<Paragraph position="2"> Recognition was performed using the reference speaker's base models after adapting to the standard training sentences, and an average word error rate of 3.5% was recorded. The fact that this rate is comparable to the error rates of some of the better RM1 speakers suggests that we have emulated our standard signal processing reasonably well. An explicit comparison of performance on the reference speaker using our standard hardware and our software emulation will be available soon.</Paragraph>
<Paragraph position="3"> The language model used in the CSR system returns a log probability indicating the score of the candidate word. This was modified to return a fixed score if the word is allowed by the word-pair grammar, or a flag denoting that the sequence is impermissible.</Paragraph>
<Paragraph position="4"> The standard rapid match module was used in all of the experiments reported in this paper, in order to reduce processing time. We have not focused on the issue of processing time in the current phase of our research, and have therefore modified our standard rapid match parameter settings to be sufficiently conservative so as to ensure that only a small proportion of the errors are due to rapid match mistakes.</Paragraph>
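The language-model change described above reduces to a very small interface. The sketch below is a minimal illustration of that contract -- a fixed score for a word allowed by the word-pair grammar, a rejection flag otherwise. The function name, the particular fixed value, and the use of None as the flag are our assumptions rather than Dragon's code.

```python
def word_pair_score(prev_word, word, allowed_pairs, fixed_log_prob=-5.0):
    """Minimal sketch of a word-pair grammar acting as the language model.

    allowed_pairs -- set of (previous_word, word) pairs permitted by the grammar.
    Returns a fixed log probability when the pair is legal, or None as the
    'impermissible' flag.  Names and values are illustrative only.
    """
    if (prev_word, word) in allowed_pairs:
        return fixed_log_prob
    return None
```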
</Section> <Section position="6" start_page="59" end_page="61" type="metho"> <SectionTitle> 4. TRAINING ALGORITHMS FOR THE SPEAKER-DEPENDENT MODELS </SectionTitle>
<Paragraph position="0"> Dragon's strategy for phoneme-based training was described in detail in an earlier report \[2\]. We have used a fully automatic version of the same strategy to build speaker-dependent models for each of the RM1 speakers, using the reference speaker's models to provide an initial segmentation.</Paragraph>
<Paragraph position="1"> The goal was to build models in which the acoustic parameters and duration estimates were based almost entirely on the 600 training utterances for each speaker, using the reference speaker's models only in rare cases for which no relevant training data is available.</Paragraph>
<Paragraph position="2"> The recognition model for a word (or sentence) is obtained by concatenating a sequence of PICs, each of which is, in turn, modeled as a sequence of PELs. The PELs were selected in the course of the semi-automatic labeling of a large amount of data acquired from the reference speaker: about 9000 isolated words and 6000 short phrases. In changing to the Resource Management task, an additional set of task-specific training utterances from the reference speaker was added. Although less than 10% of the training data was drawn from the Resource Management task, most of the PICs that are legal according to the word-pair grammar are represented somewhere in the total training set. Legal PICs missing from the training set are typically like the sequence &quot;ah-uh-ee&quot; that would occur in &quot;WICHITA A EAST&quot;: for the most part, they do not occur in the training sentences and seem unlikely to occur in evaluation sentences.</Paragraph>
<Paragraph position="3"> The reference speaker's models are speaker-dependent in three distinct ways: 1. The parameters of the PELs depend on the spectral characteristics of the reference speaker's voice.</Paragraph>
<Paragraph position="4"> 2. The durations for the PELs in each Markov model for a PIC depend on the reference speaker's speaking rate and other features of his speech.</Paragraph>
<Paragraph position="5"> 3. The sequence of PELs used in the Markov model for a PIC depends on what allophone the reference speaker uses in a given context. We report on two techniques for creating speaker-dependent PICs starting with the reference speaker's models. The first is a straightforward adaptation algorithm, in which a new speaker's training utterances are segmented into PICs and PELs using a set of base models, and the segments are then used to re-estimate the parameters of the PELs and of the duration models. This algorithm is typically run multiple times. This approach is very effective in dealing with (1), since the 600 training sentences include data for almost all of the PELs. This strategy is less effective in dealing with (2), since only about 6000 of the 30000 PICs occur in the training scripts. Adaptation alone, however, can do nothing to change (3), the &quot;spelling&quot; of each PIC in terms of PELs.</Paragraph>
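For concreteness, one pass of this adaptation can be sketched as follows. The sketch is a deliberate simplification -- it re-estimates only a mean spectral vector and an expected duration per PEL from a segmentation produced with the current models, whereas the actual system re-estimates full PEL and duration models -- and every name and data layout in it is our assumption rather than Dragon's code. Each of the adaptation passes mentioned in the two steps below amounts to re-segmenting the training utterances and then applying a re-estimation of this kind.

```python
from collections import defaultdict
import numpy as np

def adaptation_pass(segments, old_models, min_frames=5):
    """One simplified adaptation pass (illustrative only).

    segments   -- iterable of (pel_name, frames) pairs, where frames is an
                  (n, 8) array of the eight signal parameters for one PEL
                  segment found by aligning the training utterances with the
                  current models.
    old_models -- dict pel_name -> {"mean": (8,) array, "dur": float}.
    Returns re-estimated models, keeping the old model for any PEL that has
    too little adaptation data.
    """
    frames_by_pel = defaultdict(list)
    durations_by_pel = defaultdict(list)
    for pel, frames in segments:
        frames_by_pel[pel].append(np.asarray(frames))
        durations_by_pel[pel].append(len(frames))

    new_models = {}
    for pel, old in old_models.items():
        data = frames_by_pel.get(pel)
        if not data or sum(len(f) for f in data) < min_frames:
            new_models[pel] = old          # no (or too little) data: keep old model
            continue
        stacked = np.vstack(data)
        new_models[pel] = {
            "mean": stacked.mean(axis=0),                  # spectral parameters
            "dur": float(np.mean(durations_by_pel[pel])),  # expected duration in frames
        }
    return new_models
```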
<Paragraph position="6"> The first technique uses the following two steps: Step 1: The data from all 12 of the speakers were used to adapt the reference speaker's models. Three passes of adaptation were performed with these data. Since Dragon's algorithm does not yet use mixture distributions, this has the effect of averaging together spectra for male and female talkers and generally &quot;washing out&quot; formants in PELs for vowels. The resulting &quot;multiple speaker&quot; models are not good enough to do speaker-independent recognition, but they serve as a better basis for speaker adaptation than do the reference speaker's models.</Paragraph>
<Paragraph position="7"> Step 2: For a given speaker, a maximum of six passes of adaptation are carried out, starting from the multiple-speaker models. The resulting models are used to segment the utterances into phonemes. At this point we have a good speaker-dependent set of PEL models, and a set of segmentations with which to proceed further.</Paragraph>
<Paragraph position="8"> The second technique begins with the models produced by the first technique, together with the segmentation of the training data into phonemes done using those same models.</Paragraph>
<Paragraph position="9"> Using this automatic labeling, speaker-dependent training is performed for each of the RM1 speakers to produce a new speaker-dependent set of PIC models -- with new PEL spellings and duration models. The algorithm is as follows: Step 1: For each phoneme in turn, all the labeled training data for that phoneme are extracted from the training sentences. For each PIC that involves the phoneme, an appropriate weighted average of these data is taken to create a spectral model (a sequence of expected values for each frame) for the PIC. Details of this averaging process may be found in our earlier report \[2\], but the key idea is to take a weighted average of phoneme tokens that represent the PIC to be modeled or closely related PICs.</Paragraph>
<Paragraph position="10"> The number of PICs to be constructed for each phoneme is of the same order of magnitude as the number of examples of the phoneme in the 600 training sentences. Since there are examples of only about 6000 PICs in the RM1 training sentences, for most PICs the models must be based entirely on data with either the left or right context incorrect. For about one-fifth of the 30000 PICs, there were insufficient related data to construct a spectral model (using the usual criteria for &quot;relatedness&quot;). This is frequently the case when a diphone corresponding to a legal word pair fails to occur in the training sentences.</Paragraph>
<Paragraph position="11"> Step 2: Dynamic programming is used to construct the sequence of PELs that best represents the spectral model for each PIC, thereby &quot;respelling&quot; the PIC in terms of PELs. This results in a speaker-dependent PEL spelling for each PIC. In the process, speaker-dependent durations for each PEL in a PIC are also computed.</Paragraph>
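A minimal form of this respelling step can be written as a dynamic program over the frames of the PIC's spectral model: each frame is assigned to a PEL, a penalty discourages gratuitous changes of PEL, and runs of consecutive frames assigned to the same PEL become that PEL's duration. The sketch below illustrates the idea under those assumptions (squared-error frame costs, a single switch penalty); it is not Dragon's implementation, and the names are hypothetical.

```python
import numpy as np

def respell_pic(target, pel_means, switch_penalty=1.0):
    """Respell a PIC's spectral model as a PEL sequence with durations.

    target     -- (T, 8) array: expected spectral values for each frame of the PIC.
    pel_means  -- dict pel_name -> (8,) array: spectral centre of each PEL.
    Returns a list of (pel_name, duration_in_frames) pairs minimising the
    total squared error plus a penalty for each change of PEL.
    """
    names = list(pel_means)
    T = len(target)
    # frame_cost[t, j]: cost of explaining frame t with PEL j.
    frame_cost = np.array([[np.sum((target[t] - pel_means[n]) ** 2)
                            for n in names] for t in range(T)])

    best = frame_cost[0].copy()        # best[j]: cost of frames 0..t ending in PEL j
    back = [[-1] * len(names)]         # back[t][j]: PEL used at frame t-1
    for t in range(1, T):
        prev_min = float(best.min())
        prev_arg = int(best.argmin())
        new_best = np.empty_like(best)
        row = []
        for j in range(len(names)):
            stay = best[j]                       # continue the same PEL
            switch = prev_min + switch_penalty   # start a new PEL at this frame
            if stay <= switch:
                new_best[j], prev = stay, j
            else:
                new_best[j], prev = switch, prev_arg
            new_best[j] += frame_cost[t, j]
            row.append(prev)
        best, back = new_best, back + [row]

    # Trace back the best PEL for each frame, then merge runs into durations.
    j = int(best.argmin())
    path = [j]
    for t in range(T - 1, 0, -1):
        j = back[t][j]
        path.append(j)
    path.reverse()

    spelling = []
    for j in path:
        if spelling and spelling[-1][0] == names[j]:
            spelling[-1][1] += 1
        else:
            spelling.append([names[j], 1])
    return [(name, dur) for name, dur in spelling]
```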
<Paragraph position="12"> Step 3: Step 2 results in respelled PICs for those PICs for which sufficient training data are available. For the remaining approximately 6000 PICs, the adapted PIC models of the reference speaker are used (as in technique 1). Merging these PICs results in a model for every legal PIC in the word-pair grammar.</Paragraph>
<Paragraph position="13"> The final stage consists of segmenting the training data into PELs and then re-estimating the parameters of the speaker-dependent PELs. In the process, duration distributions are also re-estimated.</Paragraph>
<Paragraph position="14"> The above algorithm to create speaker-dependent PIC models provides two sets of models with which we have experimented. The first set is referred to as the speaker-dependent RM models. The second set is the output of the final stage, and is referred to as the respelled speaker-dependent RM models.</Paragraph>
<Paragraph position="15"> Both sets of speaker-dependent models may contain unchanged PICs from the original reference speaker when no training data were available -- mainly unchanged duration models, since most PELs are used in a variety of PICs.</Paragraph>
</Section> <Section position="7" start_page="61" end_page="62" type="metho"> <SectionTitle> 5. RECOGNITION EXPERIMENTS AND DISCUSSION </SectionTitle>
<Paragraph position="0"> In this section we present results making use of the two sets of speaker-dependent models, as well as results on post-processing with the N-best algorithm.</Paragraph>
<Section position="1" start_page="61" end_page="62" type="sub_section"> <SectionTitle> 5.1 Comparison of two methods for speaker-dependent training </SectionTitle>
<Paragraph position="0"> The error rates using each of the training strategies are shown in Table 1. In this table we display the word error rates on the 100 development test sentences for each of the 12 RM1 speakers, and we also display the performance of the respelled models on the Feb91 evaluation data, which consisted of 25 sentences for each speaker.</Paragraph> </Section>
<Section position="2" start_page="62" end_page="62" type="sub_section"> <SectionTitle> Analysis of Errors for Speaker-Dependent Respelled PICs </SectionTitle>
<Paragraph position="0"> In the course of our research it has been enlightening to investigate the errors. We will now focus our discussion on the performance of the respelled models when recognizing the development data. The word error rates range from a low of 2.5% for speaker HXS to 10.5% for ERS, with an overall average error rate of 5.4%. When the very same system is run without the rapid match module, the amount of computation is vastly increased, but there is only a small reduction of the observed overall error rate, from 5.4% to 5.1%. Roughly 62% of the errors involve function words only, and the remaining 38% involve a content word (and may also include a function word error). Function words have an error rate of 7.6%, compared to 2.5% for content words. The most common content word error is &quot;SPS-40&quot;, which is often misrecognized as &quot;SPS-48&quot;. Other content word errors often involve homophones (such as &quot;ships+s&quot; --> &quot;ships&quot;). Function word deletions are more common than insertions, and substitutions may be symmetric (&quot;and&quot; --> &quot;in&quot; is as frequent as &quot;in&quot; --> &quot;and&quot;) or asymmetric (one direction of a confusion occurs but the reverse does not). Other common errors involve contractions: &quot;what is&quot; --> &quot;what+s&quot; and &quot;when will&quot; --> &quot;when+ll&quot;.</Paragraph>
<Paragraph position="1"> Use of alternate pronunciations: Approximately 22% of the lexical entries have alternate pronunciations. These variants are used to express expected pronunciation alternations and/or stress differences.</Paragraph> </Section>
<Section position="3" start_page="62" end_page="62" type="sub_section"> <SectionTitle> 5.2 N-Best Algorithm Test </SectionTitle>
<Paragraph position="0"> A recognition pass using an N-best algorithm was performed on the development test data. The N-best algorithm which we have implemented is similar to the one proposed by Soong and Huang \[4\]. It runs as a post-processing step and is essentially a stack decoder which processes the speech in reverse time. Computational results saved during the forward pass are used to provide very close approximations to the best score of a full transcription which extends a reverse partial transcription. Although a more complete description of the algorithm is beyond the scope of this paper, we note that a key difference between the algorithm we use and that of Soong and Huang is that we do a full acoustic match in the reverse pass (i.e., we process the speech data). Also, the reason our extension scores are only approximate is that in our current implementation the forward and reverse acoustic match scores are different.</Paragraph>
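In outline, the reverse search can be pictured as a best-first expansion of reverse partial transcriptions, each ranked by its own score plus the saved forward score at the frame where it currently begins. The sketch below is our reconstruction of that idea only; the interfaces (forward_best, extend) are hypothetical, and, as noted above, the real system performs a full acoustic match in the reverse pass rather than the word-level extension shown here.

```python
import heapq

def nbest_reverse_search(num_frames, forward_best, extend, n=100):
    """A*-style sketch of extracting N-best transcriptions in a reverse pass.

    forward_best[t]  -- best forward-pass score of any path reaching frame t
                        (log domain; higher is better).
    extend(words, t) -- yields (word, start_frame, score) triples: ways of
                        extending a reverse partial transcription that
                        currently begins at frame t by one more word.
    Returns up to n (score, word_sequence) results, best first.
    """
    # Heap entries: (-estimated_total, reverse_score, start_frame, words_so_far)
    heap = [(-forward_best[num_frames], 0.0, num_frames, ())]
    results = []
    while heap and len(results) < n:
        _, rev_score, t, words = heapq.heappop(heap)
        if t == 0:                                   # spans the whole utterance
            results.append((rev_score, list(words)))
            continue
        for word, t_start, score in extend(words, t):
            new_rev = rev_score + score
            estimate = new_rev + forward_best[t_start]   # completion estimate
            heapq.heappush(heap, (-estimate, new_rev, t_start, (word,) + words))
    return results
```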
<Paragraph position="1"> The test was run on the 1200 utterances from the RM1 development sentences, 100 each from the 12 RM1 speakers.</Paragraph>
<Paragraph position="2"> The parameters controlling the N-best search were set conservatively.</Paragraph>
<Paragraph position="3"> With high confidence, the 100 best alternative sentence transcriptions were delivered (slowing down the recognition by about a factor of six). These transcriptions included ones differing only in the placement of internal pauses and/or in alternative pronunciations. If such transcriptions are considered identical, 17 choices were delivered on average. The results given below do consider such transcriptions to be identical.</Paragraph>
<Paragraph position="4"> The forward algorithm determined the correct transcription 70% of the time, and the N-best algorithm delivered it as a choice 94% of the time (almost always as one of the top 15).</Paragraph>
<Paragraph position="5"> That is, for around 80% of the misrecognitions, the correct transcription was on the choice list. A cumulative count (based on the 1200 test utterances) is given in Table 2. For instance, the correct transcription was one of the top 5 choices 90% of the time.</Paragraph> </Section> </Section> </Paper>