<?xml version="1.0" standalone="yes"?> <Paper uid="H91-1012"> <Title>BYBLOS SPEECH RECOGNITION BENCHMARK RESULTS</Title> <Section position="4" start_page="0" end_page="77" type="metho"> <SectionTitle> BYBLOS SYSTEM DESCRIPTIOPN </SectionTitle> <Paragraph position="0"> The BBN BYBLOS system had the following notable characteristics for the Feb 91 evaluation: phone, left-diphone, right-diphone, and phoneme contexts and included cross-word-boundary contexts. The individual context models were trained jointly in forward-backward.</Paragraph> <Paragraph position="1"> * Cooocurrence smoothing matrices were estimated from the triphone contexts only and then were applied to all HMM observation densities.</Paragraph> <Paragraph position="2"> * C~nder-dependent models were used for SI recognition. Each test sentence was decoded on both models and the final answer was chosen automatically by picking the one with the higher probability.</Paragraph> <Paragraph position="3"> * For all SI results other than the 109 RM condition, SI models were created by combining multiple, independently-trained and smoothed, speaker-dependent (SD) models. For the SI 109 condition, however, the training data was simply pooled before training.</Paragraph> <Paragraph position="4"> This baseline system is the same for both RM and ATIS results reported below.</Paragraph> </Section> <Section position="5" start_page="77" end_page="78" type="metho"> <SectionTitle> RESOURCE MANAGEMENT RESULTS </SectionTitle> <Paragraph position="0"> SI Recognition with 109 Training Speakers At the June 90 DARPA workshop, we reported our first result on the standard 109 SI training condition of 6.5% word error on the Feb 89 SI test set using the word-pair grammar. When we retest with the current system the en~r rate is reduced to 4.2%. For these two tests, the system differed in three ways. First, we augmented the signal representation with vectors of second--order differences for both the cepstral and energy features. Secondly, the discrete observation densities of the phonetic HMMs were generalized to mixtures of Gaussians that were tied across all states of all models, as in \[2\], \[5\]. The VQ input to the trainer preserved the 5 most probable mixture components for each speech frame. In the earlier system, only the input to the decoder was modeled as a mixture of Ganssians. To date, we have not found any improvement for re-estimating the parameters of the mixture components within forward--backward. Finally, in the newer system, we trained separate codebooks and HMMs for male and female talkers and selected the appropriate model automatically.</Paragraph> <Paragraph position="1"> We observed a small additive improvement for each of these three changes to the system.</Paragraph> <Paragraph position="2"> For the current Feb 91 evaluation, we ran our latest system on the standard 109 SI training condition with both no-grammar and the word--pair-grammar. Results for these runs are shown in the first two rows of table 1 below.</Paragraph> <Paragraph position="3"> SI Recognition with 12 Training Speakers Since it is often difficult and expensive to obtain speech from hundreds of speakers for a new domain, we recently proposed \[71 creating SI models from a much smaller number of speakers but using more speech from each speaker. 
<Paragraph position="4"> Twelve speakers could hardly be expected to cover all speaker types in the general population (including both genders), so we anticipated that smoothing would be needed to make the models robust for new speakers. Our usual technique for smoothing across the bins of the discrete densities, triphone cooccurrence smoothing [10], has proven to be an effective method for dealing with the widely varying amounts of training data for the detailed context-dependent models in the BYBLOS system. This technique estimates the probability of any pair of discrete spectra cooccurring in a density by counting over all the densities of the triphone HMMs. These probabilities are organized into a set of phoneme-dependent confusion matrices which are then used to smooth all the densities in the system.</Paragraph> <Paragraph position="5"> The data from each training speaker is kept separate through forward-backward (Baum-Welch) training and cooccurrence smoothing. The individually trained SD HMMs are then combined by averaging their parameters. We have found that this approach leads to better SI performance from a small number of training speakers than the usual practice of pooling all the data prior to training and smoothing.</Paragraph> <Paragraph position="6"> In table 1, we show that the model made from 12 training speakers performs as well as the standard 109-speaker model on the Feb 91 SI test set. This is better than we had expected based on our previous experience with the Feb 89 SI test set. To get a better estimate of the relative performance of the two approaches, we tested the current system on three evaluation test sets (Feb 89, Oct 89, Feb 91). Averaging the results for these 30 test speakers, the SI 109 model achieved 3.9% word error while the SI 12 model got 4.5%. This is a very small degradation for nearly a 10-fold decrease in the number of training speakers.</Paragraph> <Paragraph position="7"> Adaptation to Dialect We found that our current state-of-the-art SI models perform poorly when a test speaker's characteristics differ markedly from those of the training speakers. The speaker differences which cause poor recognition are not well understood, but outliers of this sort are not a rare phenomenon. Our SI models have difficulty with the RM test speaker RKM, for instance, a native speaker of English with an African-American dialect. Moreover, non-native speakers of American English nearly always suffer significantly degraded SI performance.</Paragraph> <Paragraph position="8"> The problem was made obvious in a pilot experiment we performed recently. As noted above, our baseline SI performance is currently about 4% word error using the standard 109 training speakers and the word-pair grammar. But when we tested four non-native speakers of American English under the same conditions, the word error rates ranged from 22% to 45%, as shown in table 2.</Paragraph> <Paragraph position="9"> The table also shows the native language of each speaker and the number of years that each has spoken English as their primary language. Even though they vary widely in their experience with English (and in their subjective intelligibility), each of them suffered severely degraded SI performance. Even native speakers of British English are subject to this degradation, as the result from speaker SA demonstrates. Furthermore, the result from speaker JM shows that this problem does not disappear even after many years of fluency in a second language.</Paragraph>
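Before turning to adaptation, the triphone cooccurrence smoothing used throughout these SI experiments can be sketched roughly as follows: cooccurrence counts over the triphone densities yield a confusion matrix that redistributes probability mass among codewords that tend to appear together. Pooling over all phonemes (rather than the phoneme-dependent matrices actually used) and the fixed interpolation weight are simplifications for illustration only.

```python
import numpy as np

def cooccurrence_matrix(triphone_densities, eps=1e-12):
    """Estimate P(codeword j | codeword i) from cooccurrence in triphone densities."""
    # triphone_densities: (num_densities, num_codewords), each row a discrete density
    cooc = triphone_densities.T @ triphone_densities   # codeword-pair cooccurrence mass
    return cooc / (cooc.sum(axis=1, keepdims=True) + eps)

def smooth_density(density, confusion, weight=0.5):
    """Interpolate a density with its confusion-smoothed version."""
    smoothed = density @ confusion                     # spread mass to cooccurring codewords
    mixed = (1.0 - weight) * density + weight * smoothed
    return mixed / mixed.sum()

# Toy example: 100 triphone densities over a 64-codeword codebook.
rng = np.random.default_rng(1)
densities = rng.dirichlet(np.ones(64), size=100)
confusion = cooccurrence_matrix(densities)
smoothed = smooth_density(densities[0], confusion)
```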
<Paragraph position="10"> We then tried to adapt the training models to the new dialects by estimating a probabilistic spectral mapping between each of the training speakers and the test speaker, as described in [7]. The resulting set of adapted models is combined to obtain a single adapted model for the test speaker. In this experiment, we used the 12 SD speakers from the RM corpus as training speakers. Each test speaker provided 40 utterances for adaptation and 50 additional ones for testing. The word error rates after adaptation are also shown in table 2. The overall average word error rate after speaker adaptation is 5 times better than SI recognition for these speakers! The success of speaker adaptation in restoring most of the performance degradation is quite surprising given that no examples of these dialects are included in the training data. Furthermore, only spectral and phonetic differences are modeled by our speaker adaptation procedure. No lexical variations were modeled directly; we used a single speaker-independent phonetic dictionary of American English pronunciations. These results show that systematic spectral and phonetic differences between speakers can account for most of the differences in the speech of native and non-native speakers of a language.</Paragraph> </Section> <Section position="6" start_page="78" end_page="78" type="metho"> <SectionTitle> THE ATIS CORPUS </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="78" end_page="78" type="sub_section"> <SectionTitle> Corpus Description </SectionTitle> <Paragraph position="0"> The ATIS corpus consists of several different types of speech data, collected in different ways. First, there are approximately 900 utterances that were collected during a &quot;Wizard&quot; simulation of an actual ATIS system. The subjects were trying to perform a particular task using the system. This data was collected from 31 speakers.</Paragraph> <Paragraph position="1"> The data from five of the speakers was used for the test of the natural language systems prior to the June 1990 meeting. These have since been designated as the development test speech. Thus, there remained 774 spontaneous training sentences from 26 speakers.</Paragraph> <Paragraph position="2"> In addition to the spontaneous sentences, several of the subjects read cleaned-up versions of their spontaneous queries. Finally, 10 subjects each read 300 sentences during a single 1-hour session. The first 100 sentences were read by all the subjects. The next 200 were chosen at random from a list of 2800 sentences constructed by BBN and SRI by generalizing from previously recorded sentences from several sites. The average total duration of the 300 sentences was about 18.5 minutes per speaker (counting only the parts of the utterances containing speech).</Paragraph> <Paragraph position="3"> The 774 sentences from a total of about 30 speakers are clearly not sufficient for creating a powerful speaker-independent speech model.</Paragraph> <Paragraph position="4"> Collecting speech from an additional 70 speakers would require a large additional effort.
Therefore, the additional 3000 sentences read by the 10 speakers provided the most efficient source of speech for estimating a speaker-independent model.</Paragraph> </Section> </Section> <Section position="7" start_page="78" end_page="79" type="metho"> <SectionTitle> ATIS Training Speech Characteristics </SectionTitle> <Paragraph position="0"> The subjects were instructed to push a button (push-to-talk) before speaking. However, they frequently pushed the button quite a while before they began speaking. In many cases, they breathed directly on the microphone windscreen, which was apparently placed directly in front of the mouth and nose. Therefore, many files contain long periods of silence with interspersed noises. In fact, only 55% of the total duration of the training data contains speech (for both the read and spontaneous data). In addition, some subjects paused in the middle of a sentence for several seconds while thinking about what to say next. Others made side comments to themselves or others while the microphone was live. All of these effects are included in the speech data, thus making it much more difficult to recognize than previously distributed data. In the RM corpus, there was a concerted effort to use subjects from several dialectal regions. In addition, since the speakers were reading, they tended toward standard General American. Thus, the models generated from this speech were reasonably robust for native speakers of American English. In contrast, the ATIS corpus consisted primarily of speakers from the South (26 of 31 speakers were labeled South Midland or Southern).</Paragraph> <Paragraph position="1"> In order to estimate speech models, we need an accurate transcription of what is contained in each speech file. This transcription is usually in the form of the string of words contained in each sentence. However, since this was spontaneous speech, there were often nonspeech sounds and long periods of silence included among the words. Most of these effects were marked for the spontaneous speech.</Paragraph> <Paragraph position="2"> Unfortunately, the transcriptions distributed with the read speech did not follow the usual conventions for the string of words. A significant amount of work was required to correct these inconsistencies. This work was undertaken by BBN and NIST, and was thoroughly checked by Lincoln. When all the corrections had been made, they were redistributed to the community.</Paragraph> <Paragraph position="3"> Definition of Common Baseline The new ATIS task presents several new problems for speech recognition. Therefore, it will be essential to try many new algorithms for dealing with it. These experiments will deal with a wide variety of topics, including the makeup of the training data, the vocabulary, and the grammar. However, it is just as important with this domain, as it was with the RM domain, that we use well-founded controlled experiments across the different sites. Without a baseline, meaningful comparisons between techniques cannot be made; a baseline is necessary for researchers at other sites to be able to determine whether a new technique has actually made a significant improvement over previous techniques.</Paragraph> <Paragraph position="4"> Since no common evaluation condition has been specified by the speech performance evaluation committee, BBN and MIT/Lincoln have defined, promoted, and distributed an ATIS control condition to provide a common reference baseline.
This baseline condition consists of a common lexicon, training set, and statistical grammar.</Paragraph> <Paragraph position="5"> In order to provide a useful baseline condition, all of the standardized data should represent a reasonable approximation to current state-of-the-art conditions. BBN and Lincoln undertook to define such a baseline condition, under the severe constraints of limited available data and time.</Paragraph> <Paragraph position="6"> Training Set We defined as the baseline training set all of the speech that had been distributed by NIST on CD-ROM, excepting the spontaneous speech spoken by the speakers used for the June 1990 test of ATIS. This test consisted of 138 sentences spoken by a total of 5 speakers (bd, bf, bm, bp, bw). While an additional 435 sentences that had been recorded at SRI were made available on tapes at a later date, we felt that the small amount of additional speech and the late date did not warrant including that speech in the baseline condition.</Paragraph> <Paragraph position="7"> Of course, the data was available to all who wanted to use it in any other experimental or evaluation condition.</Paragraph> <Paragraph position="8"> Vocabulary One of the variables in designing a real speech system is the specification of the recognition vocabulary. Given that we do not know what words will be included in the test speech, we have to make our best attempt to include those words that would seem reasonably likely. Of course, if we include too many words, the perplexity of the grammar will increase, and the recognition error rate will increase. We felt that, for a baseline condition, the vocabulary must be kept fixed, since we wanted to avoid random differences between sites due to correctly guessing which words would occur in the test.</Paragraph> <Paragraph position="9"> We decided, at BBN, to define a standard vocabulary based on the transcriptions of all of the designated training data. Thus, all of the words included in the Standard Normal Orthographic Representation (SNOR) transcriptions were included in the dictionary. We made sure that many fixed sets of words, such as the days, months, and numbers, were complete. In addition, we filled out many open-class word categories based on the specific ATIS database that was being used. This included plurals and possessive forms of words wherever appropriate.</Paragraph> <Paragraph position="10"> This included names of airlines, plane types, fare codes, credit cards, etc. When we did this completely, the result was a vocabulary of over 1700 words, most of which seemed quite unlikely. Therefore, we applied an additional constraint on new open-class words: we added to the vocabulary only the names of the airlines, plane types, etc., that served the 10 cities whose flights were included in the current database. In total, we added about 350 words to the vocabulary actually used in the training speech. This brought the baseline vocabulary up to 1067 words. The result, when measured on the development set, was that the number of words in the test that were not in the vocabulary decreased from 13 to 4.</Paragraph> <Paragraph position="11"> Grammar While the definition of the grammar to be used in speech recognition is certainly a topic of research, it is necessary to have a baseline grammar with which any new grammars may be compared. It is also essential that this standard grammar be relatively easy for most sites to implement, so that it is not an impediment to the use of the baseline condition. Therefore, Lincoln estimated the parameters of a statistical bigram grammar using the back-off technique developed by IBM [6]. The derivation of this grammar is described in more detail in [9]. The transcriptions used to estimate this grammar included those of all of the speech in the designated training set (SNOR transcriptions only) and also the 435 transcriptions for the new SRI set. The parameters of this model were distributed in simple text format so that all sites could use it easily. The grammar has a test set perplexity of 17 when measured on the development test set. Thus, it provided a low upper bound for comparison with any new language models.</Paragraph> </Section>
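For concreteness, the sketch below shows the general shape of a backed-off bigram language model and the kind of perplexity computation quoted above. The absolute discounting and the omission of proper back-off weights are simplifications; the distributed baseline grammar used the IBM back-off estimates of [6].

```python
import math
from collections import Counter

class BackoffBigram:
    """Toy backed-off bigram; not the distributed baseline grammar itself."""
    def __init__(self, sentences, discount=0.5):
        self.discount = discount
        self.unigrams = Counter(w for s in sentences for w in s)
        self.bigrams = Counter(p for s in sentences for p in zip(s, s[1:]))
        self.total = sum(self.unigrams.values())

    def prob(self, prev, word):
        if (prev, word) in self.bigrams:
            # Discounted relative frequency of the observed bigram.
            return (self.bigrams[(prev, word)] - self.discount) / self.unigrams[prev]
        # Back off to the unigram distribution (proper back-off weight omitted).
        return max(self.unigrams.get(word, 0), 0.5) / self.total

    def perplexity(self, sentences):
        log_prob, n_words = 0.0, 0
        for s in sentences:
            for prev, word in zip(s, s[1:]):
                log_prob += math.log2(self.prob(prev, word))
                n_words += 1
        return 2.0 ** (-log_prob / n_words)

train = [["show", "me", "flights", "to", "boston"], ["list", "flights", "to", "denver"]]
lm = BackoffBigram(train)
print(lm.perplexity([["show", "me", "flights", "to", "denver"]]))
```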
<Section position="8" start_page="79" end_page="80" type="metho"> <SectionTitle> BBN SPEECH TECHNIQUES USED FOR ATIS </SectionTitle> <Paragraph position="0"> In this section we describe the techniques that we used in the ATIS speech recognition evaluation. In particular, we only discuss those techniques that differed from those used for RM. The techniques include: 1. Speech/silence detection.</Paragraph> <Paragraph position="1"> 2. N-Best recognition and rescoring with detailed models. 3. Optimization.</Paragraph> <Paragraph position="2"> Each of these techniques is described below.</Paragraph> <Section position="1" start_page="79" end_page="80" type="sub_section"> <SectionTitle> Speech/Silence Detection </SectionTitle> <Paragraph position="0"> As described in a previous section, both the training and test speech contained large regions of silence mixed with extraneous noises.</Paragraph> <Paragraph position="1"> While the HMM training and recognition algorithms are capable of dealing with a certain amount of silence and background noise, they are not very good at dealing with long periods of silence with occasional noise. Therefore, we applied a speech end-point detector as a preprocess to the training and recognition programs. We found that this improved the ability of the training algorithms to concentrate on modeling the speech, and of the recognition programs to recognize sentences.</Paragraph> <Paragraph position="2"> N-Best Recognition Since the purpose of the speech recognition is to understand the sentences, we needed to integrate it with the natural language (NL) component of the BBN HARC spoken language understanding system. For this we use the N-Best recognition paradigm [3]. The basic steps are enumerated below: 1. Find the N-best hypotheses using non-cross-word models and a bigram grammar. 2. For each hypothesis: (a) rescore the acoustics with cross-word models; (b) score the word string with a more powerful statistical grammar. 3. Combine the scores and reorder the hypotheses. 4. Report the highest-scoring answer as the speech recognition result. 5. Feed the ordered list to NL. For efficiency, we use the Word-Dependent N-Best algorithm [11]. In addition to providing an efficient and convenient interface between speech and NL, the N-Best paradigm also provides an efficient means for applying more expensive speech knowledge sources. For example, while the use of cross-word triphone models reduces the error rate by a substantial factor, it greatly increases the storage and computation of recognition. In addition, a trigram or higher-order language model would immensely increase the storage and computation of a recognition algorithm. However, given the N-best hypotheses obtained using non-cross-word triphone models and a bigram grammar, each hypothesis can be rescored with any knowledge source desired. Then, the resulting hypotheses can be reordered. The top-scoring answer is then the speech recognition result. The entire list is then sent to the NL component, which chooses the highest-ranked answer that it can interpret. By using the N-Best paradigm we have found it efficient to apply more expensive knowledge sources (as a post-process) than we could have considered previously. Other examples of such knowledge sources include Stochastic Segment Models [8] and Segment Neural Networks [1].</Paragraph>
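A minimal sketch of the rescoring and reordering steps just enumerated is shown below. The scoring callables and weights are placeholders; in the actual system the acoustic rescoring uses cross-word triphone models, the language model is a more powerful statistical grammar, and the weights are optimized as described in the Optimization paragraph that follows.

```python
from typing import Callable, List, Tuple

def rescore_nbest(hypotheses: List[List[str]],
                  acoustic_score: Callable[[List[str]], float],
                  lm_score: Callable[[List[str]], float],
                  lm_weight: float = 1.0,
                  insertion_penalty: float = 0.0) -> List[Tuple[float, List[str]]]:
    """Rescore an N-best list and return it reordered, best hypothesis first."""
    scored = []
    for words in hypotheses:
        total = (acoustic_score(words)              # detailed acoustic rescoring
                 + lm_weight * lm_score(words)      # more powerful statistical grammar
                 - insertion_penalty * len(words))  # penalize long word strings
        scored.append((total, words))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored

# The top entry is the speech recognition answer; the whole reordered list
# would then be handed to the NL component.
```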
<Paragraph position="3"> Optimization We usually run the recognition several times on development test data in order to find the optimal values for a few system parameters, such as the insertion penalty and the relative weight for the grammar and acoustic scores. This is a very slow and inexact process. However, given the N-Best paradigm, it is a simple matter to find the values that maximize recognition accuracy. Briefly, we generate several hypotheses for each utterance. For each hypothesis, we factor the total score into a weighted combination of the acoustic score(s), the language model score, and the insertion penalty. Then, we search for the values of the weights that optimize some measure of correctness over a corpus. This technique is described more fully in [8].</Paragraph> </Section> </Section> <Section position="9" start_page="80" end_page="81" type="metho"> <SectionTitle> ATIS BBN AUGMENTED CONDITION </SectionTitle> <Paragraph position="0"> We decided to consider three different conditions beyond those specified in the common baseline condition. These include: 1. Use of additional training speech. 2. Inclusion of explicit nonspeech models. 3. More powerful statistical grammars. Additional training speech One of the easiest ways to improve the accuracy of a recognition system is to train it on a larger amount of speech from a representative sample of the population that will use it. Since there was clearly not time to record speech from a very large number of speakers, we decided to record a large amount of speech from a smaller number of speakers. We had shown previously [7] that this training paradigm results in similar accuracy with a smaller data collection effort (since the effort is largely proportional to the number of speakers rather than the total amount of speech). We collected over 660 sentences from each of 15 speakers. Five were male and ten were female. Due to the lack of time, most of the speakers were from the Northeast. However, we made an effort to include 4 female speakers from the Southeast and South Midland regions. We found that, once started, the subjects were able to collect about 300 sentences per hour comfortably.</Paragraph> <Section position="1" start_page="80" end_page="80" type="sub_section"> <SectionTitle> Nonspeech Models </SectionTitle> <Paragraph position="0"> One of the new problems in this speech data is that there were nonspeech sounds. Some were vocal sounds (e.g., &quot;UH&quot;, &quot;UM&quot;, etc.), while some were nonvocal (e.g., laughter, coughing, paper rustling, telephone rings, etc.). The only frequent nonspeech sound was &quot;UH&quot;, with 57 occurrences in the training corpus. All the rest occurred only 1 to 5 times. We created a separate &quot;word&quot; for each such event. Each consisted of its own special phoneme or two phonemes. All of them were included in the same language model class within the statistical language model.</Paragraph>
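The nonspeech modeling just described can be pictured as adding pseudo-words to the lexicon, each with one or two special phonemes, and assigning all of them to a single class in the language model. The event names, spellings, and phoneme symbols below are illustrative, not the actual conventions used.

```python
# Hypothetical lexicon entries for nonspeech events; each "word" gets one or
# two special phonemes that do not occur in ordinary words.
nonspeech_lexicon = {
    "[UH]":        ["uh_fp"],                  # filled pause
    "[UM]":        ["um_fp"],
    "[LAUGHTER]":  ["laugh_nv"],
    "[COUGH]":     ["cough_nv"],
    "[RUSTLE]":    ["paper_nv", "rustle_nv"],  # paper rustling
    "[RING]":      ["ring_nv"],                # telephone ring
}

# All nonspeech pseudo-words share one class in the statistical language model.
word_to_class = {word: "NONSPEECH" for word in nonspeech_lexicon}
```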
<Paragraph position="1"> While several of the nonspeech events were correctly detected in the development test speech, we found that the false alarm rate (i.e., typically recognizing a short word like &quot;a&quot; as &quot;UH&quot;) was about equal to the detection rate. Thus, there was no real gain from using nonspeech models in our development experiments.</Paragraph> </Section> <Section position="2" start_page="80" end_page="81" type="sub_section"> <SectionTitle> Statistical Language Models </SectionTitle> <Paragraph position="0"> In this section, we discuss the use of statistical language models that have been estimated from very limited amounts of text. We argue that it is clearly necessary to group words into classes to avoid robustness problems. However, the optimal number of classes seems to be higher than expected.</Paragraph> <Paragraph position="1"> Since there is essentially no additional cost for using complex language models within the N-Best paradigm, we decided to use a 4-gram statistical class grammar. That is, the probability of the next word depends on the classes of the three previous words.</Paragraph> <Paragraph position="2"> Need for Class Grammars An important area of research that has not received much attention is how to create powerful and robust statistical language models from a very limited amount of domain-dependent training data. We would certainly like to be able to use more powerful language models than a simple word-based bigram model.</Paragraph> <Paragraph position="3"> Currently, the most powerful &quot;fair&quot; grammars used within the program have been statistical bigram class grammars. These grammars, which use padded maximum likelihood estimates of class pairs, allow all words with some probability, and share the statistics for words that are within the same domain-dependent class. One issue of importance in defining a class grammar is the optimal number of classes into which words should be grouped. With more classes we can better distinguish between words, but with fewer classes there is more statistical sharing between words, making the grammar more robust.</Paragraph> <Paragraph position="4"> We compared the perplexity of three different grammars for the RM task, with 100 classes, 548 classes, and 1000 classes respectively. In the first, words were grouped mainly on syntactic grounds, with additional classes for the short, very common words. In the second, we grouped into classes only those words that obviously belonged together (that is, we had classes for ship names, months, digits, etc.); thus, most of the classes contained only one word. In the third grammar, there was a separate class for every word, thus resulting in a word-bigram grammar. We used the backing-off algorithm to smooth the probabilities for unseen bigrams. The perplexities of the three grammars measured on training data and on independent sentences are given in the table below.</Paragraph> <Paragraph position="5"> (Table 3: perplexities of the three class grammars on the training and test sets.)</Paragraph> <Paragraph position="6"> As shown in table 3, perplexity on the training set decreases as the number of classes increases, which is to be expected. What is interesting is the perplexity on the test set. Of the three grammars, the 548-class grammar results in the lowest test set perplexity. (Interestingly, the 548-class grammar is easier to specify than the 100-class grammar.) The increased perplexity for the 1000-class grammar is due to insufficient training data.</Paragraph>
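The class-grammar idea can be sketched as a two-part factorization: the probability of the next class given the previous class, times the probability of the word within its class, so that words in the same class share pair statistics. The add-one padding below is a stand-in for the padded maximum-likelihood estimates mentioned above, and the class assignments are supplied by the caller.

```python
from collections import Counter

def train_class_bigram(sentences, word_class):
    """Return p(word | prev_word) for a toy class bigram grammar."""
    class_pairs, class_counts, word_counts = Counter(), Counter(), Counter()
    for sentence in sentences:
        classes = [word_class[w] for w in sentence]
        class_counts.update(classes)
        class_pairs.update(zip(classes, classes[1:]))
        word_counts.update(sentence)
    num_classes = len(set(word_class.values()))

    def prob(prev_word, word):
        c_prev, c_word = word_class[prev_word], word_class[word]
        # Padded estimate of the class-pair probability (add-one as a stand-in).
        p_class = (class_pairs[(c_prev, c_word)] + 1) / (class_counts[c_prev] + num_classes)
        # Distribute the class probability over its member words.
        p_word_given_class = word_counts[word] / class_counts[c_word]
        return p_class * p_word_given_class

    return prob

word_class = {"boston": "CITY", "denver": "CITY", "flights": "FLIGHTS", "to": "TO"}
prob = train_class_bigram([["flights", "to", "boston"], ["flights", "to", "denver"]], word_class)
print(prob("to", "denver"))   # shares statistics with "to boston"
```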
<Paragraph position="7"> The effective difference between the 548- and 1000-class grammars was larger than implied by the average perplexity. The standard deviation of the word entropy was one half bit higher, which resulted in a doubling of the standard deviation of the perplexity. To explain, the word bigram grammar frequently has unseen word pairs with very low probability, while this effect is greatly reduced in the class grammar. Thus, as expected, the class grammar is much more robust.</Paragraph> <Paragraph position="8"> Initial recognition experiments also seem to indicate a factor-of-two difference in error rate between a class bigram grammar and a word bigram grammar of the same perplexity. These effects are likely to be even larger when we use higher-order n-gram models.</Paragraph> </Section> </Section> <Section position="10" start_page="81" end_page="81" type="metho"> <SectionTitle> ATIS RECOGNITION RESULTS </SectionTitle> <Paragraph position="0"> The table below contains the recognition results for the ATIS corpus for both the development test set and the evaluation set. The first line shows the recognition results for the development test set, consisting of 138 sentences spoken by five speakers (bd, bf, bm, bp, bw). All speech data from these five speakers was left out of the training. The development results are given for the &quot;augmented&quot; condition only.</Paragraph> <Paragraph position="1"> Next, we give the results for the evaluation test set. The first two results are the baseline condition and our augmented condition. We also give results separately for the subset of 148 sentences that were designated as Class A (unambiguous, context-independent queries) for the NL evaluation.</Paragraph> <Paragraph position="2"> To review the two basic conditions, the baseline condition used the standard vocabulary, training set, and grammar throughout. The augmented condition used more training data, a 4-gram class grammar, and a nonspeech model.</Paragraph> <Paragraph position="3"> The first clear result is that the error rates for the evaluation test set are more than twice those of the development test set. In addition, the perplexity of the evaluation test set is significantly higher than that of the development set (26 instead of 17 for the standard word-based bigram grammar, and 22 instead of 13 for the 4-gram class grammar).</Paragraph> <Paragraph position="4"> Thus, we surmise that the evaluation data is somehow significantly different from both the training data and the development test set.</Paragraph> <Paragraph position="5"> Next, it is clear that the Class A subset of the sentences presents fewer problems for recognition. This is also indicated in the perplexities we computed for the two subsets.</Paragraph> <Paragraph position="6"> Finally, we see that, for both the full set of 200 sentences and the Class A subset of 148, the augmented condition has about 20%-30% fewer word errors than the baseline condition. We are currently attempting to understand the causes of this improvement by more careful comparison to the baseline.
The augmented condition was rerun after including the training data from the held-out development test speakers (about 900 utterances), but this made no difference.</Paragraph> <Paragraph position="7"> We suspect, therefore, that very little gain was derived from the additional training speech collected at BBN (which suffers from both environmental and dialectal differences). We have also retested with a class bigram grammar instead of the 4-gram, and again there was no change in performance. This behavior may be explained by the large difference between the evaluation test data and the training data. It is interesting, then, that the higher-order grammar did not degrade in the presence of such a difference. This result also indicates that smoothing a word-based bigram by class definitions is important for training statistical grammars from small training corpora. We have not retested without the nonspeech models, but their contribution appears small from a preliminary review of the recognition errors made. The two worst test speakers were also the ones that tended to produce numerous pause fillers (e.g., &quot;UH&quot;, &quot;UM&quot;) as well as many other disfluencies. Clearly, better nonspeech modeling will be essential if we continue to evaluate on this kind of data.</Paragraph> </Section> </Paper>