<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1083"> <Title>PERFORMANCE OF SRI'S DECIPHER TM SPEECH RECOGNITION SYSTEM ON DARPA'S CSR TASK</Title> <Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> PERFORMANCE OF SRI'S DECIPHER TM SPEECH RECOGNITION SYSTEM ON DARPA'S CSR TASK </SectionTitle> <Paragraph position="0"/> </Section> <Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> SRI International </SectionTitle> <Paragraph position="0"/> </Section> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1. ABSTRACT </SectionTitle> <Paragraph position="0"> SRI has ported its DECIPHER TM speech recognition system from DARPA's ATIS domain to DARPA's CSR domain (read and spontaneous Wall Street Journal speech). This paper describes what needed to be done to port DECIPHER TM and reports experiments performed with the CSR task.</Paragraph> <Paragraph position="1"> The system was evaluated on the speaker-independent (SI) portion of DARPA's February 1992 &quot;Dry-Run&quot; WSJ0 test and achieved 17.1% word error without verbalized punctuation (NVP) and 16.6% error with verbalized punctuation (VP). In addition, we increased the amount of training data and reduced the VP error rate to 12.9%. This SI error rate (with a larger amount of training data) equalled the best 600-training-sentence speaker-dependent error rate reported for the February CSR evaluation. Finally, the system was evaluated on the VP data using microphones unknown to the system instead of the training set's Sennheiser microphone, and the error rate increased only to 26.0%.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2. DECIPHER TM </SectionTitle> <Paragraph position="0"> SRI has developed the DECIPHER TM system, an HMM-based, speaker-independent, continuous-speech recognition system. Several of DECIPHER TM's attributes are discussed in the references (Butzberger et al., [1]; Murveit et al., [2]). Until recently, DECIPHER TM's application has been limited to DARPA's Resource Management task (Pallett, [3]; Price et al., [4]), DARPA's ATIS task (Price, [5]), the Texas Instruments continuous-digit recognition task (Leonard, [6]), and other small-vocabulary recognition tasks. This paper describes the application of DECIPHER TM to the task of recognizing words from a large-vocabulary corpus composed primarily of read speech.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 3. THE CSR TASK </SectionTitle> <Paragraph position="0"> Doddington [7] gives a detailed description of DARPA's CSR task and corpus. Briefly, the CSR corpus* is composed of recordings of speakers reading passages from the Wall Street Journal newspaper. The corpus is divided in many ways; it includes speaker-dependent vs. speaker-independent sections, and sentences where the users were asked to verbalize the punctuation (VP) vs. those where they were asked not to verbalize the punctuation (NVP). There are also a small number of recordings of spontaneous speech that can be used in development and evaluation.</Paragraph> <Paragraph position="1"> The corpus and associated development and evaluation materials were designed so that speech recognition systems may be evaluated in an open-vocabulary mode (none of the words used in evaluation are known in advance by the speech recognition system) or in a closed-vocabulary mode (all the words in the test sets are given in advance). There are suggested 5,000-word and 20,000-word open- and closed-vocabulary language models that may be used for development and evaluation. This paper discusses a preliminary evaluation of SRI's DECIPHER TM system using read speech from the 5,000-word closed-vocabulary tasks with verbalized and nonverbalized punctuation.</Paragraph> <Paragraph position="2"> *The current CSR corpus, designated WSJ0, is a pilot for a larger corpus to be collected in the future.</Paragraph> </Section> <Section position="6" start_page="0" end_page="410" type="metho"> <SectionTitle> 4. PORTING DECIPHER TM TO THE CSR TASK </SectionTitle> <Paragraph position="0"> Several types of data are needed to port DECIPHER TM to a new domain: vocabulary lists and language models, word pronunciations, acoustic training data, and a specification of how recognized word strings map to actions in the domain (not applicable to the CSR task).</Paragraph> <Section position="1" start_page="410" end_page="410" type="sub_section"> <SectionTitle> 4.1. CSR Vocabulary Lists and Language Models </SectionTitle> <Paragraph position="0"> Doug Paul at MIT Lincoln Laboratory provided us with baseline vocabularies and language models for use in the February 1992 CSR evaluation. These included vocabularies for the closed-vocabulary 5,000- and 20,000-word tasks as well as backed-off bigram language models for these tasks. Since we used backed-off bigrams for our ATIS system, it was straightforward to use the Lincoln language models as part of the DECIPHER TM-CSR system. (A minimal sketch of the backoff scheme appears below.)</Paragraph> </Section>
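The following is a minimal, illustrative sketch of the backed-off bigram idea: if a bigram was seen in training, use a discounted estimate; otherwise back off to a suitably scaled unigram. It is not SRI's or Lincoln's implementation — the absolute-discounting scheme, function names, and toy data are assumptions made for illustration (Katz-style backoff with Good-Turing discounting is what such systems typically used).

```python
from collections import Counter

def make_backoff_bigram(tokens, discount=0.5):
    """Return p(w2 | w1) for a toy backed-off bigram model.

    Seen bigrams get a discounted maximum-likelihood estimate; the
    probability mass freed by discounting is redistributed over unseen
    successors in proportion to their unigram probabilities.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = sum(unigrams.values())

    def p_unigram(w):
        return unigrams[w] / total

    def p_bigram(w1, w2):
        if bigrams[(w1, w2)] > 0:
            # Seen bigram: discounted maximum-likelihood estimate.
            return (bigrams[(w1, w2)] - discount) / unigrams[w1]
        # Unseen bigram: back off to the unigram, scaled by a backoff
        # weight alpha(w1) so probabilities for history w1 sum to one.
        seen = [w for (v, w) in bigrams if v == w1]
        reserved = discount * len(seen) / unigrams[w1]
        unseen_mass = 1.0 - sum(p_unigram(w) for w in seen)
        alpha = reserved / unseen_mass if unseen_mass > 0 else 0.0
        return alpha * p_unigram(w2)

    return p_bigram

# Toy usage: "cat" was seen after "the", "ran" was not.
p = make_backoff_bigram("the cat sat on the mat the cat ran".split())
print(p("the", "cat"))  # discounted ML estimate
print(p("the", "ran"))  # backed-off, scaled unigram estimate
```

The actual probabilities in the DECIPHER TM-CSR system came from the Lincoln-supplied language models, not from counts computed by code like this.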
<Section position="2" start_page="410" end_page="410" type="sub_section"> <SectionTitle> 4.2. CSR Pronunciations </SectionTitle> <Paragraph position="0"> SRI maintains a list of words and pronunciations with associated probabilities that are estimated automatically (Cohen et al., [8]). However, a significant number of words in the speaker-independent CSR training, development, and (closed-vocabulary) test data were outside this list. Because of the tight schedule for the CSR evaluation, SRI looked to Dragon Systems, which generously provided SRI and other DARPA contractors with limited use of a pronunciation table for all the words in the CSR task. SRI combined its internal lexicon with portions of the Dragon pronunciation list to generate a pronunciation table for the DECIPHER TM-CSR system.</Paragraph> </Section> <Section position="3" start_page="410" end_page="410" type="sub_section"> <SectionTitle> 4.3. CSR Training Data </SectionTitle> <Paragraph position="0"> The National Institute of Standards and Technology provided SRI with several CD-ROMs containing training, development, and evaluation data for the February 1992 DARPA CSR evaluation. The data were recorded at SRI, MIT, and TI. The baseline training condition for the speaker-independent CSR task includes 7,240 sentences from 84 speakers: 3,586 sentences from 42 men and 3,654 sentences from 42 women.</Paragraph> </Section> </Section> <Section position="7" start_page="410" end_page="411" type="metho"> <SectionTitle> 5. PRELIMINARY CSR PERFORMANCE </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="410" end_page="411" type="sub_section"> <SectionTitle> 5.1. Development Data </SectionTitle> <Paragraph position="0"> We have partitioned the speaker-independent CSR development data into four portions for the purpose of this study. Each set contains 100 sentences. The respective sets are male and female speakers using verbalized and nonverbalized punctuation. There are 6 male speakers and 4 female speakers in the SI WSJ0 development data.</Paragraph> <Paragraph position="1"> The next section shows word recognition performance on this development set using 5,000-word, closed-vocabulary language models with verbalized and nonverbalized bigram grammars. The perplexity of the verbalized-punctuation sentences in the development set is 90.</Paragraph> </Section>
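For reference, the perplexity figure quoted above is conventionally the geometric-mean inverse probability the language model assigns to the test data; this is the standard definition, not a formula quoted from the paper. For a bigram model over words $w_1, \ldots, w_N$:

$$
\mathrm{PP} \;=\; P(w_1, \ldots, w_N)^{-1/N} \;=\; \exp\!\Big(-\frac{1}{N}\sum_{i=1}^{N} \ln P(w_i \mid w_{i-1})\Big).
$$

A perplexity of 90 thus means the model is, on average, about as uncertain as if it were choosing uniformly among 90 words at each position.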
<Section position="2" start_page="411" end_page="411" type="sub_section"> <SectionTitle> 5.2. Results for a Simplified System </SectionTitle> <Paragraph position="0"> Our strategy was to implement a system as quickly as possible. Thus, we initially implemented a system using four vector-quantized speech features with no cross-word acoustic modeling. Performance of this system on our development set is described in the tables below.</Paragraph> <Paragraph position="1"> The female speakers are those above the bold line in Table 1. Recognition speed on a Sun SPARCstation 2 was approximately 40 times slower than real time (over 4 minutes/sentence) using a beam search and no fast match (our standard smaller-vocabulary algorithm), although it was dominated by paging time.</Paragraph> <Paragraph position="2"> A brief analysis of Speaker 422 shows that he speaks much faster than the other speakers, which may contribute to the high error rate for his speech.</Paragraph> </Section> <Section position="3" start_page="411" end_page="411" type="sub_section"> <SectionTitle> 5.3. Full DECIPHER TM-CSR Performance </SectionTitle> <Paragraph position="0"> We then tested a larger DECIPHER TM system on our VP development set. That is, the previous system was extended to model some cross-word acoustics, the number of spectral features was increased from four to six (second derivatives of cepstra and energy were added), and a tied-mixture hidden Markov model (HMM) replaced the vector-quantized HMM above. This resulted in a modest improvement, as shown in the table below. (A sketch of the tied-mixture output distribution follows this section.)</Paragraph> </Section> </Section>
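For readers unfamiliar with the distinction drawn in Section 5.3, here is the standard tied-mixture (semi-continuous) output distribution; the notation is generic rather than taken from the paper. Each state $j$ shares a single codebook of $K$ Gaussians and differs from other states only in its mixture weights:

$$
b_j(\mathbf{x}_t) \;=\; \sum_{k=1}^{K} c_{jk}\, \mathcal{N}\!\left(\mathbf{x}_t;\, \boldsymbol{\mu}_k, \boldsymbol{\Sigma}_k\right), \qquad \sum_{k=1}^{K} c_{jk} = 1,
$$

where the Gaussians are tied across all states. Compared with the vector-quantized HMM it replaced, the hard codebook entries become Gaussian densities and the discrete output probabilities become the weights $c_{jk}$, so observations are no longer quantized to a single codeword.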
<Section position="8" start_page="411" end_page="411" type="metho"> <SectionTitle> 6. DRY-RUN EVALUATION </SectionTitle> <Paragraph position="0"> Subsequent to the system development described above, we evaluated the &quot;full recognizer&quot; system on the February 1992 Dry-Run evaluation materials for speaker-independent systems. We achieved word error rates of 17.1% without VP and 16.6% with VP, as measured by NIST.*</Paragraph> <Paragraph position="1"> *The NIST error rates differ slightly (insignificantly) from our own measures (17.1% and 16.6%); however, to be consistent with the other error rates reported in this paper, we are using our internally measured error rates in the tables.</Paragraph> </Section> <Section position="9" start_page="411" end_page="412" type="metho"> <SectionTitle> 7. OTHER MICROPHONE RESULTS </SectionTitle> <Paragraph position="0"> The WSJ0 corpus was collected using two microphones simultaneously recording the talker. One was a Sennheiser HMD-410 and the other was chosen randomly for each speaker from among a large group of microphones. Such dual recordings are available for the training, development, and evaluation materials.</Paragraph> <Paragraph position="1"> We chose to evaluate our full system on the &quot;other-microphone&quot; data without using other-microphone training data. The error rate increased only 62.3% when evaluating with other-microphone recordings vs. the Sennheiser recordings. In these tests, we configured our system exactly as for the standard-microphone evaluation, except that we used SRI's noise-robust front end (Erell and Weintraub, [9,10]; Murveit et al., [11]) as the signal processing component.</Paragraph> <Paragraph position="2"> Table 4 summarizes the &quot;other-microphone&quot; evaluation results. Speaker 424's performance, where the error rate increases 208.2% (from 18.4% to 56.7%) when using a Shure SM91 microphone, is a problem for our system. However, the microphone is not the sole source of the problem, since the performance of Speaker 427, with the same microphone, is degraded only 18.9% (from 9.0% to 10.7%). We suspect that the problem is due to a loud buzz in the recordings that is absent from the recordings of other speakers.</Paragraph> </Section> <Section position="10" start_page="412" end_page="426" type="metho"> <SectionTitle> 8. EXTRA TRAINING DATA </SectionTitle> <Paragraph position="0"> We suspected that the set of training data specified as the baseline for the February 1992 Dry-Run evaluation was insufficient to adequately estimate the parameters of the DECIPHER TM system. The baseline SI training condition contains approximately 7,240 sentences from 84 speakers (42 male, 42 female).</Paragraph> <Paragraph position="1"> We used the SI and SD training and development data to train the system to see if performance could be improved with extra data. However, to save time, we used only speech from male speakers to train and test the system. Thus, the training data for the male system was increased from 3,586 sentences (42 male speakers) to 9,109 sentences (53 male speakers).* The extra training data reduced the error rate by approximately 20%, as shown in Table 5.</Paragraph> <Paragraph position="2"> *The number of speakers did not increase substantially since the bulk of the extra training data was taken from the speaker-dependent portion of the corpus.</Paragraph> <Paragraph position="3"> [Table 5: Evaluation of male speakers with extra training data. Columns: Speaker, Baseline Training, Larger-Set Training.]</Paragraph> <Paragraph position="4"> Interestingly, this reduced error rate equalled that for speaker-dependent systems trained with 600 sentences per speaker and tested with the same language model used here. However, speaker-dependent systems trained on 2,000+ sentences per speaker did perform significantly better than this system.</Paragraph> </Section>
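As an illustrative check on how the relative figures in Sections 7 and 8 are computed (our arithmetic, using numbers quoted elsewhere in the paper, not values read from Table 5 itself): the abstract's VP error rates give

$$
\frac{16.6 - 12.9}{16.6} \approx 0.223,
$$

i.e., roughly the 20% relative reduction cited above; likewise, Speaker 424's degradation in Section 7 is $(56.7 - 18.4)/18.4 \approx 2.082$, the 208.2% relative increase reported there.

</Paper>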