<?xml version="1.0" standalone="yes"?> <Paper uid="H92-1078"> <Title>DARPA FEBRUARY 1992 PILOT CORPUS CSR &quot;DRY RUN&quot; BENCHMARK TEST RESULTS</Title> <Section position="3" start_page="0" end_page="382" type="metho"> <SectionTitle> 2 Benchmark Test Material </SectionTitle> <Paragraph position="0"> The selected portion of the Evaluation Test Set that was distributed for use in the &quot;dry run&quot; benchmark tests, like the training material, included three major components: (1) Longitudinal-Speaker-Dependent speech recognition system test material, for use with the 3 speakers for whom approximately 2,400 CSR WSJ sentence utterances per speaker were available for speaker-dependent system training, (2) Speaker-Dependent system test material, for use with a set of 12 speakers for whom 600 sentence utterances were available for speaker-dependent system training, and (3) Speaker-Independent system test material, with 10 speakers. The data was further broken down into verbalized-punctuation (VP) and nonverbalized-punctuation (NVP) subsets and 5,000- vs. 20,000-word vocabularies.</Paragraph> <Paragraph position="1"> For the purposes of speaker-independent system development, a specific set of approximately 7,200 utterances obtained from an independent set of 84 speakers included in the training portion of the corpus had been designated with the concurrence of the CCCC.</Paragraph> <Paragraph position="2"> The test material included material from SRI, MIT/LCS, TI, and NIST (for one Speaker Independent subject). Approximately 50% was from male speakers and 50% from female speakers.</Paragraph> <Paragraph position="3"> As noted elsewhere \[1-2\], the training and test material was selected with reference to predefined 5,000-word and 20,000-word lexicons, but with a controlled percentage of out-of-vocabulary (OOV) items. NIST's analysis of the test sets indicates that the actual occurrence of OOV items in the 5,000-word SI test material is approximately 1.4% to 2.0%, and for the 20,000-word SI test material, it is 2.0% to 2.5%. In contrast, for the SI spontaneous test set, the incidence of OOV items with respect to the 5,000-word closed language model is 13.2% to 15.6%, and 5.3% to 5.6% with respect to the 20,000-word language model.</Paragraph> <Paragraph position="4"> For this Pilot CSR Corpus, data was collected with both &quot;primary&quot; and &quot;secondary&quot; microphones. In every case, the primary microphone was a member of the Sennheiser close-talking, supra-aural, headset-mounted, noise-cancelling family (e.g., HMD-414, HMD-410). However, the microphones used as the secondary microphone were varied, and included boundary-effect surface-mounted microphones such as the Crown PCC-160, Crown PZM6FS, and Shure SM91.</Paragraph> </Section>
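To make the out-of-vocabulary analysis above concrete, the following is a minimal sketch, in Python, of how an OOV rate can be computed for a set of reference transcriptions against a closed lexicon. It is not the NIST analysis tooling; the file names and the simple whitespace tokenization are assumptions made only for illustration.

```python
# Minimal sketch of an out-of-vocabulary (OOV) rate computation, assuming one
# reference transcription per line and a lexicon file with one word per line.
# File names and tokenization are illustrative, not the NIST tools.

def load_lexicon(path):
    """Return the closed vocabulary as an upper-cased set of words."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().upper() for line in f if line.strip()}

def oov_rate(transcription_path, lexicon):
    """Fraction of word tokens in the transcriptions not found in the lexicon."""
    total = oov = 0
    with open(transcription_path, encoding="utf-8") as f:
        for line in f:
            for word in line.upper().split():
                total += 1
                if word not in lexicon:
                    oov += 1
    return oov / total if total else 0.0

if __name__ == "__main__":
    lexicon_5k = load_lexicon("wsj_5k_lexicon.txt")                 # hypothetical file name
    rate = oov_rate("si_test_transcriptions.txt", lexicon_5k)       # hypothetical file name
    print(f"OOV rate: {100.0 * rate:.1f}%")
```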
<Section position="4" start_page="382" end_page="382" type="metho"> <SectionTitle> 3 Benchmark Test Protocols </SectionTitle> <Paragraph position="0"> The CCCC had agreed that, insofar as very little time had been allocated for system development and use of the Training and Development Test material, the contractors' results would not be reported to NIST until February 17th, less than one week prior to the Speech and Natural Language Workshop. It was also agreed that the existing scoring software would be used, as well as previously established procedures for scoring and reporting speech recognition benchmark tests.</Paragraph> <Paragraph position="1"> The four sites (CMU \[5\], Dragon Systems \[6\], MIT/LL \[7\], and SRI \[8\]) provided NIST with a total of 22 sets of results for a number of test sets and system configurations. The number of test set results provided by individual contractors ranged from 1 to 10.</Paragraph> <Paragraph position="2"> NIST reported scores back to the contractors on February 19th. Subsequently, small discrepancies (typically less than one percent in the individual speakers' scores) were noted between the scores that had been determined at the individual sites and NIST's scores. Some of these discrepancies were due to a problem in NIST's scoring program in handling the occurrence of left-parenthesis characters, &quot;(&quot;, in the hypothesis strings, and these differences were resolved after the Workshop. Consequently, there may be small unresolved differences between scores reported in this paper and others in this Proceedings.</Paragraph> </Section>
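As a rough illustration of the word-error-rate scoring referred to above, the sketch below aligns each hypothesis string against its reference by dynamic programming and counts substitutions, insertions, and deletions. It is not the NIST scoring package; the example strings are invented, and the line that strips stray left-parenthesis characters from the hypothesis is included only to echo the issue noted in this section.

```python
# Rough sketch of word-error-rate scoring via Levenshtein alignment over words.
# Illustrative only; this is not the NIST scoring software.

def word_error_rate(reference, hypothesis):
    """Return (substitutions + insertions + deletions) / reference length."""
    ref = reference.upper().split()
    # Echoing the issue noted above: drop stray "(" characters from the hypothesis.
    hyp = hypothesis.replace("(", " ").upper().split()

    # d[i][j] = minimum edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j - 1] + sub,   # substitution / match
                          d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1)         # insertion
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

if __name__ == "__main__":
    ref = "THE COMPANY REPORTED HIGHER EARNINGS"
    hyp = "THE COMPANY REPORTED ( HIGHER EARNING"
    print(f"word error rate: {100.0 * word_error_rate(ref, hyp):.1f}%")
```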
<Section position="5" start_page="382" end_page="382" type="metho"> <SectionTitle> 4 &quot;Best&quot; Dry Run Evaluation Test Benchmark Test Results </SectionTitle> <Paragraph position="0"> The DARPA Spoken Language community's efforts to collect, annotate, process, and distribute the Pilot Corpus were challenging and highly stressful. It was generally agreed that there was insufficient time for system development between release of the training data and reporting of &quot;dry run&quot; results, and that the systems for which results could be reported at the meeting represented only preliminary efforts.</Paragraph> <Paragraph position="1"> Papers presented at the meeting typically include comments such as: &quot;The training paradigm outlined ... in the description ... has only recently been fully implemented ...&quot; and &quot;there has not yet been any opportunity for parameter optimization&quot; \[6\], or &quot;The tests ... reported here are little more than pilot tests for debugging purposes and no strong conclusions can be drawn&quot; \[7\], and &quot;Our strategy was to implement a system as quickly as possible in order to meet the tight CSR deadline&quot; \[8\]. In view of these comments, and because comparisons across sites would be inconclusive, only a selected subset of the results reported at the meeting is included in this paper. Several of the participants suggested that it would be acceptable to cite the &quot;best&quot; scores, based on the lowest word error rate in a given test subset, and to do so without attribution to any specific system.</Paragraph> <Paragraph position="2"> The &quot;dry run&quot; test results included in this paper (Table 1) are restricted to those selected &quot;best&quot; reported scores, and are presented without attribution to specific systems or sites. References 5 to 8 may contain additional information defining the context of these scores, or contain complementary findings based on experience with the development test sets. Caution should be exercised in interpreting these results as a valid indicator of the state of the art, because of the short time for system development and debugging (as noted above).</Paragraph> </Section> <Section position="6" start_page="382" end_page="383" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The initial &quot;dry run&quot; test results indicate general trends.</Paragraph> <Paragraph position="1"> Many of these trends may seem obvious, but they are noted because one of the purposes of the CSR Pilot Corpus and the &quot;dry run&quot; was to verify the community's expectations with respect to the challenges inherent in large-vocabulary continuous speech recognition, and to gauge the relative significance of many factors.</Paragraph> <Paragraph position="2"> * Results for the test sets selected from the smaller vocabulary (5,000 vs. 20,000 words) have lower error rates (e.g., for the longitudinal speaker-dependent speakers, for VP, 6.7% word error for the 5,000-word test subset vs. 10.6% for the 20,000-word test subset).</Paragraph> <Paragraph position="3"> * Results for better-trained speaker-dependent systems are better than for less-well-trained speaker-dependent systems (e.g., 6.7% word error for the test subset for the longitudinal speaker-dependent speakers (5,000-word VP) vs. 14.7% for the speaker-dependent speakers, for whom one-fourth as much training material was available per speaker).</Paragraph> <Paragraph position="4"> * Results for speaker-independent systems have higher error rates than for speaker-dependent systems (e.g., 16.6% word error for the Speaker Independent subset (5,000-word VP) vs. 12.9% for the corresponding Speaker Dependent subset and 6.7% for the Longitudinal Speaker Dependent subset).</Paragraph> <Paragraph position="5"> * Results for the VP test subsets in general have lower error rates than for the NVP test subsets.</Paragraph> <Paragraph position="6"> * Comparison of spontaneous vs. &quot;read spontaneous&quot; data indicates that the read spontaneous data has lower error rates (as had been noted with the earlier ATIS0 data).</Paragraph> <Paragraph position="7"> Although reservations have been expressed by the participants in this initial &quot;dry run&quot; test, it should be noted that the results are highly encouraging in many ways. As the participants noted, &quot;The successful application of ... to the WSJ-CSR task demonstrates the utility of ...&quot; and &quot;We have also demonstrated the utility of ...
in the context of a much larger task&quot; \[5\], and &quot;It is encouraging that ..., given there has not yet been any opportunity for parameter optimization&quot; \[6\], and &quot;The results, however, show promise and will require more rigorous testing&quot; \[7\], and &quot;This is a preliminary report demonstrating that ... was ported from a 1,000-word task (ATIS) to a large-vocabulary (5,000-word) task ... in three man-weeks&quot; \[8\]. Based on these observations, and on the experience gained in designing, collecting, and distributing the DARPA Pilot CSR Corpus, and in rapidly adapting existing technology to the new domain, there is good reason to look forward to the results of future benchmark tests.</Paragraph> <Paragraph position="8"> The results for the challenging &quot;spontaneous&quot; and &quot;read spontaneous&quot; speech test subsets are based on only one site's processing of the test data.</Paragraph> <Paragraph position="9"> Only one site \[8\] reported results using both the primary and the secondary microphone(s) for the 5,000-word Speaker Independent VP subset, reporting 16.6% word error for the primary-microphone data and 26.0% for the secondary-microphone data. The incremental degradation in performance was regarded by the developers as less than might have been expected and &quot;noteworthy&quot; \[9\], particularly in view of the fact that the broadband signal-to-noise ratio for the secondary-microphone data was typically 20 to 30 dB less than that for the primary-microphone data.</Paragraph> <Paragraph position="10"> Substantial variability in the rank-ordering of individual speakers can be noted across systems for those data subsets for which more than one site's or system's responses were reported. Analysis of these data suggests that some systems had greater variance across the speaker population than others, perhaps because of inadequate time to develop robust speaker-independent models.</Paragraph> <Paragraph position="11"> NIST's measurements of the broadband S/N ratios for the primary-microphone data from MIT, SRI, and TI range from 40 to 48 dB, with values for the secondary-microphone data some 20 to 30 dB less than that. Histograms showing the distribution of levels for these files reveal evidence of gain changes between sessions and of the use of different secondary microphones for different data collection sessions (i.e., for the adaptation sentences vs. the read Wall Street Journal sessions).</Paragraph> </Section> <Section position="8" start_page="383" end_page="383" type="metho"> <SectionTitle> 6 Acknowledgements </SectionTitle> <Paragraph position="0"> At NIST, John Garofolo has been the individual responsible for coordinating and screening much of the CSR data collected at MIT/LCS, SRI, and TI. Brett Tjaden assisted in the preparation of the master tapes for CD-ROM production at MIT/LCS. Jon Fiscus adapted the NIST speech recognition scoring software for scoring the test results and implemented the software in preparing the official results.</Paragraph> </Section> </Paper>