<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1011"> <Title>1993 BENCHMARK TESTS FOR THE ARPA SPOKEN LANGUAGE PROGRAM</Title> <Section position="4" start_page="0" end_page="49" type="metho"> <SectionTitle> 2. WSJ-CSR TESTS 2.1. New Conditions </SectionTitle> <Paragraph position="0"> All sites participating in the WSJ-CSR tests were required to submit results for (at least) one of two &quot;Hub&quot; tests. The Hub tests were intended to measure basic speaker-independent performance on either a 64K-word (Hub 1) or 5K-word (Hub 2) read-speech test set, and included required use of either a &quot;standard&quot; 20K trigram (Hub 1) or 5K bigram (Hub 2) grammar, and also required use of standard training sets.</Paragraph> <Paragraph position="1"> These requirements were intended to facilitate meaningful cross-site comparisons.</Paragraph> <Paragraph position="2"> The &quot;Spoke&quot; tests were intended to support a number of different challenges.</Paragraph> <Paragraph position="3"> Spokes 1, 3 and 4 supported problems in various types of adaptation: incremental supervised language model adaptation (Spoke 1), rapid enrollment speaker adaptation for &quot;recognition outliers&quot; (i.e., non-native speakers) (Spoke 3), and incremental speaker adaptation (Spoke 4). [There were no participants in what had been planned as Spoke 2.] Spokes 5 through 8 supported problems in noise and channel compensation: unsupervised channel compensation (Spoke 5), &quot;known microphone&quot; adaptation for two different microphones (Spoke 6), unsupervised channel compensation for 2 different environments (Spoke 7), and use of a noise compensation algorithm with a known alternate microphone for data collected in environments where there is competing &quot;calibrated&quot; noise (radio talk shows or music) (Spoke 8). Spoke 9 included spontaneous &quot;dictation-style&quot; speech. Additional details are found in Kubala, et al. [1], on behalf of members of the ARPA Continuous Speech Recognition Corpus Coordinating Committee (CCCC).</Paragraph> <Section position="1" start_page="0" end_page="49" type="sub_section"> <SectionTitle> 2.2. WSJ-CSR Summary Highlights </SectionTitle> <Paragraph position="0"> The design of the &quot;Hub and Spoke&quot; test paradigm was such that opportunities abounded for informative contrasts (e.g., the use of bigram vs. trigram grammars, the enablement/disablement of supervised vs. unsupervised adaptation strategies, etc.).</Paragraph> <Paragraph position="1"> There were nine sites participating in the Hub 1 tests and five sites participating in the Hub 2 tests, and some sites reported results for more than one system or research team. The lowest word error rate in the Hub 1 baseline condition was achieved by the French CNRS-LIMSI group [2,3].</Paragraph> <Paragraph position="2"> Application of statistical significance tests indicated that the performance differences between this system and a system developed by the Cambridge University Engineering Department using the &quot;HMM Toolkit&quot; approach [4-6] were not significant. The Cambridge University HMM Toolkit approach also yielded excellent results for the smaller-vocabulary Hub 2 tests. The lowest word error rate for an ARPA contractor on the Hub 1 test data, for the C1 condition permitting valid cross-site comparisons, was reported by the group at CMU [7-9]. The CMU results were not significantly different from the corresponding results for the Cambridge University HMM Toolkit system. The lowest word error rate for an ARPA contractor for the (less constrained) P0 condition was reported by the group at BBN. It is difficult to summarize the results of the spoke tests, except to note that results were reported for 8 different &quot;spoke conditions&quot;, with from 1 to 3 participants and systems typically involved in each spoke. Details are presented in the Appendix.</Paragraph>
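The paired-comparison significance testing referred to above can be illustrated with a small sketch. The example below applies a Wilcoxon signed-rank test and a sign test to per-speaker word error rates from two systems using SciPy; it is only an illustration of the kind of test used, not the NIST benchmark scoring software, and the error rates shown are invented placeholders.

```python
# Illustrative only: paired-comparison significance tests of the kind
# referred to above, applied to per-speaker word error rates (%) from two
# hypothetical systems.  The numbers are placeholders, not benchmark data.
from scipy.stats import wilcoxon, binomtest

sys_a = [7.1, 9.4, 12.0, 8.3, 10.5, 6.8, 9.9, 15.2, 14.7, 16.1]
sys_b = [7.6, 9.1, 12.9, 8.9, 11.2, 7.2, 10.7, 16.2, 15.8, 17.3]

# Wilcoxon signed-rank test on the paired per-speaker differences.
w_stat, p_wilcoxon = wilcoxon(sys_a, sys_b)

# Sign test: how often does system A beat system B, compared with a fair coin?
wins_a = sum(a < b for a, b in zip(sys_a, sys_b))
ties = sum(a == b for a, b in zip(sys_a, sys_b))
p_sign = binomtest(wins_a, n=len(sys_a) - ties, p=0.5).pvalue

print(f"Wilcoxon signed-rank p = {p_wilcoxon:.3f}, sign test p = {p_sign:.3f}")
```

With only 10 test speakers such tests have limited power, which is one reason larger test sets are argued for in the discussion of Section 2.3.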
</Section> </Section> <Section position="5" start_page="49" end_page="50" type="metho"> <SectionTitle> 2.3. WSJ-CSR Discussion </SectionTitle> <Paragraph position="0"> In NIST's analyses of the results, displays of the range of reported word error rates for each speaker across all systems are sometimes informative. These displays tend to draw attention to particularly problematic speakers or systems.</Paragraph> <Paragraph position="1"> Figure 1 shows data for the 10 speakers and 11 systems participating in the required Hub 1 C1 test. The speakers have been ordered from low error rate at the top of the figure to high error rate at the bottom. The length of the plotted line indicates the range in word error rate reported over all systems, and the one-standard-deviation points about the mean are indicated with a &quot;+&quot; symbol.</Paragraph> <Paragraph position="2"> Note that three speakers (40h, 40j, and 40f) have unusually high error rates relative to the other seven in this test set.</Paragraph> <Paragraph position="3"> In previous tests involving the Resource Management Corpus, it was noted that high error rates seemed to be correlated, at least indirectly, with unusually fast or slow rates of speech. To see if this was the case for the present test data, NIST obtained estimates of the average speaking rate (words/minute) for each of the test speakers. These estimates were based solely on the total number of words uttered and the total duration of the waveform files, and more sophisticated measures would be desirable. Figure 2 shows a plot of the word error rate vs. speaking rate for the 10 speakers and 11 systems in the Hub 1 C1 test.</Paragraph> <Paragraph position="4"> This figure, like Figure 1, indicates that speakers 40h, 40j and 40f not only have unusually high error rates relative to the other speakers in this test set, but also that their speaking rate is markedly higher than that of the other seven. Whereas the speaking rate for the seven speakers ranges from approximately 115 to 145 words/minute, for the three speakers with high error rates, the speaking rate ranges from 165 to 175 words/minute.</Paragraph> <Paragraph position="5"> There are at least two factors that may contribute to higher error rates at these fast speaking rates: within-word and across-word coarticulatory effects (e.g., phone deletions) associated with fast (possibly better described as &quot;careless&quot; or &quot;casual&quot;) speech, and possible under-representation of these effects in the training material.</Paragraph> <Paragraph position="6"> Chase, et al. [9], at CMU, noted that for the 4 speakers in Spoke 7 (40g, 40h, 40i, and 40j), two (40g and 40i) could be subjectively characterized as &quot;careful speaker[s]&quot;, but that 40h was characterized as a &quot;pretty fast speaker, [with] very low gain&quot;, and 40j as a &quot;very, very fast speaker&quot;. These &quot;fast speakers&quot; appear in a number of the test sets.</Paragraph>
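A minimal sketch of the crude speaking-rate estimate described above (total words uttered divided by total waveform duration) is shown below. The directory layout, the use of RIFF .wav files, and the one-transcription-per-line format are illustrative assumptions; the actual corpus uses its own waveform and transcription formats.

```python
# A rough speaking-rate estimate of the kind described above: total words
# uttered divided by total waveform duration.  Paths, the .wav container,
# and the transcript format are illustrative assumptions.
import wave
from pathlib import Path

def speaking_rate_wpm(wav_dir: str, transcript_file: str) -> float:
    """Average speaking rate (words/minute) for one speaker's test data."""
    total_seconds = 0.0
    for wav_path in Path(wav_dir).glob("*.wav"):
        with wave.open(str(wav_path), "rb") as w:
            total_seconds += w.getnframes() / w.getframerate()

    # One utterance transcription per line, e.g. "THE COMPANY REPORTED ..."
    with open(transcript_file) as f:
        total_words = sum(len(line.split()) for line in f)

    return 60.0 * total_words / total_seconds

# Hypothetical usage:
# print(speaking_rate_wpm("40f/wav", "40f/transcripts.txt"))
```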
<Paragraph position="7"> NIST's analyses of the distributions of rate of speech for two sets of training material for the Hub 1 test (each consisting of approximately 30,000 utterances: &quot;short-term&quot; and &quot;long-term&quot; speakers) indicate that the distributions are rather broad: the short-term speakers' distribution peaks at 130 words/minute with a standard deviation of 30 words/minute, and the long-term speakers' distribution peaks at 145 words/minute with an associated standard deviation of 30 words/minute. Note that the speaking rates for the 3 &quot;fast-talking&quot; speakers fall just outside the &quot;plus one standard deviation&quot; region relative to the peak of the distribution for the &quot;short-term speaker&quot; training set, and just inside the corresponding region relative to the &quot;long-term&quot; training set.</Paragraph> <Paragraph position="8"> Because a number of the measured performance differences between systems were small, and the paired-comparison significance tests did not reject the relevant null hypotheses, it has been observed that, in general, the use of larger test sets, especially for the Hub tests, would have been more informative, particularly with regard to the significance tests requiring larger speaker populations (i.e., the Sign and Wilcoxon Signed-Rank tests). With larger populations of test speakers, such a disproportionately large representation of &quot;fast speakers&quot; in the test sets would be less likely.</Paragraph> <Paragraph position="9"> Two spokes made use of microphones other than the &quot;standard&quot; Sennheiser close-talking microphone. (See, for example, the discussion in the Appendix of this paper for Spokes 5 and 6.) Two other spokes dealt with the issue of performance degradations that were presumably due to degradations in the signal-to-noise ratio. (See, for example, the discussion for Spokes 7 and 8.) For the test data of Spokes 5-7, subsequent to the completion of the tests, NIST performed signal-to-noise ratio (SNR) analyses using three different bandwidth (signal preprocessing) conditions: broadband, A-weighted, and a 300 Hz-3000 Hz passband (&quot;telephone bandwidth&quot;). The filtered SNRs are generally higher than the broadband values.</Paragraph> <Paragraph position="10"> Figure 3 shows the results of these SNR analyses.</Paragraph> <Paragraph position="11"> Figure 3 (a) indicates the SNRs measured for the data of Spoke 5, which includes 10 &quot;unknown&quot; microphones in addition to the simultaneously collected reference Sennheiser close-talking microphone data for each data subset, collected in the normal data collection environment. SRI's &quot;normal offices&quot; for recording speech data have A-weighted sound level values in the 46-48 dB range. There were 2 &quot;tie-clip&quot; or lapel microphones, 5 stand-mounted microphones, a surface-effect microphone, a speakerphone, and a cordless telephone in this set of 10 test microphones.</Paragraph> <Paragraph position="12"> Note that the SNR values for the Sennheiser microphone are typically about 45 dB for both the broadband and A-weighted conditions, indicating that there is little low-frequency energy in the spectrum of the noise in the Sennheiser microphone data. Sennheiser microphone data typically yield values of 50 dB for the telephone-bandwidth condition. 
For the alternate microphones, the broadband SNRs range from about 23 dB (for the Audio-Technica stand-mounted microphone) to 45 dB (for the GE cordless telephone). With filtering, the SNRs are higher, as expected. Note that nearly all of the microphones provide at least a 30 dB telephone-bandwidth SNR, and that the AT Pro 7a lapel-mounted microphone provides approximately 40 dB.</Paragraph> <Paragraph position="13"> Figure 3 (b) indicates the measured SNRs for the data of Spoke 6, which includes 2 &quot;known&quot; alternate microphones in addition to the reference Sennheiser close-talking microphone, collected in the normal data collection environment. For the Sennheiser close-talking microphone, the broadband SNRs are, as for Spoke 5, 45-46 dB. There is a substantial difference between the broadband and A-weighted SNRs for the Audio-Technica stand-mounted microphone, corresponding to low-frequency noise picked up by this microphone; for the telephone-bandwidth condition the SNR is approximately 35 dB. With the telephone handset, SNRs are 38 to 40 dB, depending on bandwidth.</Paragraph> <Paragraph position="14"> The test set data for Spoke 7, shown in Figure 3 (c), involved use of two different microphones (an Audio-Technica stand-mounted microphone and a telephone handset, in addition to the usual &quot;reference&quot; Sennheiser close-talking microphone) in two different noise environments, with background A-weighted noise levels of 58-68 dB.</Paragraph> <Paragraph position="15"> In the quieter of the two &quot;noisy&quot; environments, a computer laboratory with a reported A-weighted sound level in the 58-59 dB range, the broadband SNR was approximately 34-36 dB for the Sennheiser microphone and 35 dB for the telephone handset data, but only 17 dB for the Audio-Technica microphone. Spectral analyses of the Audio-Technica background noise data demonstrate the presence of significant low-frequency energy as well as harmonic components with an approximately 70 Hz fundamental. These components may have originated in some rotating machinery (e.g., a cooling fan or disc drive).</Paragraph> <Paragraph position="16"> In the noisier environment, a room containing machinery with conveyor belts for sorting packages, with a reported A-weighted sound level in the 62-68 dB range, the broadband SNR for the Sennheiser data degraded to 27-29 dB (a decrease of approximately 7 dB), the telephone handset data degraded to 27 dB, and the Audio-Technica to 16 dB (a decrease of only 1 dB). With A-weighting, in the quieter environment, the SNR for the Sennheiser improved very slightly (less than 1 dB relative to the broadband values), and for the Audio-Technica it was 25 dB, 8 dB higher than the broadband value.</Paragraph> <Paragraph position="17"> In the noisier environment, the A-weighted SNR for the Sennheiser data was approximately 29 dB, and for the Audio-Technica 20 dB.</Paragraph> <Paragraph position="18"> For the telephone handset data, both the telephone-bandwidth-filtered and the A-weighted SNRs were higher than, but typically within one or two dB of, the unweighted values, as might be expected.</Paragraph> <Paragraph position="19"> In summary, in the quieter of the two environments used in collecting the data of Spoke 7, none of the data subsets had an average filtered SNR worse than about 25 dB, and in the noisier environment, the worst average filtered SNR for any data subset was approximately 20 dB. These SNR values would not ordinarily be regarded as indicative of severe noise degradation.</Paragraph>
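The three filtered SNR conditions discussed above can be sketched as follows. This is not NIST's measurement procedure: it assumes a speech-containing segment and a noise-only segment have already been identified, approximates A-weighting with the standard IEC 61672 closed-form curve, and uses the 300 Hz-3000 Hz passband quoted earlier for the telephone-bandwidth condition. The 16 kHz synthetic signals at the end are placeholders.

```python
# Sketch of a filtered SNR computation (broadband, A-weighted, and
# telephone-bandwidth), assuming speech-plus-noise and noise-only segments
# have already been identified.  Not NIST's actual measurement procedure.
import numpy as np
from scipy.signal import welch

def a_weight_db(f):
    """Standard A-weighting curve (IEC 61672 closed form), in dB."""
    f2 = np.asarray(f, dtype=float) ** 2
    ra = (12194.0**2 * f2**2) / (
        (f2 + 20.6**2)
        * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
        * (f2 + 12194.0**2)
    )
    return 20.0 * np.log10(np.maximum(ra, 1e-12)) + 2.0

def filtered_snr_db(speech, noise, fs, condition="broadband"):
    """SNR (dB) between a speech segment and a noise-only segment."""
    f, p_speech = welch(speech, fs=fs, nperseg=1024)
    _, p_noise = welch(noise, fs=fs, nperseg=1024)

    if condition == "telephone":            # 300 Hz - 3000 Hz passband
        band = (f >= 300.0) & (f <= 3000.0)
        f, p_speech, p_noise = f[band], p_speech[band], p_noise[band]
        weight = np.ones_like(f)
    elif condition == "a_weighted":          # weight both spectra by A(f)
        weight = 10.0 ** (a_weight_db(f) / 10.0)
    else:                                    # broadband: no weighting
        weight = np.ones_like(f)

    return 10.0 * np.log10(np.sum(weight * p_speech) / np.sum(weight * p_noise))

# Synthetic example at 16 kHz sampling: a 1 kHz tone in low-level white noise.
fs = 16000
rng = np.random.default_rng(0)
noise = 0.01 * rng.standard_normal(2 * fs)
speech = np.sin(2 * np.pi * 1000 * np.arange(2 * fs) / fs) + 0.01 * rng.standard_normal(2 * fs)
for cond in ("broadband", "a_weighted", "telephone"):
    print(cond, round(filtered_snr_db(speech, noise, fs, cond), 1))
```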
<Paragraph position="20"> Spoke 8 involved data collected in the presence of competing noise -- music and talk radio broadcasts. For the case of competing music, the broadband SNR for the reference Sennheiser microphone ranged from 44 dB for the so-called &quot;20 dB&quot; condition, to 36 dB for the &quot;10 dB&quot; condition, and 29 dB for the &quot;0 dB&quot; condition. For the Audio-Technica microphone, the corresponding measured values were 25, 17, and 11 dB. NIST's measurements of SNR for the data containing competing speech were inconclusive because of the difficulty of distinguishing between the spoken test material and the competing talk radio.</Paragraph> </Section> <Section position="6" start_page="50" end_page="51" type="metho"> <SectionTitle> 3. ATIS TESTS 3.1. New Conditions </SectionTitle> <Paragraph position="0"> Recent ATIS tests were similar in many respects to previous ATIS tests -- the primary difference consisting of expansion of the relational air-travel-information database to 46 cities, and use of a body of newly collected and annotated data based on this relational database [10]. As in prior years, the tests included spontaneous speech recognition (SPREC) tests, natural language understanding (NL) tests, and spoken language understanding (SLS) tests. For the first time, data collected at NIST was included in the test and training data.</Paragraph> <Paragraph position="1"> The NIST data was collected using systems provided to NIST by BBN and SRI. In previous years, results for NL and SLS tests were presented and discussed in terms of a &quot;weighted error&quot; percentage, which was computed as twice the percentage of incorrect answers plus the percentage of &quot;No Answer&quot; responses. The decision to weight &quot;wrong answers&quot; twice as heavily as &quot;No Answer&quot; responses was reconsidered within the past year by the ARPA Program Manager, and this year only unweighted NL and SLS errors are reported (i.e., incorrect answers count the same as &quot;No Answer&quot; responses). For most system developers, this change of policy appears to have resulted in changed strategies for system responses, so that in this year's reported results, little use was made of the &quot;No Answer&quot; response.</Paragraph>
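Restating the two scoring rules just described: for N evaluable queries, with N_err incorrect answers and N_na "No Answer" responses, the two error measures are

```latex
% The two ATIS scoring rules described above, for N evaluable queries with
% N_err incorrect answers and N_na "No Answer" responses:
\[
  E_{\text{weighted}}   = \frac{2\,N_{\text{err}} + N_{\text{na}}}{N} \times 100\%,
  \qquad
  E_{\text{unweighted}} = \frac{N_{\text{err}} + N_{\text{na}}}{N} \times 100\% .
\]
```

Under the unweighted measure there is no scoring advantage to returning "No Answer" rather than a guess, which is consistent with the reduced use of "No Answer" responses noted above.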
<Section position="1" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 3.2. Summary Highlights </SectionTitle> <Paragraph position="0"> For the recent ATIS tests, results were reported for systems at seven sites. The lowest error rates were reported by the group at CMU [11]. The magnitude of the differences between systems is frequently small, and the significance of these small differences is not known.</Paragraph> <Paragraph position="1"> As in previous years, error rates for &quot;volunteers&quot; are generally higher than for ARPA contractors, possibly reflecting a lesser level of effort.</Paragraph> <Paragraph position="2"> Additional details about the test paradigm, and comments on some aspects by individual participants, are found in another paper in this Proceedings, by Dahl, et al., on behalf of members of the ARPA Multi-site ATIS Data COllection Working (MADCOW) Group [10]. Details about the technical approaches used by the participants, and their own analyses and comments, are to be found in references [11, 23-28].</Paragraph> </Section> <Section position="2" start_page="51" end_page="51" type="sub_section"> <SectionTitle> 3.3. ATIS Discussion </SectionTitle> <Paragraph position="0"> This year, 46% of the utterances were classified as Class A and 34% as Class D, so that 80% of the test utterances were &quot;answerable&quot; (i.e., Class A or D). Last year's test set had about the same percentage of Class A queries (43%), but somewhat fewer classified as Class D (i.e., 25%), so that last year only 67% were answerable. One possible reason for this change (other than test-set-to-test-set fluctuations) may be that the Principles of Interpretation document is continually being extended to cover phenomena that would otherwise have resulted in categorization of some queries as &quot;unanswerable&quot;, and therefore Class X.</Paragraph> <Paragraph position="1"> For text input (NL test), for last year's test material, the lowest unweighted NL error rate was 6.5% for the Class A+D subset, 6.5% for Class A, and 6.4% for Class D, in contrast with this year's corresponding figures of 9.3%, 6.0% and 13.8%. Note that this year's test set apparently had &quot;more difficult&quot; Class D queries, and that a larger fraction of the queries were classified as Class D than last year (34% vs. 25%).</Paragraph> <Paragraph position="2"> For speech input (SLS test), for last year's test material, the lowest unweighted SLS error rate was 11.0% for the Class A+D subset, 10.2% for Class A, and 12.5% for Class D, in contrast with this year's corresponding figures of 13.2%, 8.9% and 17.5%.</Paragraph> <Paragraph position="3"> Note that while the lowest error rate for Class A queries is smaller this year (i.e., 8.9% vs. 10.2%), this year's best Class D error rate was substantially higher than last year's. It may be the case that this is related to the extended coverage provided by the current Principles of Interpretation document, so that queries that in previous years would have been classified as unanswerable are now judged to be answerable, although context-dependent.</Paragraph> </Section> </Section> <Section position="7" start_page="51" end_page="52" type="metho"> <SectionTitle> 4. ACKNOWLEDGEMENTS </SectionTitle> <Paragraph position="0"> The &quot;Hub and Spoke&quot; test paradigm could not have been developed, specified, or implemented without the tireless and effective efforts of Francis Kubala, as Chair of the ARPA Continuous Speech Recognition Corpus Coordinating Committee (CCCC). The tests would also not have been possible without the dedicated efforts of Denise Danielson and her colleagues at SRI in collecting an exceptionally large and varied amount of CSR data for CSR system training and test purposes. In the ATIS community, Debbie Dahl served as Chair of the MADCOW group, and it is to her credit that new data was collected at several sites with the 46-city relational database and that participating sites reached agreement on the details of the current tests.</Paragraph> <Paragraph position="1"> Kate Hunicke-Smith and her colleagues at SRI International were again responsible for annotation of ATIS data and for assisting NIST in the adjudication process following preliminary scoring. It is a pleasure to acknowledge Kate's thoughtful and cheerful interactions with our group at NIST.</Paragraph> <Paragraph position="2"> As in previous years, the cooperation of many participants in the ARPA data and test infrastructure -- typically several individuals at each site -- is gratefully acknowledged.</Paragraph> </Section> </Paper>