<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1077">
  <Title>SESSION 12: CONTINUOUS SPEECH RECOGNITION AND EVALUATION II*</Title>
  <Section position="1" start_page="0" end_page="379" type="abstr">
    <SectionTitle>
SESSION 12: CONTINUOUS SPEECH
RECOGNITION AND EVALUATION II*
</SectionTitle>
    <Paragraph position="0"> This session featured a summary of the first dry run benchmark tests on the new Wall Street Journal (WSJ) continuous speech recognition (CSR) pilot corpus, and a description of the techniques used and lessons learned by the four sites who conducted the large vocabulary CSR tests.</Paragraph>
    <Paragraph position="1"> For the first presentation, Dave Pallett distributed a handout with system descriptions and results. He credited the people involved, and indicated the tight schedule which was met. The tests included three training paradigms: speaker-dependent (SD); longitudinal speaker-dependent (LSD), with much more training speech; and speaker-independent (SI). Tests included 5K and 20K vocabularies, bigram and trigram language models, and recognition on speech collected with verbalized punctuation (VP) and with no verbalized punctuation (NVP). Data was shown indicating special difficulties with a few of the speakers. Results were presented on signal-to-noise ratios for both the primary and secondary microphones. The data are summarized in Pallett's Proceedings paper. Some comments on the results are given below.</Paragraph>
    <Paragraph position="2"> The next four papers, on recognition of the WSJ data at Dragon, CMU, Lincoln Laboratory, and SKI, included the common theme that extending a CSR system to a much larger vocabulary and more general task domain required more than a new dictionary and language model. In particular, major increases in search time, computation for matching, and memory utilization required each site to make compromises or revise strategies in acoustic modelling, search, and matching strategies. Despite the preliminary nature of the work on this corpus, encouraging results were obtMned and important issues were rMsed.</Paragraph>
    <Paragraph position="3"> The Dragon paper was presented by Francesco Scattone, and described two recognition approaches that were developed and tested. The first method utilized unimodal phonetic elements (PELs), and the second a variation of tied mixtures very recently implemented at Dragon, which *This work was sponsored by the Defense Advanced lq.esearch Projects Agency. The views expressed axe those of the author and do not reflect the official policy or position of the U.S. Government.</Paragraph>
    <Paragraph position="4"> was used in Dragon's dry run evaluation test on the 5,000 word SD portion of the corpus. The tied-mixture models proved very effective in modelling the multi-modality of parameter distributions, and generally yielded better recognition results. Scattone indicated that future work will focus on further development of the tied-mixture techniques, including efforts to develop high-performance speaker-independent recognition techniques.</Paragraph>
    <Paragraph position="5"> Next, Fil Alleva discussed the application of CMU's SPHINX-II system to the WSJ CSR task. An important change to SPHINX-II which was made to reduce running time was to use only left-context-dependent cross word models; in addition, a number of changes were made to the Viterbi search to reduce running time. Tests were run on a variety of conditions, including the spontaneous speech, and results are summarized in the paper.</Paragraph>
    <Paragraph position="6"> The next paper, by Doug Paul, described substantial changes made to the Lincoln Tied-Mixture HMM CSR, to achieve effective operation for the large-vocabulary CSR task. The recognizer, which had previously used a time-synchronous beam-pruned search, was converted to a stack-decoder-based search strategy with an acoustic fast match. Cross-word models had not yet been included.</Paragraph>
    <Paragraph position="7"> The stack decoder strategy was shown to perform effectively for the larger vocabularies, and a variety of development test and evaluation test results were presented. In addition, a rapid speaker enrollment procedure was described, and positive (but preliminary) results on rapid adaptation (using the standard WSJ 40 adaptation sentences) were presented. A discussion followed, focusing on the language modelling, and on perplexity for closed and open vocabularies.</Paragraph>
    <Paragraph position="8"> Hy Murveit described the application of SRI's DECIPHER system to the WSJ CSR task. He focused primarily on performance, since the CSR system used was essentially identical to the system used in ATIS. He acknowledged help from Dragon (Lexicon) and Lincoln (Language Models) in porting to the WSJ task. He described how DECIPHER was stripped down to reduce computation for the task. Tests on the secondary microphone were described, with about 40% increase in error rate. An experiment was described to investigate the effects of additional  SI training data. The experiment indicated that substantial increases in SI training data could produce significant reductions in error rate relative to those reported in the dry run evaluation tests.</Paragraph>
    <Paragraph position="9"> The chairman initiated the discussion period which followed this final presentation by presenting a plot of error rate vs perplexity for the WSJ dry run tests, the previous best resource management (RM) results, and CSR dictation results which had been presented by IBM at ICASSP-89 (Bahl, et.al., Large Vocabulary Natural Language Continuous Speech Recognition). For perplexity80, the WSJ error rates ranged from 9.0% (LSD) to 12.9% (best SO) to 16.6% (best SI). These error rates were considerably higher than the most recent perplexity 60 RM results (1.8% for SD) and (3.8% for SI), but not as much higher than the perplexity-90 SD IBM results (an 11% error rate was reported in the ICASSP-89 Proceedings paper, and an improved error rate of about 5% was presented at the ICASSP-89 talk). With the understanding that results obtained in these different tests are not directly comparable, still a fair conclusion which could be drawn is that the WSJ corpus is a sufficiently-chailenging one (especially when 20K vocabularies, spontaneous speech, and secondary microphones are considered), and that the results of the first dry run test were quite encouraging.</Paragraph>
    <Paragraph position="10"> Most of the ensuing discussion focused on the WSJ corpus and evaluation issues which George Doddington had listed in his earlier CCCC talk. These are summarized below by topic.</Paragraph>
    <Paragraph position="11"> In summarizing the discussion, an attempt is made to sample the range of comments and issues raised.</Paragraph>
  </Section>
class="xml-element"></Paper>