<?xml version="1.0" standalone="yes"?>
<Paper uid="H91-1027">
  <Title>SESSION 4: SPEECH I</Title>
  <Section position="3" start_page="0" end_page="159" type="metho">
    <SectionTitle>
SUMMARY OF PRESENTATIONS AND
DISCUSSION
</SectionTitle>
    <Paragraph position="0"> The first paper, &amp;quot;Field Test Evaluations and Optimizations of Speaker Independent Speech Recognition for Telephone Applications,&amp;quot; by Gagnoulet and Sorin of CNET, was presented by Christel Sorin. This paper discussed various ways of improving system usability and performance by optimizing both the dialog ergonomy and the recognition technology within the constraints of low-cost real-time implementation.</Paragraph>
    <Paragraph position="1"> Techniques discussed included use of field data in training, increasing the number of parameters, automatic adjustments of the HMM structure, and better rejection procedures. A brief discussion of the rejection rate versus error rate tradeoff ensued; nobody had any good data or ideas on how to make this tradeoff, so when one person suggested that the rejection rate should be adjusted to keep the error rate under 5 percent, we said OK and moved on.</Paragraph>
    <Paragraph position="2"> The second paper, &amp;quot;Collection and Analysis of Data From Real Users: Implications for Speech Recognition/ Understanding Systems,&amp;quot; by Judith Spitz and the AI Speech group at NYNEX, concentrated on analyzing user response characteristics as a function of the prompts used, and on comparing user versus laboratory speech characteristics with respect to their effects on recognition performance. Since NYNEX has gone to the trouble of collecting lots of good data, including TIM\[I&amp;quot; data run through the telephone network, there was some discussion of the possibility of distributing some of their data, such as the Network-TIMrr data and telephone services data, through NIST. Legal issues are the most serious problem at this point for the telephone services data, since it is not possible to get explicit consent from the talkers.</Paragraph>
    <Paragraph position="3"> The third paper, &amp;quot;Autodirective Microphone Systems for Natural Communication with Speech Recognizers,&amp;quot; by Flanagan, Mammone, and Elko of Rutgers University, was presented by Jim Flanagan. He surveyed recent advances and opportunities in steerable-beam microphone arrays with automatic source tracking. An audio tape demonstrated excellent-quality recording from a single speaker in a 300-seat auditorium using a 2D array on the ceiling. A video tape showed the 1D array used in the HuManNet system. The relative merits of noise cancellation filters and steerable beams were discussed, and it was suggested that noise cancellation may actually be a much more useful technique when combined with a steerable microphone array.</Paragraph>
    <Paragraph position="4"> The final paper, &amp;quot;Signal Representation, Attribute Extraction, and the Use of Distinctive Features for Phonetic Classification,&amp;quot; by Meng, Zue, and Leung of MIT's Laboratory for Computer Science, was presented by Helen Meng. This presentation covered results of careful experimental comparisons of different front-end representations (e.g.</Paragraph>
    <Paragraph position="5"> auditory, Mel-cepstrum, DFT, etc.) and various ways of incorporating acoustic attributes and distinctive features, in the context of multilayer peroeptron based vowel classification.</Paragraph>
    <Paragraph position="6"> The dual auditory model (mean rate plus synchrony representations) worked best as the front end (especially in noise, but by an insignificant margin in some other cases).</Paragraph>
    <Paragraph position="7"> Significant computational savings was possible by reducing the front-end output to a few simple acoustic attributes, and the loss in accuracy was small and probably insignificant. Using distinctive features was said to provide for the possibility of better phonological-level generalization; the loss in accuracy of incorporating features (followed by a second MLP to do the phoneme classification) was not significant. Discussion followed on possible explanations for why the auditory model works as well as it does; nobody had a sukable explanation, but the conjecture that it was primarily due to the synchrony information was shown to be not supported by the data, since the rate-only model worked almost as well.</Paragraph>
  </Section>
class="xml-element"></Paper>