<?xml version="1.0" standalone="yes"?>
<Paper uid="H92-1068">
  <Title>Spontaneous Speech Effects In Large Vocabulary Speech Recognition Applications</Title>
  <Section position="3" start_page="0" end_page="339" type="metho">
    <SectionTitle>
2. ERROR ANALYSIS
</SectionTitle>
    <Paragraph position="0"> We begin by analyzing the errors that occurred in the February 1991 evaluation set of 148 Class-A sentences, for which our recognition word error rate exceeded 18%.</Paragraph>
    <Paragraph position="1"> These sentences are examined because they are believed to be a particularly difficult sampling of ATIS speech.</Paragraph>
    <Paragraph position="2">  Phonetic alignments were automatically generated corresponding to both the reference and recognized word strings, and we: listened to each utterance was listened to very carefully. The acoustic and language model scores were compared, and a subjective judgment was made as to the likely source of the error (the acoustic model, the language model, the articulation quality of the segment, or other effects such as breaths, out-of-vocabulary words, or extraneous noise).</Paragraph>
    <Paragraph position="3"> We found that 30% of the errors (Table 1) could be attributed to poor articulation or poorly modeled articulation (usually reductions, emphatic stress, or speaking rate variations), 20% were due to out-of-vocabulary words or poor bigram probabilities, 20% were due to urnnodeled pause-fillers (uh, um, breaths), and the remaining portion unexplainable, but probably due to inadequate acoustic-phonetic modeling.</Paragraph>
    <Paragraph position="4"> We see that 70% of the errors are due to effects observed in the ATIS domain, but not in the RM domain. If these errors were removed, we would approach an error rate typically seen in a comparable RM system (with a perplexity 60 wordpair grammar).</Paragraph>
    <Paragraph position="5"> [Table 1 caption (fragment): ... Feb91 ATIS evaluation set (148 sentences).]</Paragraph>
  </Section>
  <Section position="4" start_page="339" end_page="340" type="metho">
    <SectionTitle>
3. READ VS. SPONTANEOUS SPEECH
</SectionTitle>
    <Paragraph position="0"> To deterrnme the impact of spontaneous versus read speaking styles on recognition performance given a fixed training condition, a recognition experiment with two test sets was constructed. The first set contained spontaneous speech utterances; the second set contained read versions of those same utterances, given later by the same subjects.</Paragraph>
    <Paragraph position="1"> The training data consisted ofRM, TIMIT, and pilot-corpus ATIS utterances (with the read-spontaneous and spontaneous test data held out). This left rather little ATIS-specific data for training, almost none of it spontaneous. The recognition was run without a grammar (perplexity 1025) to remove any corrective effects of the grammar, so that only the acoustic effect of the spontaneous speech could be evaluated. The spontaneous test sentences were categorized as either fluent or disfluent based on the existence of special markings in their corresponding SRO* files.</Paragraph>
    <Paragraph position="2"> We found that the primary difference in error rates between the read and spontaneous test sets was due directly to disfluencies in the spontaneous speech (Table 2). Non-disfluent spontaneous speech had the same error rate as read speech.</Paragraph>
    <Paragraph position="3"> The disfluencies include pause-fillers, word fragments, overly lengthened or overly stressed function words, selfedits, mispronunciations, and overly long pauses. This list of disfluency types is derived from the special markings used in the SRO transcriptions. The observation that nondisfiuent spontaneous speech error rate approaches read speech error rate is consistent with the fact that the test speech much more closely resembles the training data. The training data was fluently and consistently articulated, just as was the non-disfluent spontaneous speech.</Paragraph>
    <Paragraph position="4">  speech and fluent spontaneous speech have equivalent error rates.</Paragraph>
    <Paragraph position="5"> The breakdown of error rate versus disfluency type (Table 3) shows that a significant portion of the errors were due to filled pauses, long pauses, lengthenings, and stress. Sentences with these disfluencies had twice the word error rate of fluent speech. The filled pause errors happened because there were no models for breath/uh/um events in this particular recognizer's lexicon. The stress and lengthening errors happened (most likely) because of the lack of sufficient observations of these events in the training data, and because of the lack of explicit models for these effects. The long pauses usually caused insertions within the pause regions neighboring the phrase-initial and phrase-final words.</Paragraph>
    <Paragraph position="6"> From these observations, we conclude that more training data containing these effects would improve the match between the acoustic models and the spontaneous test speech, and therefore would improve the recognition performance. Furthermore, these effects should be explicitly modeled m the recognizer's lexicon, once sufficient training data is obtained. However, this process depends on the reliability of the SRO labeling across sites, which tends to be subjective and inconsistent.</Paragraph>
    <Paragraph position="7"> *The SRO transcription contains a detailed description of all the acoustic events occurring in a utterance.</Paragraph>
    <Paragraph position="8">  disfluency type, and the percentage of occurrences where the disfluency causes an error.</Paragraph>
  </Section>
  <Section position="5" start_page="340" end_page="341" type="metho">
    <SectionTitle>
4. TRAINING DATA VARIATIONS
</SectionTitle>
    <Paragraph position="0"> Further evidence for the importance of modeling spontaneous phenomena is found by manipulating the content of the training data sets that are used for acoustic-phonetic modeling. In this experiment, we compare spontaneous speech recognition performance given different combinations of read, spontaneous, ATIS, and non-ATIS training subsets.</Paragraph>
    <Paragraph position="1"> The training subsets (Table 4) consist of the standard RM and TIMIT training data, and read and spontaneous subdivisions of all the ATIS and MADCOW data available as of October 1, 1991. The &amp;quot;Breaths&amp;quot; corpus refers to an internally collected database of inhalations and exhalations, used to train a breath model, which is allowed to occur optionally between words during recognition. Much of the ATIS-read data was also collected intemally at SRI.</Paragraph>
    <Paragraph position="2">  in various ways to determine the impact of read and spontaneous training data on recognition of spontaneous speech.</Paragraph>
    <Paragraph position="3"> Recognition was conducted using a development test-set of 447 spontaneous MADCOW utterances \[3\], with a perplexity 20 bigram grammar trained on all the available spontaneous speech transcriptions (roughly 10,000 sentences). All of the experiments outlined below use discrete-distribution HMMs, and every training set combination includes the 800 breath utterances.</Paragraph>
    <Paragraph position="4"> Using all the available ATIS and MADCOW data yielded a system with a word error rate of 9.6% (Table 5). Using only spontaneous ATIS speech reduced performance by only 6%, to 10.2% word error. Training with a roughly equivalent quantity of read ATIS speech increased the error rate significantly, by 58% to 15.2%. This suggests that having gaining data which is consistent in speaking mode with the test data can significantly improve performance. However, the effect of lexical and phonetic coverage in the training sets might be a factor in causing this performance difference. This issue is discussed in Section 5.</Paragraph>
    <Paragraph position="5">  This table indicates that having speaking-mode-consistent data is a major contributor to performance improvement.</Paragraph>
    <Paragraph position="6"> We also look at the impact of using non-ATIS read speech for additional training data (Table 6). Using successively more training data gives the expected result, an improvement in performance. However, when using all the available data (RM, TIMIT, ATIS and MADCOW), the performance matches that of the system gained exclusively on ATIS and MADCOW data. Furthermore, the performance of the system trained using all the available read speech (16,922 sentences) performed much worse than the system gained only on spontaneous speech (7,545 sentences).</Paragraph>
    <Paragraph position="7">  The error rates is reduced when ATIS-read data is added, and is reduced further when ATIS-spontaneous data is added.</Paragraph>
    <Paragraph position="8">  We can conclude from these experiments that having speaking-mode-consistent training data is more important than simply having a large quantity of training data. However, we cannot be certain that the phonetic content of the ATIS-spontaneous training set better matches the development set than the ATIS-read training set. This issue is addressed in the next section.</Paragraph>
    <Paragraph position="9"> We compared the errors of two different recognizers used on the same test set of spontaneous speech. Both recognizers were trained on a comparable number of utterances, but one recognizer was trained on read speech only (TIM1T+R-M+ATIS-Read), and the other on spontaneous speech only (ATIS-Spontaneous). We found that substitutions of one function word for another form a significant portion of the errors in both test sets, and in roughly the same proportions. However, there were significantly fewer substitutions of content words for other content words for the recognizer trained on spontaneous speech compared to the recognizer trained on read speech.</Paragraph>
    <Paragraph position="10"> Similarly, the recognizer trained on spontaneous speech manifested significantly fewer errors in substitution of a pause filler for a function word. &amp;quot;Homophone&amp;quot; errors, which can lead to understanding errors, formed a significant portion of the errors in the recognizer trained on read speech, although almost none of these appeared for the recognizer trained on spontaneous speech. We believe that this is because many words that can be homophonous in read speech (&amp;quot;for&amp;quot;-&amp;quot;four&amp;quot; and &amp;quot;to&amp;quot;-&amp;quot;two&amp;quot;, for example) are no longer homophones in spontaneous speech (&amp;quot;fer&amp;quot;-&amp;quot;four&amp;quot; and &amp;quot;tuh&amp;quot;-&amp;quot;two&amp;quot;).</Paragraph>
  </Section>
class="xml-element"></Paper>