<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2007">
  <Title>Modelling Non-verbal Sounds for Speech Recognition</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Recent experiments performed by two groups of researchers at CMU have gathered data on subjects using speech recognizers in office-like environments (Rudnicky et al., 1989; Stern &amp; Acero, 1989). These experiments are presented by the authors in these proceedings. Among other things, they show that non-verbal events (non-stationary noises) create serious problems for speech recognizers. These sounds are generated both by the speaker and by the environment. Examples of noise generated by the speaker are breath noises, lip smacks, paper rustles, filled pauses, coughs, throat clearing, etc. Environmental noise includes phone rings, door slams, other speakers in the background, typing, etc. We attempt to explicitly model the classes of noise represented by these events in the context of an HMM-based speech recognizer (Sphinx). Subjects were recorded performing two tasks: a spreadsheet task and a census data (alphanumeric) task. A significant percentage (approximately 10% overall in each task) of the utterances contain phenomena of the type mentioned above. The utterances were transcribed using a set of noise words to represent non-signal events in the recording. Fourteen noise words were used: AH, BEEP, BREATH_NOISE, CLEAR_THROAT, COUGH, DOOR_SLAM, MOUTH_NOISE, MUMBLE, RUSTLE, PHONE_RING, SNIFF, SNEEZE, TAP and THUMP. For each of these noise classes, a phone was added to the phone set and a word consisting of only that phone was added to the lexicon. The standard Sphinx training routines were then used to train context-dependent models for all phones except those representing noise. Context-independent models were used for the noise phones. Because each noise word is a single token, its word model supplies no within-word context, and we did not use between-word models. For recognition, noise words are treated like Silence words: they are allowed to occur after any word, including themselves and other noise words.
We use the Sphinx recognizer with only minor modifications to implement transitions to noise words and to allow utterances that consist only of noise or Silence.</Paragraph>
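    <Paragraph position="1"> The lexicon extension and the transition rule described above can be illustrated with a minimal sketch. It is not the Sphinx implementation: the function names, the "+WORD+" phone naming, and the caller-supplied grammar predicate are all assumptions made for illustration, and the spellings BREATH_NOISE and MOUTH_NOISE are our reading of the noise word list.

```python
# Illustrative sketch only; not taken from Sphinx.
# The fourteen noise words from the transcription scheme
# (BREATH_NOISE and MOUTH_NOISE are assumed spellings).
NOISE_WORDS = [
    "AH", "BEEP", "BREATH_NOISE", "CLEAR_THROAT", "COUGH", "DOOR_SLAM",
    "MOUTH_NOISE", "MUMBLE", "RUSTLE", "PHONE_RING", "SNIFF", "SNEEZE",
    "TAP", "THUMP",
]


def noise_phone(word):
    """Name of the single phone for a noise word (hypothetical naming scheme)."""
    return "+" + word + "+"


def add_noise_words(phone_set, lexicon, noise_words=NOISE_WORDS):
    """Return copies of the phone set and lexicon with one new phone per
    noise class and one word made of only that phone."""
    phones = set(phone_set)
    lex = dict(lexicon)
    for word in noise_words:
        phone = noise_phone(word)
        phones.add(phone)
        lex[word] = [phone]  # a noise word is a single-phone word
    return phones, lex


def is_context_dependent(phone, noise_words=NOISE_WORDS):
    """Noise phones get context-independent models; all others are context dependent."""
    return phone not in {noise_phone(w) for w in noise_words}


def can_follow(prev_word, next_word, grammar_allows, fillers):
    """Fillers (noise words and Silence) may follow any word, including
    themselves and other fillers; all other transitions defer to the
    caller-supplied grammar predicate (a hypothetical hook)."""
    if next_word in fillers:
        return True
    return grammar_allows(prev_word, next_word)
```

Keeping every noise class as a one-phone word means the existing training and decoding machinery sees noise as ordinary lexical entries, which matches the statement that only minor recognizer modifications were needed.</Paragraph>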
  </Section>
</Paper>