<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-1018">
  <Title>UNDERSTANDING SPONTANEOUS SPEECH</Title>
  <Section position="4" start_page="137" end_page="137" type="metho">
    <SectionTitle>
LIMITATIONS OF CURRENT RECOGNIZERS
</SectionTitle>
    <Paragraph position="0"> Current state-of-the-art speech recognition systems make several assumptions about the input in order to increase performance:  These assumptions allow the system to enforce constraints of continuity and gramaticallity. That is, they attempt to find a grammatical sequence of words that spans the entire utterance. Some word model (or silence) must be matched against all areas of the input. The input is searched left-to-fight for legal sequences of words. Previously recognized word boundaries are used as the starting point for subsequent words and only words constituting legal extensions of current paths are considered. Legal word sequences are defined by a language model. This model may be a grammar or sequence transition probabilities derived from a corpus. If the recognizer does not correctly recognize a portion of the input, for subsequent portions of the input it is no longer searching for the correct words at the correct boundaries. This leads to misrecognition, and the user has no option but to repeat the sentence, perhaps rephrasing it.</Paragraph>
    <Paragraph position="1"> These constraints serve to reduce the search space for an utterance. Giving up grammar constraints during recognition may allow the system to recover more quickly after an error, but there will be more errors in well-formed utterances due to lesser constraint and the resulting strings must still be parsed. Likewise, word-spotting (starting every word at every frame) to produce a word lattice is not enough. Words must still be joined into sequences to form a sentence. It is neccessary to allow interruptions in the grammar and in the recognition. The recognizer must be allowed to search for words that do not form grammatical extensions of a current hypothesis. It must also allow some areas to go unmatched (in the case of an unknown word).</Paragraph>
  </Section>
  <Section position="5" start_page="137" end_page="138" type="metho">
    <SectionTitle>
TECHNIQUES FOR TEXT INPUT
</SectionTitle>
    <Paragraph position="0"> Many of the same types of problems exist in typed natural language interfaces. Work has previously been done on parsing typed extra-grammatical input of this sort (Carbonell &amp; Hayes 1984, Hayes &amp; Carbonell 1981, Weischedel  &amp; Black 1980, Weischedel &amp; Sondheimer 1987). Hindle (1983) processed transcripts of speech using a Mracusstyle parser. This work basically represents two approaches to handling ill-formed input: 1. Look for patterns in the syntax and have an associated action for each pattern. These methods require finding the &amp;quot;editing signal&amp;quot; which indicates a specific pattern that the system knows how to recover from.</Paragraph>
    <Paragraph position="1"> 2. Look for gaps or redundancies in the semantics. Account for as much of the input as possible and then  use the overall semantics to help define the proper response.</Paragraph>
    <Paragraph position="2"> Carbonell &amp; Hayes (1984) point out the importance of semantic information in parsing extra-grammatical input. The notion is to &amp;quot;step back&amp;quot;, that is look at the other portions of the utterance and look for gaps or repetitions in  semantic information. They discuss the suitability of three general parsing strategies for recovering from ill-formed input and ellipsis.</Paragraph>
    <Paragraph position="3"> * Network Parsers - These include ATN's and semantic grammars. It is very hard to &amp;quot;step back and take a broad view&amp;quot; with these parsers. Too much is encoded locally in state information. Networks are naturally top-down left-to-fight oriented.</Paragraph>
    <Paragraph position="4"> * Pattern Matching Parsers - Partial pattern matches can be allowed which gives some ability to &amp;quot;step back&amp;quot;,abut there is no natural way to differentiate between how important constituents are. That is, the grammar is &amp;quot;uniformly represented&amp;quot;.</Paragraph>
    <Paragraph position="5"> * Case Frame Parsers - These allow the ability to &amp;quot;step back&amp;quot;. They provide a convienient mechanism for  using semantic and pragmatic information. Semantic components or cases can be compared instead of syntactic structures. &amp;quot;In brief, the encoding of domain semantics and canonical structure for multiple surface manifestations makes case frame instantiation a much better basis for robust resolution than semantic grammars.&amp;quot;  The general idea is to isolate the error and use recognized areas on both sides to give more information as to what is missing or repeated. The entire utterance is parsed, filling in as much of the case frame as possible. If there is unparsed input and the frame is complete, the input can be treated as spurious. If there is a gap in the structure (unfilled elements) then the unrecognized element was probably a filler for that component. If the same case is filled by more than one element, then the first can be ignored. The user should be made aware of any of these conditions. If there is a gap in the semantics, the system must engage in a clarification dialog with the user. This interaction can be very focused since the system now has an expectation of the semantic type that is missing. Unfortunately, we cannot use their recovery strategies directly. We wish to use grammar predictively to constrain the word search. In speech the correct input string is not known and only strings that are searched for are produced. For example, it is obvious in a typed interface when the system is given an unknown word. A speech recognizer will never produce a word not in its lexicon. The effect of an unknown word in the input is that all words in the system lexicon that are legal extensions of current paths are matched against that area of the input. Those that match sufficiently well will extend their paths across the area, but the correct word will of course not be searched for. Unless some other word has an acceptable acoustic match and similar grammatical role, no path will be correctly aligned with the input. Similarly, such a system will never produce a restart sequence unless it is specifically searched for. As in the text input systems, we wish to use sentence fragments on both sides of a problem area to help determine what is missing. This means being able to recognize portions of the utterance that follow an unrecognized region. For this we must depart from the strict left-to-right grammatical extension control strategy.</Paragraph>
  </Section>
  <Section position="6" start_page="138" end_page="139" type="metho">
    <SectionTitle>
PROCESSING SPONTANEOUS SPEECH
</SectionTitle>
    <Paragraph position="0"> At CMU we are developing a system (called Phoenix) for recognizing spontaneous speech. This system uses the HMM word models developed in the Sphinx system (Lee 1989). It relies on specific modelling of acoustic features and a flexible control structure to process natural speech. We are currently implementing this system for a spreadsheet task.</Paragraph>
    <Paragraph position="1"> We want to specifically model the acoustic features of spontaneous speech. This includes phenomena like lengthening phonemes and filled pauses. We created new phonemes and words for several classes of filled pauses(uh, er, um, ah, etc). We are gathering a corpus of spontaneous speech for users engaged in a spreadsheet task. The phone models for the system will be trained on this corpus. This training will be in addition to, not instead of the current training set.</Paragraph>
    <Paragraph position="2"> The control structure for the recognizer is based on recognizing phrases rather than sentences. Input is viewed as a series of phrases instead of sentences with well defined boundaries. The system has a grammar which defines legal word sequences. These represent complete sentences as well as phrases which aren't embedded in a sentence. A phrase may be as short as a word or as long as a complete sentence. The system has a set of &amp;quot;meanings&amp;quot; or concepts which represent the information to be transferred. Each meaning is represented by a network that contains all surface strings or phrases for expressing the concept. Additionally there are semantic structures which represent the actions that the system can take. These structures are very similar to case frames in that they contain slots for meanings or information required to complete an action. Unusual constituent ordering is allowed by allowing meanings within a structure to occur in any order.</Paragraph>
    <Paragraph position="3"> The input is processed left-to-right using the grammar to search for phrases. All phrases are searched for after detection of a pause or interrnption. Phrases are not deleted when they can no longer be extended. As phrases are recognized, they are assigned a meaning and attached to the appropriate semantic structures. A single phrase or sequence of phrases may be necessary to complete the semantics of a structure. No single structure may contain phrases overlapping in time and multiple structures may be competing for instantiation.</Paragraph>
    <Paragraph position="4"> The idea is to concentrate on recognition of &amp;quot;meaning units&amp;quot; not sentences. Phrases themselves must be well-formed but need not combine into a grammatical sentence. Grammar is used as a local constraint to govern the grouping of words into phrases. Global constraints come from the semantics of the system which govern the combining of a sequence of meanings into a defined action.</Paragraph>
    <Paragraph position="5"> With this system we can process spoken input with strategies similar to those used by CarboneU &amp; Hayes. Here  either an incorrect word or an unmatched area. These words may be important, that is represent semantics necessary for interpreting the utterance, or they may be extraneous. If they are extraneous, the frame will be complete and they may be ignored. If they are important, there will be a gap in the semantics. A slot will be unfilled in an otherwise complete frame.</Paragraph>
    <Paragraph position="6"> * Spurious words or phrases - These will leave part of the input unaccounted for but the utterance will be semantically complete.</Paragraph>
    <Paragraph position="7"> * Restarts - The restarted phrase may be truncated or complete. If complete, the structure will have two phrases competing for the same slot. In this case, the first phrase can be ignored. In the case of a truncated phrase, the structure will have a gap in its coverage of the input but the semantics will be complete. In this case the truncated phrase is ignored. Truncated phrases are an explicit signal to look for a restart.</Paragraph>
    <Paragraph position="8"> * Out of order constituents - are not a problem since no ordering is imposed.</Paragraph>
    <Paragraph position="9"> * Elliptical or telegraphic input - The system naturally recognizes these. They represent speaking only the neccessary information with minimal phrasing. Semantic structures provide a convienient mechanism for specifying what is &amp;quot;understood&amp;quot; in a situation and therefore can be left out of the utterance. As an example, consider processing a restarted phrase like &amp;quot;go down a screen .. screen's worth&amp;quot;. This is an example of a PAGE command with the slots \[move-up\] \[integer\] \[screen\]. The individual phrases are recognized as  the \[screen\] meaning superseedes the first giving the correct interpretation &amp;quot;go down a screen's worth&amp;quot;. It is not sufficient to simply ignore unrecognized areas without classifying them. Consider the sequence &amp;quot;under finance enter fifty dollars ... under utilities enter thirty dollars .. under credit card enter ten dollars&amp;quot;. If &amp;quot;finance&amp;quot; is not in the lexicon (and therefore not recognized), the system can't simply ignore it and go on. This would result in the erroneous parse &amp;quot;enter fifty dollars under utilities&amp;quot;. This sort of problem is less severe in an interactive situation than when processing in the background. Prosodic cues can be very useful in resolving this type of situation. Initially we are filtering out filled pauses, interjections and cue phrases. The only prosodic features used are pauses. Later we will incorporate these into the system since they are useful in resolving ambiguous situations. In the last example, if the input had been &amp;quot;under finance enter fifty dollars .. okay., under utilities enter thirty dollars .. fine, now under credit card enter ten dollars&amp;quot;, the cue phrases &amp;quot;okay&amp;quot; and &amp;quot;fine now&amp;quot; would indicate that &amp;quot;enter fifty dollars&amp;quot; associated with some unrecognized item (&amp;quot;finance&amp;quot;) while &amp;quot;enter thirty dollars&amp;quot; associates with &amp;quot;utilities&amp;quot;. Recovery cannot always be automatic. It will sometimes be neccessary to interact with the user to resolve the problem. However, since the system has information as to what is most likely missing (the unfilled slots) the interaction can be much more focused than a general request to repeat or paraphrase.</Paragraph>
    <Paragraph position="10"> In order to deal with unknown or mispronounced words, we must have better estimates of the quality of a recognized string. Currently most recognizers represent a path by a single score which represents its overall quality. There is no indication of whether some parts of the input are very good matches and others very poor or the quality was fairly uniform. The quality of the acoustic match can be monitored at several levels (vq, state, phoneme, word, phrase, structure) and the resulting pattern used to help classify the recognition. Quality is a relative term here. We propose to keep running means and variances for the speaker at each of these levels so that variances from the norm for this speaker not absolute measures will be used. This will aid the system in detecting when a correct path is going awry. The system will of course not produce an unknown word but it can detect that no acceptable matches are found for a region.</Paragraph>
  </Section>
class="xml-element"></Paper>