<?xml version="1.0" standalone="yes"?> <Paper uid="A97-1008"> <Title>An Evaluation of Strategies for Selective Utterance Verification for Spoken Natural Language Dialog</Title> <Section position="3" start_page="41" end_page="42" type="metho"> <SectionTitle> 2 Selective Verification of Questionable User Inputs </SectionTitle> <Paragraph position="0"> Every system that uses natural language understanding will sometimes misunderstand its input.</Paragraph> <Paragraph position="1"> Misunderstandings can arise from speech recognition errors or inadequacies in the language grammar, or they may result from an input that is ungrammati-Spoken: i want to fix this circuit Recognized: power a six a circuit Spoken: the one is flashing for a longer period of time Recognized: one is flashing forth longer in a time Spoken: there is no wire on connector one zero four Recognized: stays know wire i connector one zero for cal or ambiguous. Here we focus on misunderstandings caused by speech recognition errors. Examples of misrecognized inputs from interacting with the Circuit Fix-It Shop are given in figure 1. One method for reducing the number of misunderstandings is to require the user to verify each utterance by either speaking every utterance twice, or confirming a word-by-word read back of every utterance (e.g., (Baber and Hone, 1993)). Such verification is effective at reducing errors that result from word misrecognitions, but does nothing to reduce misunderstandings that result from other causes. Furthermore, verification of all utterances can be needlessly wearisome to the user, especially if the system is working well.</Paragraph> <Paragraph position="2"> A better approach is to have the system verify its interpretation of an input only under circumstances where the accuracy of its interpretation is seriously in doubt, or correct understanding is essential to the success of the dialog. The verification is accomplished through the use of a verification subdialog---a short sequence of utterances intended to confirm or reject the hypothesized interpretation. The following example of a verification subdialog illustrates the idea.</Paragraph> <Paragraph position="3"> Computer: What is the switch position when the LED is off7 User: Up.</Paragraph> <Paragraph position="4"> Computer: Did you mean to say that the switch is up?</Paragraph> <Section position="1" start_page="41" end_page="42" type="sub_section"> <SectionTitle> User : Yes. </SectionTitle> <Paragraph position="0"> Notable features of such verification subdialogs include the following.</Paragraph> <Paragraph position="1"> * Verification is selective. A verification subdialog is initiated only if it is believed that the overall performance and accuracy of the dialog system will be improved. In this way, the dialog system responds much as a person would.</Paragraph> <Paragraph position="2"> * Verification is tunable. The propensity of the system to verify can be adjusted so as to pro- null vide any required level of speech understanding accuracy.</Paragraph> <Paragraph position="3"> * Verification operates at the semantic level. The system verifies an utterance's meaning, not its syntax. This helps overcome misunderstandings that result from inadequacies in the language model, or ungrammatical or ambiguous inputs.</Paragraph> <Paragraph position="4"> Two important definitions concerning selective verification are the following. An under-verification is defined as the event where the system generates a meaning that is incorrect but not verified. 
In section 4 we report the results of testing various strategies for deciding when to engage in verification subdialogs within a specific dialog environment, the Circuit Fix-It Shop. To understand these strategies, an overview of this environment must first be presented.

3 Dialog Environment: The Circuit Fix-It Shop

3.1 General Characteristics

The data used in this study were collected in experimental trials conducted with "The Circuit Fix-It Shop," a spoken NL dialog system constructed to test the effectiveness of an integrated dialog processing model that permits variable initiative behavior, as described in (Smith and Hipp, 1994) and (Smith, Hipp, and Biermann, 1995). The implemented dialog system assists users in repairing a Radio Shack 160-in-One Electronic Project Kit. The particular circuit being used causes the Light-Emitting Diode (LED) to alternately display a one and a seven. The system can detect errors caused by missing wires as well as a dead battery. Speech recognition is performed by a Verbex 6000 running on an IBM PC. To improve speech recognition performance we restrict the vocabulary to 125 words. A DECtalk DTC01 text-to-speech converter is used to provide spoken output by the computer.

After testing system prototypes with a few volunteers, eight subjects used the system during the formal experimental phase. After a warmup session in which each subject trained on the speech recognizer and practiced using the system, each subject participated in two sessions where up to ten problems were attempted. Subjects attempted a total of 141 dialogs, of which 118 (84%) were completed successfully.1 The average speech rate by subjects was 2.9 sentences per minute; the average task completion time for successful dialogs was 6.5 minutes. An excerpt from an actual interaction with the system is given in figure 2. The words in parentheses represent the actual sequence of words that the speech recognizer sent to the dialog system for analysis. As can be seen from the example, the system usually (but not always) understood the user utterance.

1 Of the 23 dialogs that were not completed, misunderstandings due to misrecognition were the cause of 13 of the failures. The other 10 failures had a variety of causes, ranging from inadequate grammar coverage to subject error in connecting wires.

    1.  C: This is the Circuit Fix-It Shop. How may I help you?
    2.  U: I want to fix this circuit. (power a six a circuit)
    3.  C: I do not understand. Please repeat what you said.
    4.  U: I want to fix this circuit. (i want a fix a circuit)
    5.  C: What is the ID of the circuit?
    6.  U: RS111. (rs111)
    7.  C: I am familiar with that circuit. The LED is supposed to be displaying alternately flashing one and seven.
    8.  U: The LED is displaying nothing. (be led a displaying nothing)
    9.  C: What is the switch at when the LED is off?
    10. U: The switch is down. (the switch yes down)
    11. C: Put the switch up.
    12. U: Done. (done)
    13. C: Put the knob to one zero.
    14. U: The knob is at one zero. (the knob and add one zero)
    15. C: What is the LED displaying?
    16. U: The LED is displaying a one and seven at the same time. (be led a displaying one an seven at the same time)
    17. C: Put the knob to zero.
    18. U: Done. (done)
    19. C: What is the LED displaying?
    20. U: The same thing. (the same same)
    21. C: Did you mean to say the LED is displaying the same thing?
    22. U: Yes. (yes)

Figure 2: Excerpt from an actual interaction with the system. The words in parentheses are the word sequences produced by the speech recognizer.

We next describe two features of the system that were useful in the interpretation process: (1) error-correcting parsing and (2) dialog expectation.
In section 4 we will see how these features assist in deciding when to engage the user in a verification subdialog.

3.2 Overcoming Misrecognition by Error-Correcting Parsing

Through the use of an error-correcting parser, the system was able to find the correct meaning of 81.5% of the more than 2800 input utterances, even though only 50% of these inputs were correctly recognized word for word. The parser uses a dynamic programming approach similar to (Ney, 1991) to compute the best n parses for the input. What constitutes "best" is determined by a cost matrix for the possible words in the vocabulary and the given grammar. The cost matrix defines the cost for inserting or deleting words, as well as the cost for a word substitution when such substitutions are allowed. The intent is to permit substitutions for words that sound very similar, such as "do" and "two/to/too," words that are likely to be confused by the speech recognizer. The parser performs insertions, deletions, and substitutions in order to transform the input into a grammatical utterance. With each "grammatical" utterance is associated a parse cost (PC), which is the sum of the costs of each insertion, deletion, and substitution required for the transformation. For each of the best n parses, an expectation cost (EC) is also produced according to how likely the input is to occur given the expectations. The total cost of a parse is a weighted sum of PC and EC. The values for these weights and costs have been assigned heuristically. A sketch of this style of cost computation is given below.
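The parse cost described above is essentially a weighted edit distance between the recognized word string and a candidate grammatical sentence. The following sketch is our own illustration; the cost values and the confusable-pair table are invented, not the system's actual heuristic assignments.

    # Weighted edit distance as a stand-in for the parse cost (PC).
    INSERT_COST = 1.0
    DELETE_COST = 1.0
    # Substitutions are allowed only for phonetically confusable pairs.
    SUBST_COST = {("fix", "six"): 0.3, ("can", "and"): 0.4, ("do", "two"): 0.2}

    def subst_cost(a, b):
        if a == b:
            return 0.0
        return SUBST_COST.get((a, b), SUBST_COST.get((b, a), float("inf")))

    def parse_cost(recognized, candidate):
        """Cheapest insertion/deletion/substitution cost to turn the
        recognized word sequence into the candidate sentence."""
        n, m = len(recognized), len(candidate)
        d = [[0.0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = d[i - 1][0] + DELETE_COST
        for j in range(1, m + 1):
            d[0][j] = d[0][j - 1] + INSERT_COST
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                d[i][j] = min(d[i - 1][j] + DELETE_COST,
                              d[i][j - 1] + INSERT_COST,
                              d[i - 1][j - 1] + subst_cost(recognized[i - 1],
                                                           candidate[j - 1]))
        return d[n][m]

    # One cheap substitution ("fix" -> "six") repairs the input.
    print(parse_cost("power a fix a circuit".split(),
                     "power a six a circuit".split()))  # 0.3

The total cost that ranks the best n parses would then be a weighted sum of this PC with the expectation cost EC derived from the dialog expectations.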
Figure 3 shows some of the utterances successfully handled by the implemented system during the experiment. For instance, example 1 is transformed into a grammatical utterance by substituting the phonetically similar word "six" for "fix," and "and" for "can." This example would have a relatively low parse cost, indicating that the system has high confidence in its interpretation. On the other hand, example 3 requires a large number of costly insertions and deletions, indicating a lack of confidence in the quality of the interpretation. As will be seen in section 4, this difference in scores is an important component in deciding when to engage in verification subdialogs.

3.3 Expectation in Dialog

An important phenomenon exploited in computational models of language is that of expectation (Allen, 1995). Based on the current context, there is an expectation of what is to come next. These expectations can be for phonemes, morphemes, words, or meanings. For example, if we hear the word "the," we expect the next word to be either an adjective or a noun. When the computer asks, "Is the knob position at 10?", we have the greatest expectation for a response of either "yes" or "no," lesser expectation for a sentence answer such as "The knob position is five.", and still less expectation for a clarification question or comment such as "Which knob?", "Where is the knob?", or "I do not see it." Consider the following dialog sample.

    1. C: Is there a wire between connector 84 and connector 99?
    2. U: It is not there.
    3. C: Add a wire between connector 84 and connector 99.
    4. U: I need help.
    5. C: Obtain a wire.
    6. U: Okay.

The system must use the established dialog context in order to properly interpret each user utterance, as follows.

* Utterance 2: to determine what "it" refers to (i.e., the wire from 84 to 99).

* Utterance 4: to determine what the user needs help with (i.e., adding the wire).

* Utterance 6: to determine whether "okay" denotes confirmation or comprehension (i.e., confirmation that the wire has been obtained).

Effective use of expectation is necessary for constraining the search for interpretations and achieving efficient processing of NL inputs. This is particularly crucial in spoken NL dialog, where speakers expect fast response times (Oviatt and Cohen, 1989).

The system model of expectations is similar to that of (Young et al., 1989) in that we predict the meanings of possible user responses based on the current dialog goal. The details of the system model can be found in (Smith and Hipp, 1994). Here we review the key aspects that are exploited in a context-dependent strategy for verification. We define expectations based on an abstract representation of the current task goal. For example,

    goal(user, ach(prop(Obj, PropName, PropValue)))

denotes the goal that the user achieve the value (PropValue) for a particular property (PropName) of an object (Obj).3
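The following sketch shows one way such an expectation table could be realized, continuing in Python. The tuple encoding of goals and the meaning-category names are our own hypothetical stand-ins for the system's Prolog-style representation.

    # Expected response meanings for an "achieve property" goal. The
    # category names are illustrative, not the system's actual labels.
    ACH_EXPECTATIONS = [
        "question_location_of_object",
        "question_how_to_do_action",
        "assertion_property_achieved",
    ]

    def expectations_for(goal):
        """Return expected meaning categories for an abstract task goal.

        A goal is encoded here as (mode, obj, prop_name, prop_value),
        e.g., ("ach", "switch", "position", "up"), mirroring
        goal(user, ach(prop(switch, position, up))).
        """
        mode, obj, prop_name, prop_value = goal
        if mode == "ach":
            return ACH_EXPECTATIONS
        if mode == "obs":
            # For an observation goal, the central expectation is an
            # assertion supplying the unobserved value.
            return ["assertion_property_value",
                    "question_location_of_object"]
        raise ValueError(f"unknown goal mode: {mode}")

    print(expectations_for(("ach", "switch", "position", "up")))

Section 4.2 will single out a subset of these expectations, the main expectation, as the basis for a context-only verification strategy.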
The specific values for Obj, PropName, and PropValue are filled in according to the current goal. For example, the goal of setting the switch position to up may be represented as

    goal(user, ach(prop(switch, position, up)))

while the goal of observing the knob's color would be

    goal(user, obs(prop(knob, color, PropValue)))

where PropValue is an uninstantiated variable whose value should be specified in the user input. General expectations for the meaning of user responses to a goal of the form goal(user, ach(prop(...))) include the following (a small sketch of this expectation model appears after the list).

* A question about the location of Obj.

* A question about how to do the action.

* An assertion that Obj now has the value PropValue for property PropName.

3 This is a simplification of the representation used in the system as described in (Smith and Hipp, 1994).

Even when using error-correcting parsing and dialog expectations, the Circuit Fix-It Shop misunderstood 18.5% of user utterances during the experimental testing. We now turn our attention to an empirical study of strategies for selective utterance verification that attempt to select for verification as many of the misunderstood utterances as possible while minimizing the selection of utterances that were understood correctly. These strategies make use of information obtainable from dialog expectation and the error-correcting parsing process.

4 Evaluating Verification Strategies

4.1 Strategy 1: Using Parse Cost Only

An enhancement to the Circuit Fix-It Shop described in (Smith and Hipp, 1994) allows for a verification subdialog only when the hypothesized meaning is in doubt or when accuracy is critical to the success of the dialog. The decision of whether or not a particular input should be verified is made by computing for each meaning a parser confidence score (a measure of how plausible the parser's output is; this measure is proportional to the inverse of the total cost (section 3.2) normalized for utterance length) and a verification threshold (a measure of how important the meaning is to the success of the dialog; greater importance is denoted by a higher threshold). The decision rule for deciding when to initiate a verification subdialog is specified as follows, and restated as code below:

    IF the Parser Confidence Score > the Verification Threshold
    THEN DO NOT engage in a verification subdialog
    ELSE engage in a verification subdialog
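In code, Strategy 1 reduces to a single comparison. The sketch below is ours; in particular, the exact formula inside parser_confidence is an assumption, since the paper states only that confidence is proportional to the inverse of the length-normalized total cost.

    def parser_confidence(total_cost, utterance_length):
        """Map the length-normalized total parse cost into (0, 1],
        so that cheaper parses yield higher confidence. The precise
        formula here is our invented stand-in."""
        normalized = total_cost / max(utterance_length, 1)
        return 1.0 / (1.0 + normalized)

    def should_verify(total_cost, utterance_length, threshold):
        """Strategy 1: verify unless confidence exceeds the threshold."""
        return parser_confidence(total_cost, utterance_length) <= threshold

    # A cheap parse of a six-word utterance clears a moderate threshold,
    # so no verification subdialog is initiated.
    print(should_verify(total_cost=0.3, utterance_length=6,
                        threshold=0.5))  # False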
This basic capability for verification subdialogs was not available during the 141-dialog experiment. However, simulations run on the collected data raised the percentage of utterances that are correctly understood from 81.5% to 97.4%.4 Unfortunately, besides improving understanding through verification of utterances initially misinterpreted, the system also verified 19.2% of the utterances initially interpreted correctly. An example would be asking, "Did you mean to say the switch is up?", when that is what the user originally said. These over-verifications result in extraneous dialog and, if excessive, will limit usability.

4 Consequently, the under-verification rate is 2.6%. We say that an utterance is correctly understood if it is either correctly interpreted initially, or is an utterance for which the system will engage the user in a verification subdialog. It is of course possible that the verification subdialog may not succeed, but we have not yet assessed the likelihood of that and thus do not consider this possibility during the evaluation of the various strategies.

4.2 Strategy 2: Using Context Only

The previous decision rule for utterance verification focuses exclusively on local information about parsing cost and ignores dialog context. In that situation the over-verification rate was 19.2% while the under-verification rate was 2.6%. What about using a rule based solely on context? For each abstract task goal, we define a subset of the expectations as the main expectation. This subset consists of the expected meanings that denote a normal continuation of the task. Figure 4 lists these expectations for the major task goals of the model.

* ach(prop(...)): achieving a property value. The main expectations are (1) an acknowledgment that the action is complete (e.g., "Done") and (2) an assertion that the desired property now exists (e.g., "The switch is up.").

* learn(Fact): learning a fact. The fact could concern a piece of state information (e.g., that the switch is located in the lower left portion of the circuit), that an action needs completing (e.g., "Putting the switch up is desirable."), or that a certain property should or should not be true (e.g., there should be a wire between connectors 34 and 80). In all cases, one main expectation is an acknowledgment that the Fact is understood. In the case of an action completion or a property status, there is also a main expectation either that the user completed the action (e.g., "Done" or "The switch is up") or that the property status is verified (e.g., "Wire connecting 34 and 80").

Figure 4: Main expectations for the major task goals.

For cooperative task-assistance dialog, making the assumption that the meaning of the user's utterance will belong to a very small subset of the expectations for each abstract goal allows us to define the following context-dependent decision rule for utterance verification (sketched in code below):

    IF utterance in the Main Expectation
    THEN DO NOT engage in a verification subdialog
    ELSE engage in a verification subdialog
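A minimal sketch of Strategy 2 follows, reusing the hypothetical meaning categories from the section 3.3 sketch; the contents of the main-expectation table are illustrative stand-ins for the figure 4 entries, not the system's actual tables.

    # Strategy 2: verify exactly when the interpreted meaning falls
    # outside the main expectation for the current task goal.
    MAIN_EXPECTATION = {
        "ach": {"acknowledge_completion", "assertion_property_achieved"},
        "learn": {"acknowledge_fact", "assertion_action_completed",
                  "assertion_property_status"},
    }

    def should_verify(goal_mode, meaning_category):
        """No subdialog only for a normal continuation of the task."""
        return meaning_category not in MAIN_EXPECTATION.get(goal_mode, set())

    # "The switch is up." in response to goal(user, ach(prop(switch,
    # position, up))) is a normal continuation: no verification.
    print(should_verify("ach", "assertion_property_achieved"))  # False
    # A clarification question falls outside the main expectation.
    print(should_verify("ach", "question_location_of_object"))  # True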