<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-3006">
  <Title>Error Detection and Recovery in Spoken Dialogue Systems</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 MERCURY Error Recovery Strategy
</SectionTitle>
    <Paragraph position="0"> The MERCURY system, accessible via a toll-free number (see footnote 2), provides information about flights available for over 500 cities worldwide. We have invested considerable effort into making MERCURY intuitive to use and robust in handling the wide range of ways users might express their flight constraints and select the flights of the itinerary. A typical user begins by logging on, providing both his name and password, which allows the system to look up some personalized information such as the e-mail address and the preferred departure city. MERCURY's dialogue plan involves arranging a trip one leg at a time.</Paragraph>
    <Paragraph position="1"> Once the itinerary is fully specified, MERCURY offers to price the itinerary and, subsequently, to send a detailed record of the itinerary to the user via e-mail, which can then be forwarded to a travel agent for the actual booking.</Paragraph>
    <Paragraph position="2"> A critical aspect of flight dialogues is the successful communication of the source, destination, and date, all of which are susceptible to recognition error. MERCURY's default policy is to use implicit confirmation to communicate to the user its interpretation of his utterances. In the meantime, it monitors the evolution over time of these three critical attributes. When it detects odd behavior, it switches into a mode where keypad entry is solicited. The keypad entry is matched against existing hypotheses and, if a successful match is obtained, is assumed to be correct.</Paragraph>
    <Paragraph position="3"> Otherwise, a verbal confirmation subdialogue, soliciting a &quot;yes/no&quot; answer, is invoked.</Paragraph>
    <Paragraph position="4"> For source and destination, the system tabulates at each turn whether the attribute was inherited, repeated, or changed. If a change is detected after flights have already been retrieved, the system prompts for spoken confirmation of the surprising move, anticipating possible recognition error. After two consecutive turns where the user has either apparently repeated or replaced the departure or arrival city, the system requests the user to enter the city by spelling it using the telephone keypad. This strategy is also used if a substitution/repetition of the city is followed by an utterance that is not understood, or whenever the user explicitly requests to change the departure or arrival city. It turns out that MERCURY's 500 cities are uniquely identifiable through their keypad codes; however, if this were not the case, a follow-up disambiguation subdialogue could be arranged. This keypad mechanism also provides the opportunity to confirm whether the desired city is known or unknown.</Paragraph>
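    <Paragraph> The trigger conditions described above can be sketched as a small per-attribute tracker. This is only an illustrative reading of the policy, not MERCURY's implementation: the class name, the status labels, and the method signature are all assumptions introduced here.</Paragraph>

```python
from dataclasses import dataclass

# Per-turn status labels for one attribute (source or destination).
# These label strings are hypothetical, chosen for this sketch.
INHERITED = "inherited"
REPEATED = "repeated"
CHANGED = "changed"
NOT_UNDERSTOOD = "not_understood"


@dataclass
class CityTracker:
    """Tracks how one city attribute evolves across turns (hypothetical helper)."""
    previous: str = INHERITED

    def should_request_keypad(self, status: str, explicit_change_request: bool = False) -> bool:
        suspicious_before = self.previous in (REPEATED, CHANGED)
        suspicious_now = status in (REPEATED, CHANGED)
        self.previous = status
        if explicit_change_request:
            # The user explicitly asked to change the departure or arrival city.
            return True
        # Two consecutive repeat/replace turns, or a repeat/replace turn
        # followed by an utterance that was not understood.
        return suspicious_before and (suspicious_now or status == NOT_UNDERSTOOD)


tracker = CityTracker()
print(tracker.should_request_keypad(CHANGED))   # False
print(tracker.should_request_keypad(REPEATED))  # True
```

    <Paragraph> One tracker instance per attribute keeps the source and destination histories independent, which matches the per-attribute tabulation described in the text.</Paragraph>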
    <Paragraph position="5"> A similar process takes place for dates. If the user appears to repeat the date, without providing any other information, there is the suspicion that a misrecognized date has again been misrecognized the same way. In this case, the system tries to find an alternative hypothesis for the date by re-examining the N-best list of recognizer hypotheses, and, in any case, also asks for user confirmation. As is the case for cities, the system invokes the keypad upon repeated date corrections.</Paragraph>
    <Paragraph position="6"> Footnote 2: 1-877-MIT-TALK.</Paragraph>
    <Paragraph position="7"> Figures 1 and 2 provide two examples of user dialogues involving keypad city entry. Figure 1 illustrates a dialogue where the conversation is clearly confused, and the system eventually takes the initiative to invite a keypad entry of the departure city. The user wanted to go to &quot;Austin&quot;, which the system misunderstood as &quot;Boston&quot;. This particular user had a default departure city of &quot;Boston&quot;, which caused the system to suppose that the user had requested a pragmatically unreasonable flight from &quot;Boston&quot; to &quot;Boston&quot;. The user's follow-up fragment, &quot;Austin, Texas&quot;, was correctly understood, but misinterpreted as the departure city instead of the arrival city, leading to further confusion. It was only after the user had cleared up the confusion, with the complete utterance, &quot;I would like to fly from Boston, Massachusetts to Austin, Texas,&quot; that the system was finally on the right track, but by this point, it had identified difficulty with the source, reacting by launching a keypad entry request, with subsequent resolution.</Paragraph>
    <Paragraph position="8"> Figure 2 shows an example subdialogue where the destination city was successfully entered using the telephone keypad, based on an explicit request on the part of the user to change the destination. Interestingly, the user delayed the correction until the system offered the opportunity to change any constraint that was already specified. This particular user probably believed that it was necessary to respond to the prompts first, although it is conceivable that the delayed response was due to inattentiveness.</Paragraph>
    <Paragraph position="9"> This dialogue thus reveals some of the potential difficulties encountered due to users' false assumptions about the system's behavior.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 MERCURY Analysis
</SectionTitle>
    <Paragraph position="0"> We have been collecting MERCURY data over the telephone for the past several years (Seneff and Polifroni, 2000), involving user interactions with the system to make flight reservations. In examining these dialogues, we have come to the realization that, while keypadding the date (as a four-digit numeric code for month and day) seems to be intuitive to users and therefore an effective mechanism for correcting misunderstandings, the same mechanism is far less effective in the case of city names.</Paragraph>
    <Paragraph position="1"> A detailed analysis has thus been performed on all instances where the system requested a source or destination entry via the keypad, and the user's reactions to the requests were observed and quantified. We found that this strategy, when users were compliant, was generally successful for determining the user's desired source or destination. For example, if the user were to enter &quot;3387648&quot;, the system would understand &quot;DETROIT&quot;, and the dialogue would smoothly continue.</Paragraph>
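    <Paragraph> As a minimal sketch of how a keypad sequence such as the one above maps to a city name, the following builds a lookup from keypad codes to names using the standard telephone letter layout. The function names and the four-city list are illustrative assumptions and do not reflect MERCURY's actual vocabulary or code.</Paragraph>

```python
# Standard telephone keypad letter groups.
KEYPAD = {
    "2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
    "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ",
}
LETTER_TO_DIGIT = {letter: digit for digit, letters in KEYPAD.items() for letter in letters}


def keypad_code(name: str) -> str:
    """Spell a name as keypad digits, ignoring spaces and punctuation."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in name.upper() if ch in LETTER_TO_DIGIT)


def build_index(cities):
    """Map each keypad code to the list of city names that share it."""
    index = {}
    for city in cities:
        index.setdefault(keypad_code(city), []).append(city)
    return index


# Toy city list for illustration only; not the MERCURY vocabulary.
index = build_index(["DETROIT", "BOSTON", "AUSTIN", "DENVER"])
print(index.get("3387648"))  # ['DETROIT']
```

    <Paragraph> A code whose list holds more than one name would be the case that calls for a follow-up disambiguation subdialogue; a code absent from the index corresponds to a misspelled or out-of-vocabulary entry.</Paragraph>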
    <Paragraph position="2"> In addition to many successful responses, however, several errorful responses were also observed, including misspelled words (e.g., &quot;TEIPEI&quot; for &quot;TAIPEI&quot;), out-of-vocabulary words (e.g., &quot;DOMINICA&quot;), or a string of valid references that could not be resolved as a single place name (e.g., &quot;NEWYORKJFKNYC&quot; for &quot;New York's Kennedy Airport&quot;). A user time-out or hang-up was also common, and constituted a significant number of responses.</Paragraph>
    <Paragraph position="3"> A total of 172 instances were observed in which the system prompted users to enter a source or destination via the keypad. The number of occurrences is rather low since this solicitation was only activated as a last resort.</Paragraph>
    <Paragraph position="4"> The system then entered a state where speech was not an option. The users' responses to these prompts are summarized in Table 1. Most surprising is that nearly half of the time, the user did not even attempt to use the keypad.</Paragraph>
    <Paragraph position="5"> In only 88 of the cases did the user actively enter a keypad code. The user let a time-out occur in 50 cases, and hung up the telephone in an additional 34 cases.</Paragraph>
    <Paragraph position="6"> [Table 1 caption (beginning lost): ... to enter a source or destination using the telephone keypad. Table body not recovered.] This attempt rate of 51.1% is significantly lower than originally hoped. Even within the 88 compliant cases, the results are disappointing, as shown in Table 2. In 61 cases, the keypad sequence entered by the user corresponded to a valid city or airport name. Most of these were known to the system and were processed successfully. The remaining 30.7% of attempts consisted of misspellings (such as a double tap on a key, substituting the number '0' for the letter 'o', or terminating with '*' instead of '#') or apparent garbage.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Table 2 (column headers: Description, Count, Percentage; table body not recovered)
</SectionTitle>
      <Paragraph position="0"> [Table 2 caption (beginning lost): ... source or destination city or airport name using the telephone keypad after being prompted by the system.]</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Discussion
</SectionTitle>
      <Paragraph position="0"> Our results suggest that the strategy of prompting for keypad entry of questionable parameters shows potential for recovering from situations in which the system is confused about what the user has said. We believe that such recovery can contribute to successful dialogue completion, as well as elevating the user's tolerance level. Nevertheless, our results also pose two questions that need to be addressed: why do some users' attempts at keypad entry contain errors, and, more importantly, why do some users not even attempt keypad entry? It is not possible to know why an individual user was unable to enter a valid keypad sequence; we had no mechanism to interview users about their behavior. We can, however, speculate that the errorful sequences were due to the non-intuitive nature of spelling with a telephone keypad, a user's unfamiliarity with the spelling of a given word, typos, or a user's confusion as to what qualified as an acceptable entry (e.g., Are abbreviations and nicknames allowed?).</Paragraph>
      <Paragraph position="1"> We must also acknowledge the fact that what qualifies as a valid keypad sequence depends on the spelling correction capabilities of the system. Even a simple spelling checker (not utilized during the MERCURY data collection) could potentially allow the system to make sense of an errorful keypad sequence.</Paragraph>
      <Paragraph position="2"> In the case of a time-out, it is difficult to know what each user was thinking as he waited. It is likely that the user was hoping for a return to speech mode after the time-out. The user may have hesitated for fear of sending the system down an even more divergent path. It is also possible that users were inattentive when the system instructed them to terminate with the pound key, and that they therefore entered the entire city, but without a termination code. Clearly a strategic modification to automatically generate a '#' after a significant pause might help reduce this type of error.</Paragraph>
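      <Paragraph> The modification suggested above, treating a long pause as an implicit '#', might look like the following sketch. The three-second threshold, the function name, and the event format are assumptions made for illustration; the source does not specify them.</Paragraph>

```python
def finalize_entry(key_events, pause_seconds=3.0, now=None):
    """Return the entered digit string once it is terminated, else None.

    key_events: list of (timestamp_seconds, key) tuples in arrival order.
    An entry is complete when '#' is pressed or, as the proposed
    modification, when no key has arrived for pause_seconds.
    """
    digits = []
    last_time = None
    for timestamp, key in key_events:
        if key == "#":
            return "".join(digits)  # explicit termination by the user
        digits.append(key)
        last_time = timestamp
    if digits and now is not None and now - last_time >= pause_seconds:
        return "".join(digits)  # implicit '#' after a long pause
    return None  # still waiting for more keys


events = [(0.0, "2"), (0.5, "6"), (1.0, "7")]
print(finalize_entry(events, now=2.0))  # None
print(finalize_entry(events, now=4.5))  # 267
```

      <Paragraph> The pause threshold trades responsiveness against the risk of cutting off a slow typist, so in practice it would need to be tuned on real user data.</Paragraph>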
      <Paragraph position="3"> The reason for a hang-up is more obvious, given the dialogue context. For example, if the user had repeatedly said that he wanted to fly to Anchorage and the system had already hypothesized three other cities, it is understandable that he would have hung up in frustration.</Paragraph>
      <Paragraph position="4"> The telephone keypad would seem to be a very practical mode of information entry given its physical accessibility and limited ambiguity per key. This small set of data in the flight domain, however, suggests that it is confusing, annoying, or simply intimidating to many users.</Paragraph>
      <Paragraph position="5"> The next challenge, then, is to utilize a similar error recovery strategy, but to adopt a different mode of information entry, one that is more intuitive and less intimidating. We discuss such an option in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Spoken Spelling
</SectionTitle>
    <Paragraph position="0"> Allowing a user to spell a word has several benefits, including maintaining a single mode of communication (i.e., speech), as well as being less taxing, more efficient, and more intuitive. Our goal is to make the user feel confident that spelling a city name is a plausible request and that it can be the most effective path to task completion.</Paragraph>
    <Paragraph position="1"> Undeniably, spelling recognition comes with its own set of problems, especially misrecognition of the spoken letters. One way to minimize such errors is to incorporate limited spelling checking, such as allowing a single insertion, deletion, or substitution per word. For example, a spelling sequence recognized as &quot;T E N V E R&quot; could be mapped to &quot;D E N V E R&quot; as the closest match in the database. Obviously, a trade-off exists where overgenerous spelling correction could lead to a false hypothesis. A great challenge in developing conversational systems is that dialogue strategies can only evolve through extensive experimentation, which requires a large amount of data, particularly for situations that occur rarely in actual dialogues. To expedite development and evaluation of the recovery strategy, we decided to make use of simulated user data to artificially continue MERCURY dialogues beyond the point where the system had originally asked for a keypad entry, as described in the next section.</Paragraph>
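    <Paragraph> The limited spelling check described above (at most one insertion, deletion, or substitution per word) can be sketched as follows. The function names and the three-word lexicon are hypothetical, and the sketch makes no claim about how the actual system implements its checker.</Paragraph>

```python
def within_one_edit(a: str, b: str) -> bool:
    """True if a equals b or differs by one insertion, deletion, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while len(a) > i and a[i] == b[i]:
        i += 1  # skip the common prefix
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # one substitution at position i
    return a[i:] == b[i + 1:]          # one extra letter in b at position i


def correct(spelling: str, lexicon):
    """Return lexicon entries within one edit of the recognized spelling."""
    return [word for word in lexicon if within_one_edit(spelling, word)]


# Toy lexicon for illustration only.
print(correct("TENVER", ["DENVER", "BOSTON", "DETROIT"]))  # ['DENVER']
```

    <Paragraph> Returning every entry within one edit, rather than a single best match, makes the trade-off visible: a larger lexicon or a looser threshold yields more candidates and hence more opportunities for a false hypothesis.</Paragraph>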
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 User Simulation
</SectionTitle>
    <Paragraph position="0"> To streamline exploration of alternative dialogue strategies for error recovery, we have implemented a simulated user that speaks and spells a city name using DECTalk.</Paragraph>
    <Paragraph position="1"> A block diagram of our simulated user system is shown in Figure 3. Each synthesized waveform (see footnote 3) contains a pronunciation of the city name that a user was trying to communicate in the original dialogue, immediately followed by a spoken spelling of that city name (e.g., &quot;Boston B O S T O N&quot;). The waveform is passed to a first-stage speech recognizer, which treats the spoken word as an unknown word and proposes an N-best list of hypothesized spellings for the synthesized letter sequence. For speech recognition, we use the SUMMIT framework (Glass et al., 1996), and the unknown word is modeled according to techniques described in (Bazzi and Glass, 2002).</Paragraph>
    <Paragraph position="2"> Following the first-stage recognition, a two-stage matching process first consults a list of &quot;cities in focus&quot; that were extracted as hypotheses from the original user's final utterance before the keypad turn. Subsequently, if a match or conservative partial match is not found from the short list, a large database of 17,000 city and state names is consulted for a match or a partial match. In this case a confirmation subdialogue ensues.</Paragraph>
    <Paragraph position="3"> If a match is found, a geography server determines whether the name is ambiguous. If so, a disambiguating item (e.g., state name) is requested by the dialogue manager. The simulated user then randomly chooses from a list of candidate state names provided by the geography server. This utterance is currently also processed as a speak-and-spell utterance, mainly because we are interested in obtaining more data on the performance of our speak-and-spell system.</Paragraph>
    <Paragraph position="4"> Footnote 3: While DECTalk speech is artificial, we have not explicitly trained our recognizer on it, and thus we argue that it can serve as an effective stand-in for real human speech.</Paragraph>
    <Paragraph position="5"> If no match is found in either the short list or the external lexicon of known city names, another recognition cycle is initiated, in which the phonetic content of the spoken word is used to enhance the performance of the spelling recognizer, following procedures described in (Chung et al., 2003). A letter-to-sound model is used to map from a graph of letter hypotheses proposed by the first stage recognizer to their corresponding plausible pronunciations, using techniques described in (Seneff et al., 1996). The final set of hypotheses is obtained by merging hypotheses produced from both halves of the user utterance. Once again, both the short list and the large lexicon are searched for a match.</Paragraph>
    <Paragraph position="6"> The idea is that this second stage should only be invoked upon failure, in order to reduce the amount of computation time required. An alternative strategy would be for the system to unconditionally execute a second recognition to obtain a potentially more correct hypothesis.</Paragraph>
    <Paragraph position="7"> Such a strategy, however, would increase the system's overall processing time.</Paragraph>
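    <Paragraph> The control flow of this section (short list first, then the large lexicon, with the second recognition pass invoked only on failure) can be summarized in the sketch below. It is a simplification under stated assumptions: exact string matching stands in for the partial matching and spelling correction of the real system, the second pass is represented by a caller-supplied function, and all names are hypothetical.</Paragraph>

```python
def find_match(hypotheses, short_list, lexicon):
    """Return (city, source) for the first hypothesis found, preferring the short list."""
    for source, names in (("short_list", short_list), ("lexicon", lexicon)):
        for hyp in hypotheses:
            if hyp in names:
                return hyp, source
    return None


def recover_city(first_pass, run_second_pass, short_list, lexicon):
    """first_pass: N-best spellings; run_second_pass: callable returning a new N-best list."""
    match = find_match(first_pass, short_list, lexicon)
    if match:
        return match + ("pass1",)
    # The second recognition pass is only invoked on failure, to save computation.
    second_pass = run_second_pass()
    match = find_match(second_pass, short_list, lexicon)
    if match:
        return match + ("pass2",)
    # Last resort: propose the best second-pass hypothesis for user confirmation.
    return second_pass[0], "unmatched", "pass2"


# Toy data for illustration only.
print(recover_city(["ANCHORAG"], lambda: ["ANCHORAGE"], set(), {"ANCHORAGE"}))
# ('ANCHORAGE', 'lexicon', 'pass2')
```

    <Paragraph> Passing the second pass as a callable makes the cost argument explicit: the expensive recognition step runs only when the first pass yields no match.</Paragraph>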
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Results and Discussion
</SectionTitle>
      <Paragraph position="0"> The simulation was performed on a total of 97 user utterances, all of which MERCURY had designated as trouble situations in the original dialogues. The utterances utilized are those for which the system's hypotheses contained city names, whether or not the user had actually mentioned a city name.</Paragraph>
      <Paragraph position="1"> The simulation results are shown in Table 3. Out of 97 problematic sources and destinations generated by the simulated user, 58 required disambiguation with a state name (e.g., &quot;Boston in Georgia&quot;). Therefore, 155 speak-and-spell utterances were ultimately passed through the synthesize-and-recognize simulation cycle. All but one of the state names were correctly recognized. This high performance is likely due to the correct state's guaranteed existence in the short list used by the spelling checker.</Paragraph>
      <Paragraph position="2"> Our algorithm dictates that a second pass, which integrates the spoken name portion of the waveform with letter-to-sound hypotheses derived from the spoken spelling portion, be omitted if a match is found in the first pass. One question to ask is whether the system is being overconfident in this strategy. The results in the table support the notion of using the second pass sparingly. In 68 cases, the system was sufficiently confident with its hypothesized city after the first recognition pass to omit the second pass; it made no errors in these decisions.</Paragraph>
      <Paragraph position="3"> About a third of the time (29 cases), the system, finding no match, initiated a second pass to incorporate pronunciation information. There were two instances where the second-pass hypothesized city was found on the short list of focus cities from the original user utterance; both were correct. For the remainder, the large database was consulted. The system proposed the correct city in nearly 79% of the cases. After failing to find any match, the system attempted its last resort of proposing the best hypothesis from the second-stage recognizer. Not surprisingly, the system determined the correct city name in only 39% of these cases. Nevertheless, this percentage suggests that it is certainly better to perform the confirmation rather than to simply tell the user that the city is unknown, given that the recognizer may be correct without the aid of any external lexicon.</Paragraph>
      <Paragraph position="4"> [Table 3 caption (beginning lost): ... showing the number of correct cities hypothesized by the system, after each of two recovery passes. For pass 2, a match was found on the short list or in the geographic database. No match resulted in resorting to the best recognition hypothesis. Table body not recovered.]</Paragraph>
      <Paragraph position="5"> The majority of incorrect city hypotheses were due to limitations in the spelling checker and the absence of international names in the geographic database. The current spelling checker, while quite powerful, allows only a single insertion, deletion, or substitution of a letter, or a swap of two letters. We believe that a more robust spelling checker can minimize many of these errors.</Paragraph>
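      <Paragraph> One way to relax the single-edit limit is to compute an edit distance that also counts a swap of two letters and to accept candidates up to a configurable threshold. The sketch below uses the optimal string alignment variant of Damerau-Levenshtein distance; it assumes the swapped letters are adjacent, which the text does not state, and the function names and threshold parameter are illustrative rather than the authors' design.</Paragraph>

```python
def osa_distance(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, substitutions, and adjacent swaps."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap of two adjacent letters
    return d[-1][-1]


def candidates(spelling, lexicon, max_edits=1):
    """Lexicon entries within max_edits of the spelling, closest first."""
    scored = sorted((osa_distance(spelling, word), word) for word in lexicon)
    return [word for dist, word in scored if max_edits >= dist]


print(osa_distance("TEIPEI", "TAIPEI"))                             # 1
print(candidates("TENVOR", ["DENVER", "BOSTON"], max_edits=2))      # ['DENVER']
```

      <Paragraph> Raising max_edits recovers more misspellings but, as noted earlier for overgenerous correction, also raises the chance of proposing a wrong city, so any threshold would need to be validated against real data.</Paragraph>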
      <Paragraph position="6"> The system's performance in hypothesizing the correct candidate for nearly 89% of the problematic city names is encouraging. These results show that this error recovery strategy is largely successful in the synthesize-and-recognize user simulation cycle. The simulated results are, of course, biased in that the simulated user was co-operative with all system requests. The results of the MERCURY analysis in Section 4 show that an errorful or nonexistent response from a user is a very likely possibility. The installation of this strategy in a real system will require that user behavior be carefully monitored.</Paragraph>
      <Paragraph position="7"> Although the prospects for the speak-and-spell input mode are promising, we would not want to entirely abandon the use of the telephone keypad. It has been and remains a convenient and effective means by which to spell words. A more appropriate use of the keypad could be as a back-off strategy after the spoken spelling has failed, or in very noisy environments, where speech would be nearly impossible. One advantage of the keypad is that, barring mistakes, the system can be confident that when '3' is pushed, one of the letters, 'D', 'E', or 'F', is intended. When combined with the spoken word being spelled, such keypad ambiguity can be reduced even further (Chung et al., 2003).</Paragraph>
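      <Paragraph> To illustrate how keypad input and spoken spelling could constrain each other, the sketch below keeps only those spoken-spelling hypotheses whose letters agree with the keys pressed. This is a toy illustration of the idea and not the method of Chung et al. (2003); the function names and the N-best list are invented for the example.</Paragraph>

```python
# Standard telephone keypad letter groups.
KEYPAD = {"2": "ABC", "3": "DEF", "4": "GHI", "5": "JKL",
          "6": "MNO", "7": "PQRS", "8": "TUV", "9": "WXYZ"}


def consistent(spelling: str, digits: str) -> bool:
    """True if every letter belongs to the key pressed at the same position."""
    return len(spelling) == len(digits) and all(
        letter in KEYPAD.get(digit, "")
        for letter, digit in zip(spelling.upper(), digits))


def rerank(nbest, digits):
    """Keep spoken-spelling hypotheses that agree with the keypad entry, in N-best order."""
    return [hyp for hyp in nbest if consistent(hyp, digits)]


# Hypothetical N-best list containing acoustically confusable first letters.
print(rerank(["TENVER", "DENVER", "BENVER"], "336837"))  # ['DENVER']
```

      <Paragraph> Because each key narrows a letter to three or four options, acoustically confusable letters that sit on different keys are separated by the keypad entry alone.</Paragraph>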
    </Section>
  </Section>
</Paper>