<?xml version="1.0" standalone="yes"?> <Paper uid="A00-2027"> <Title>Evaluating Automatic Dialogue Strategy Adaptation for a Spoken Dialogue System</Title>
<Section position="5" start_page="205" end_page="208" type="evalu"> <SectionTitle> 4 Results and Discussion </SectionTitle>
<Paragraph position="0"> Based on the features described above, we compared MIMIC and the control systems, MIMIC-SI and MIMIC-MI, along three dimensions: performance features, in which comparisons were made using previously proposed features relevant to system performance (e.g., (Price et al., 1992; Simpson and Fraser, 1993; Danieli and Gerbino, 1995; Walker et al., 1997)); discourse features, in which comparisons were made using characteristics of the resulting dialogues; and initiative distribution, in which the initiative characteristics of all dialogues involving MIMIC from both experiments were examined.</Paragraph>
<Section position="1" start_page="205" end_page="206" type="sub_section"> <SectionTitle> 4.1 Performance Features </SectionTitle>
<Paragraph position="0"> For our performance evaluation, we first applied a three-way analysis of variance (ANOVA) (Cohen, 1995) to each feature using three factors: system version, order, and task difficulty.7 If no interaction effects emerged, we compared system versions using paired sample t-tests.8 Following the PARADISE evaluation scheme (Walker et al., 1997), we divided performance features into four groups.</Paragraph>
<Paragraph position="1"> ... engineers, or linguists, and none had prior knowledge of MIMIC.</Paragraph>
<Paragraph position="2"> 5 We used the exact same set of tasks rather than designing tasks of similar difficulty levels because we intended to compare all available features between the two system versions, including ASR word error rate, which would have been affected by the choice of movie/theater names in the tasks.</Paragraph>
<Paragraph position="3"> 6 Although the vast majority of tasks were completed in one call, some subjects, when unable to make progress, did not change strategies as in (41)-(49) in Figure 3; instead, they hung up and started the task over.</Paragraph>
<Paragraph position="4"> For both experiments, the ANOVAs showed no interaction effects among the controlled factors. Tables 1(a) and 1(b) summarize the results of the paired sample t-tests based on performance features, where features that differed significantly between systems are shown in italics.9</Paragraph>
<Paragraph position="5"> 9 ... initiative and automatic adaptation capabilities. We assess these effects based on comparisons between system versions when no interaction effects emerged from the ANOVA tests using the factors system version, order, and task difficulty. Effects based on system order and task difficulty alone are beyond the scope of this paper.</Paragraph>
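To make the statistical procedure above concrete, the following is a minimal sketch of the analysis pipeline: a three-way ANOVA whose interaction terms are inspected first, followed by a paired sample t-test between system versions. It assumes one row per dialogue in a pandas DataFrame; the file name and column names (subject, system, order, difficulty, user_turns) are hypothetical and are not taken from the paper.

```python
import pandas as pd
from scipy import stats
import statsmodels.api as sm
from statsmodels.formula.api import ols

# One row per dialogue; all column names here are hypothetical.
# 'system' is the version used (e.g., MIMIC vs. MIMIC-SI), 'order' is the
# order in which the subject used the two versions, 'difficulty' is the task
# difficulty level, and 'user_turns' stands in for any performance feature.
df = pd.read_csv("dialogues.csv")

# Three-way ANOVA with factors system version, order, and task difficulty.
model = ols("user_turns ~ C(system) * C(order) * C(difficulty)", data=df).fit()
print(sm.stats.anova_lm(model, typ=2))  # inspect interaction terms first

# If no interaction effects emerge, compare the two system versions with a
# paired sample t-test, pairing each subject's dialogues across versions
# (pivot_table averages over a subject's dialogues for each version).
pivot = df.pivot_table(index="subject", columns="system", values="user_turns")
t_stat, p_value = stats.ttest_rel(pivot["MIMIC"], pivot["MIMIC-SI"])
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```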
<Paragraph position="6"> These results show that, when compared with either control system, users were more satisfied with MIMIC10 and that MIMIC helped users complete tasks more efficiently: users were able to complete tasks in fewer turns and in a more timely manner using MIMIC. When comparing MIMIC and MIMIC-MI, dialogues involving MIMIC had a lower timeout rate. When MIMIC detected cues signaling anomalies in the dialogue, it adapted strategies to provide assistance, which, in addition to leading to fewer timeouts, saved users time and effort when they did not know what to say. In contrast, users interacting with MIMIC-MI had to iteratively reformulate questions until they obtained the desired information from the system, leading to more timeouts (see (41)-(49) in Figure 3). However, when comparing MIMIC and MIMIC-SI, even though users accomplished tasks more efficiently with MIMIC, the resulting dialogues contained more timeouts. As opposed to MIMIC-SI, which always prompted users for one piece of information at a time, MIMIC typically provided more open-ended prompts when the user had task initiative. Even though this required more effort on the user's part in formulating utterances and led to more timeouts, MIMIC quickly adapted strategies to assist users when recognized cues indicated that they were having trouble.</Paragraph>
<Paragraph position="7"> 10 The range of user satisfaction scores was 8-40 for experiment one and 9-45 for experiment two.</Paragraph>
<Paragraph position="8"> To sum up, our experiments show that both MIMIC's mixed-initiative and automatic adaptation aspects resulted in better performance along the dialogue efficiency and system usability dimensions. Moreover, its adaptation capabilities contributed to better performance in terms of dialogue quality. MIMIC, however, did not contribute to higher performance in the task success dimension: in our movie information domain, the tasks were sufficiently simple that all but one user in each experiment achieved a 100% task success rate.</Paragraph>
</Section>
<Section position="2" start_page="206" end_page="207" type="sub_section"> <SectionTitle> 4.2 Discourse Features </SectionTitle>
<Paragraph position="0"> Our second evaluation dimension concerns characteristics of the resulting dialogues. We analyzed features of user utterances in terms of utterance length and the cues observed, and features of system utterances in terms of dialogue acts. For each feature, we again applied a three-way ANOVA test, and if no interaction effects emerged, we performed a paired sample t-test to compare system versions.</Paragraph>
<Paragraph position="1"> The cues detected in user utterances provide insight into both user intentions and system capabilities. The cues that MIMIC automatically detects are a subset of those discussed in (Chu-Carroll and Brown, 1998):11</Paragraph>
<Paragraph position="2"> * TakeOverTask: triggered when the user provides more information than expected; an implicit indication that the user wants to take control of the problem-solving process.</Paragraph>
<Paragraph position="3"> * NoNewInfo: triggered when the user is unable to make progress toward task completion, either because the user does not know what to say or because the ASR engine fails to recognize the user's utterance.</Paragraph>
<Paragraph position="4"> * InvalidAction/InvalidActionResolved: triggered when the user utterance makes an invalid assumption about the domain and when the invalid assumption is corrected, respectively.</Paragraph>
<Paragraph position="5"> * AmbiguousAction/AmbiguousActionResolved: triggered when the user query is ambiguous and when the ambiguity is resolved, respectively.</Paragraph>
<Paragraph position="6"> 11 A subset of these cues corresponds loosely to previously proposed evaluation metrics (e.g., (Danieli and Gerbino, 1995)). However, our system automatically detects these features instead of requiring manual annotation by experts.</Paragraph>
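To make the cue definitions above concrete, the toy sketch below shows how such cues might be triggered from simple properties of a turn. This is only a schematic reading of the definitions, not MIMIC's actual detection mechanism (which follows Chu-Carroll and Brown (1998)); all data structures and field names are invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class TurnState:
    """Hypothetical summary of one user turn and the system's expectations."""
    expected_attributes: set    # attributes the system prompted for
    filled_attributes: set      # attributes recognized in the user utterance
    query_interpretations: int  # number of readings of the user query
    valid_in_domain: bool       # whether the query's presuppositions hold

def detect_cues(state: TurnState) -> set:
    cues = set()
    if state.filled_attributes - state.expected_attributes:
        cues.add("TakeOverTask")     # user supplied more than was asked for
    if not state.filled_attributes:
        cues.add("NoNewInfo")        # no progress (silence or ASR failure)
    if not state.valid_in_domain:
        cues.add("InvalidAction")    # query rests on an invalid assumption
    if state.query_interpretations > 1:
        cues.add("AmbiguousAction")  # query admits more than one reading
    return cues

# Example: the system asked only for a theater, but the user also named a movie.
print(detect_cues(TurnState({"theater"}, {"theater", "movie"}, 1, True)))
# -> {'TakeOverTask'}
```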
<Paragraph position="7"> Tables 2(a) and (b) summarize the results of the paired sample t-tests based on user utterance features, where features whose numbers of occurrences differed significantly according to the system version used are shown in italics.12 Table 2(a) shows that users expected the system to adapt its strategies when they attempted to take control of the dialogue. Even though MIMIC-SI did not behave as expected, users continued their attempts, resulting in significantly more occurrences of TakeOverTask in dialogues with MIMIC-SI than with MIMIC. Furthermore, the average sentence length in dialogues with MIMIC was only 1.5 words per turn longer than in dialogues with MIMIC-SI, providing further evidence that users preferred to provide free-form queries, regardless of the system version used.</Paragraph>
<Paragraph position="8"> 12 Since system dialogue acts are often selected based on cues detected in user utterances, we only discuss the results of our user utterance feature analysis, using the dialogue act analysis results as additional support for our conclusions.</Paragraph>
<Paragraph position="9"> Table 2(b) shows that MIMIC was more effective at resolving dialogue anomalies than MIMIC-MI. More specifically, there were significantly fewer occurrences of NoNewInfo in dialogues with MIMIC than with MIMIC-MI. In addition, while the number of occurrences of AmbiguousAction was not significantly different for the two systems, the number that were resolved (AmbiguousActionResolved) was significantly higher in interactions with MIMIC than with MIMIC-MI. Since NoNewInfo and AmbiguousAction both prompted MIMIC to adapt strategies and, as a result, provide additional useful information, the user was able to quickly resolve the problem at hand. This is further supported by the higher frequency of the system dialogue act GiveOptions in MIMIC (p=0), which provides helpful information based on dialogue context.</Paragraph>
<Paragraph position="10"> In sum, the results of our discourse feature analysis further confirm the usefulness of MIMIC's adaptation capabilities. Comparisons with MIMIC-SI provide evidence that MIMIC's ability to give up initiative better matched user expectations.
Moreover, comparisons with MIMIC-MI show that MIMIC's ability to opportunistically take over initiative resulted in dialogues in which anomalies were more efficiently resolved and progress toward task completion was more consistently made.</Paragraph>
</Section>
<Section position="3" start_page="207" end_page="208" type="sub_section"> <SectionTitle> 4.3 Initiative Analysis </SectionTitle>
<Paragraph position="0"> Our final analysis concerns the distribution of task initiative in our adaptive system in relation to the features previously discussed. For each dialogue involving MIMIC, we computed the percentage of turns in which MIMIC had task initiative and the correlation coefficient (r) between this initiative percentage and each performance/discourse feature. To determine whether a correlation was significant, we applied Fisher's r-to-z transform and then performed a conventional Z test (Cohen, 1995).</Paragraph>
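As a sketch of this significance test, the snippet below computes the Pearson correlation between one feature and the per-dialogue initiative percentage, applies Fisher's r-to-z transform, and performs a two-sided Z test. The function and variable names are illustrative and not taken from the paper.

```python
import numpy as np
from scipy import stats

def initiative_correlation(feature_values, initiative_pct):
    """Correlate a performance/discourse feature with the percentage of turns
    in which the system had task initiative, and test the correlation for
    significance via Fisher's r-to-z transform and a conventional Z test."""
    r, _ = stats.pearsonr(feature_values, initiative_pct)
    n = len(feature_values)
    z = np.arctanh(r)              # Fisher's r-to-z transform
    se = 1.0 / np.sqrt(n - 3)      # standard error of the transformed value
    z_score = z / se               # Z statistic under H0: rho = 0
    p_value = 2 * stats.norm.sf(abs(z_score))
    return r, z_score, p_value
```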
<Paragraph position="1"> Tables 3(a) and (b) summarize the correlations of the performance and discourse features, respectively, with the percentage of turns in which MIMIC had task initiative.13 Again, the correlations that are statistically significant are shown in italics. Table 3(a) shows a strong positive correlation between task initiative distribution and the number of user turns as well as the elapsed time of the dialogues. Although earlier results (Table 1(a)) show that dialogues in which the system always had task initiative tended to be longer, we believe that this correlation also suggests that MIMIC took over task initiative more often in longer dialogues, i.e., those in which the user was more likely to be having difficulty. Table 3(a) further shows a moderate correlation between task initiative distribution and the ASR rejection rate as well as the ASR word error rate. It is possible that such a correlation exists because ASR performance worsens when MIMIC takes over task initiative. However, in that case, we would have expected the results in Section 4.1 to show that the ASR rejection and word error rates for MIMIC-SI are significantly greater than those for MIMIC, which are in turn significantly greater than those for MIMIC-MI, since in MIMIC-SI the system always had task initiative and in MIMIC-MI the system never took over task initiative.</Paragraph>
<Paragraph position="2"> To the contrary, Tables 1(a) and 1(b) showed that the differences in ASR rejection rate and ASR word error rate were not significant between system versions, and Table 1(b) showed that the ASR word error rate for MIMIC-MI was in fact substantially higher than that for MIMIC. This suggests that the causal relationship runs the other way, i.e., MIMIC's adaptation capabilities allowed it to opportunistically take over task initiative when ASR performance was poor.</Paragraph>
<Paragraph position="3"> Table 3(b) shows that all cues are positively correlated with task initiative distribution. For AmbiguousAction, InvalidAction, and NoNewInfo, this correlation exists because observation of these cues contributed to MIMIC having task initiative. However, note that AmbiguousActionResolved has a stronger positive correlation with task initiative distribution than does AmbiguousAction, again indicating that MIMIC's adaptive strategies contributed to more efficient resolution of ambiguous actions.</Paragraph>
<Paragraph position="4"> In brief, our initiative analysis lends additional support to the conclusions drawn in our performance and discourse feature analyses and provides new evidence for the advantages of MIMIC's adaptation capabilities.</Paragraph>
<Paragraph position="5"> In addition to taking over task initiative when previously identified dialogue anomalies were encountered (e.g., detection of ambiguous or invalid actions), our analysis shows that MIMIC took over task initiative when ASR performance was poor, allowing the system to better constrain user utterances.14</Paragraph>
</Section> </Section> </Paper>