<?xml version="1.0" standalone="yes"?>
<Paper uid="E99-1051">
  <Title>Robust and Flexible Mixed-Initiative Dialogue for Telephone Services</Title>
  <Section position="4" start_page="287" end_page="287" type="evalu">
    <SectionTitle>
3 EXPERIMENTAL RESULTS
</SectionTitle>
    <Paragraph position="0"> In order to test the improvements over our original system (described in (Alvarez et al., 1996)) we designed a simulated evaluation environment where the performance of the Speech Recognition Module (recognition rate) was artificially controlled.</Paragraph>
    <Paragraph position="1"> A Wizard of Oz simulation environment was designed to obtain different levels of recognition performance for a vocabulary of 1170 words: 96.4% word recognition rate for high performance and 80% for low performance. A pre-defined single fixed mixed-initiative strategy was used in all the cases.</Paragraph>
    <Paragraph position="2"> We used an annotated data base composed of 50 dialogues with 50 different novice users and 6 different simple telephone tasks in each dialogue: 25 dialogues were simulated using 94.6% recognition rate and 25 with 80%. Performance results were obtained using the PARADISE evaluation framework (Walker et al., 1998), determining the contributions of task success and dialogue cost to user satisfaction. Therefore as task success measure me obtained the Kappa coefficient while dialogue cost measures were based on the number of users turns. In this case it is important to point out that as each tested dialogue is composed of a set of six different tasks which have quantify different number of turns, the number of turns for each task was normalized to it's N(x) = ~+----~ score  and high ASR. And separately for each Group A and B, only in high ASR situation User satisfaction in Table 1 was obtained as a cumulative satisfaction score for each dialogue by summing the scores of a set of questions similar t,o those proposed in (Walker et al., 1998). The ANOVA for Kappa, the cost measure and user satisfaction demostrated a significant effect of ASR performance. As it could be predicted, we found that in all cases a low recognition rate corresponds to a dramatical decrease in the absolute number of suscessfully completed tasks and an important increase in the average number of utterances.</Paragraph>
    <Paragraph position="3"> However we also found that in high ASR situation the task success measure of Kappa was surprisingly low.</Paragraph>
    <Paragraph position="4"> A closer inspection of the dialogues in Table 1 revealed that this low performance under high ASR situations was due to the presence of two groups of users. A first group, Group A, showed a &amp;quot;fluent&amp;quot; interaction with the system, similar to the one supposed by the mixed-initiative strategy (for example, as an answer to the question of the system &amp;quot;do you want to do any other task?&amp;quot;, these users could answer something like &amp;quot;yes, I would like to send a message to John Smith&amp;quot;). While the other group of users, Group B, exibited a very restrictive interaction with the system (for example, a short answer &amp;quot;yes&amp;quot; for the same question). As a conclusion of this first evaluation we found that in order to increase the permormance of our baseline system, two major points should be addressed: a) robustness against recognition and parser errors, and b) more flexibility to be able to deal with different user models.</Paragraph>
    <Paragraph position="5"> Therefore we designed an adaptive strategy to adapt our dialogue manager to Group A or B of users and to High and Low ASR situations. The adaptation was done based on linear discrimination, as it is ilustrated in Figure 2, using both the average number of turns and recognition errors from the two first tasks in each dialogue.</Paragraph>
    <Paragraph position="6">  ASR situations and for both in low ASR.</Paragraph>
    <Paragraph position="7"> Table 2 shows mean results for each Group A and B of users for High ASR performance, and for all users in Low ASR situations. These results show a more stable behaviour of the system, that is, less difference in performance between users of Group A and Group B and, although to a lower extend, between high and low recognition rates.</Paragraph>
  </Section>
class="xml-element"></Paper>