<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1027"> <Title>The Tao of CHI: Towards Effective Human-Computer Interaction</Title>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Studies on Human-Computer Dialogues </SectionTitle>
<Paragraph position="0"> The first studies and descriptions of the particularities of dialogical human-computer interaction, then labeled computer talk in analogy to baby talk by Zoeppritz (1985), focused - much like subsequent ones - on:
- proving that a regular register exists for humans conversing with dialogue systems, e.g., Krause (1992) and Fraser (1993),
- describing the regularities and characteristics of that register, as in Kritzenberger (1992) or Darves and Oviatt (2002).</Paragraph>
<Paragraph position="1"> The results of these studies clearly show that such a register exists and that its regularities can be observed and replicated again and again. In general, this work focuses on the question: how does human verbal behavior change when people talk to computers as opposed to fellow humans? The questions that are not explicitly asked or studied are:
- how does the computer's way of communicating affect the human interlocutor, and
- do the particulars of computer-human interaction help to explain why today's conversational dialogue systems are by and large unusable?</Paragraph>
<Paragraph position="2"> In this paper we claim that this shift of perspective is of paramount importance, for example, to make sense of the phenomena observable during end-to-end evaluations of conversational systems. We designed our experiments and started our initial observations using one of the most advanced conversational dialogue research prototypes existing today, the SMARTKOM system (Wahlster et al., 2001). This system, designed for intuitive multimodal interaction, comprises a symmetric set of input and output modalities (Wahlster, 2003), together with an efficient fusion and fission pipeline (Wahlster, 2002). SMARTKOM features speech input with prosodic analysis, gesture input via infrared camera, and recognition of facial expressions and the emotional states they convey. On the output side, SMARTKOM employs a gesturing and speaking life-like character together with displayed generated text and multimedia graphical output. It currently comprises nearly 50 modules running on a parallel virtual machine-based integration software called MULTIPLATFORM (Herzog et al., 2003). As such it is certainly among the most advanced multi-domain conversational dialogue systems.</Paragraph>
<Paragraph position="3"> To the best of our knowledge, there has not been a single publication reporting a successful end-to-end evaluation of a conversational dialogue system with naive users. We claim that, given the state of the art of the dialogue management of today's conversational dialogue systems, evaluation trials with naive users will continue to uncover severe usability problems resulting in low task completion rates.1 Surprisingly, this occurs despite acceptable partial evaluation results. By partial results we mean evaluations of individual components, such as the word-error rate of automatic speech recognition or understanding rates, as conducted by Cox et al. (2000) or reported in Diaz-Verdejo et al. (2000). (Footnote 1: These problems can be diminished, however, if people have multiple sessions with the system and adapt to the respective system's behavior.)</Paragraph>
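As a concrete illustration of such a component-level metric, the following minimal sketch computes word-error rate as a word-level Levenshtein distance normalized by the reference length. It is a generic textbook rendering, not the tooling used in the cited evaluations; the function name and the example strings are our own.

```python
# Minimal word-error-rate (WER) sketch: edit distance between a
# reference transcript and an ASR hypothesis, normalized by the
# reference length. Illustrative only; not the evaluation code of
# Cox et al. (2000) or Diaz-Verdejo et al. (2000).

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("show me the castle", "show the castle please"))  # 0.5
```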
<Paragraph position="4"> As one of the reasons for the problems thwarting task completion, Beringer (2003) points to turn overtaking, which occurs when users rephrase a question or add a second remark while the system is still processing the first utterance. After such occurrences the dialogue becomes asynchronous: the system responds to the second-to-last user utterance, while in the user's mind that response concerns the last one. Given the current state of the art in the dialogue handling capabilities of HCI systems, this inevitably causes dialogues to fail completely.</Paragraph>
<Paragraph position="5"> We can already conclude from these informal findings that current state-of-the-art conversational dialogue systems suffer from a) a lack of turn-taking strategies and dialogue handling capabilities, as well as b) a lack of strategies for repairing dialogues once they become asynchronous.</Paragraph>
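To make the turn-overtaking failure mode concrete, here is a minimal detection sketch over a timestamped dialogue log. The (speaker, start, end) log format and the function are our own illustrative assumptions, not taken from Beringer (2003) or from any system discussed here.

```python
# Hedged sketch: flagging turn overtaking in a timestamped log.
# Each entry is (speaker, start_time, end_time) in seconds; the
# format is assumed for illustration, not taken from SMARTKOM.

from typing import List, Tuple

Turn = Tuple[str, float, float]  # (speaker, start, end)

def overtaken_turns(log: List[Turn]) -> List[Turn]:
    """Return user turns issued while an earlier user utterance was
    still awaiting a system response, i.e., before any intervening
    system turn."""
    overtaken = []
    pending = False  # a user utterance awaits a system response
    for speaker, start, end in sorted(log, key=lambda t: t[1]):
        if speaker == "user":
            if pending:  # second user turn before the system replied
                overtaken.append((speaker, start, end))
            pending = True
        else:  # simplification: any system turn answers the pending one
            pending = False
    return overtaken

log = [("user", 0.0, 2.1), ("user", 5.0, 6.4),  # rephrased question
       ("system", 7.0, 9.5)]                    # answers the first one
print(overtaken_turns(log))  # [('user', 5.0, 6.4)]
```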
<Paragraph position="6"> In human-human interaction (HHI), turn-taking strategies and their effects have been studied for decades in unimodal settings, from Duncan (1974) and Sacks et al. (1974) to Weinhammer and Rabold (2003), and more recently in multimodal settings, as in Sweetser (2003). Virtually no work exists on the turn-taking strategies that dialogue systems should pursue and on how these affect human-computer interaction, except in special cases such as Woodburn et al. (1991) for conversational computer-mediated communication aids for the speech- and hearing-impaired, or Shankar et al. (2000) for turn negotiation in text-based dialogue systems. The overview of classical HCI experiments and their results given in Wooffitt et al. (1997) also shows that problems such as turn overtaking, turn handling and turn repair have not been addressed by the research community.</Paragraph>
<Paragraph position="7"> In the following section we describe a new experimental paradigm and the first corresponding experiments tailored towards examining the effects of the computer's communicative behavior on its human partner. More specifically, we will analyze the differences between HHI and HCI/CHI turn-taking and dialogue management strategies, which, in the light of the end-to-end evaluation results described above, constitutes a promising starting point for examining those effects. The overall goal of analyzing these effects is to make future systems usable by endowing them with a more felicitous communicative behavior. After reporting on the results of the experiments in Section 4, we highlight a set of hypotheses that can be drawn from them and, in Section 6, point towards future experiments that need to be conducted to verify these hypotheses.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Experiments </SectionTitle>
<Paragraph position="0"> For conducting the experiments we developed a new paradigm for collecting telephone-based dialogue data, called the Wizard and Operator Test (WOT), which contains elements of both Wizard-of-Oz (WoZ) experiments (Francony et al., 1992) and Hidden Operator Tests (Rapp and Strube, 2002). This procedure also represents a simplification of classical end-to-end experiments, as it can - much like a WoZ experiment - be conducted without the technically very complex use of a real conversational system. As post-experimental interviews showed, this did not limit the subjects' feeling of authenticity regarding the simulated conversational system.</Paragraph>
<Paragraph position="1"> The WOT setup consists of two major phases that begin after subjects have been given a set of tasks to be solved with the telephone-based dialogue system:
- in Phase 1 the human assistant acts as a wizard simulating the dialogue system, much like in WoZ experiments, by operating a speech synthesis interface;
- in Phase 2, which starts immediately after a system breakdown has been simulated by means of beeping noises transmitted via the telephone, the human assistant acts as a human operator asking the subject to continue with the tasks.</Paragraph>
<Paragraph position="2"> This setup makes it possible to control for various factors. Most importantly, the technical performance (e.g., latency times), the pragmatic performance (e.g., understanding vs. non-understanding of user utterances) and the communicative behavior of the simulated system can be adjusted to resemble those of state-of-the-art dialogue systems. These factors can, of course, also be adjusted to simulate potential future capabilities of dialogue systems and to test their effects. The main point of the experimental setup, however, is to enable precise analyses of the differences in the communicative behavior of the various interlocutors, i.e., in human-human, human-computer and computer-human interaction.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Technical Setup </SectionTitle>
<Paragraph position="0"> During the experiment the subject and the assistant were in separate rooms. Communication between the two was conducted via telephone; on the subject's side, only a telephone was visible, next to a radio microphone for recording the subject's utterances. As shown in Figure 1, the assistant/operator room featured a telephone as well as two computers - one for the speech synthesis interface and one for collecting all audio streams; also present were loudspeakers for feeding the speech synthesis output into the telephone and a microphone for recording the synthesis and operator output. With the help of an audio mixer, all linguistic data were recorded time-synchronously and stored in a single audio file. The assistant/operator acting as the computer system communicated by selecting fitting answers to the subject's requests from a prefabricated list covering the scope of the SMARTKOM answer repertoire, which - despite the more conversational nature of the system - still does not include any kind of dialogue-structuring or feedback particles. These responses were returned via speech synthesis through the telephone.</Paragraph>
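The following sketch illustrates, under stated assumptions, what such a Phase-1 wizard interface might minimally look like: the assistant selects a canned answer, which is then handed to speech synthesis. The answer list and the synthesize stub are invented for illustration and do not reproduce the actual SMARTKOM answer repertoire.

```python
# Illustrative Phase-1 wizard console: the assistant picks a canned
# answer, which is routed to speech synthesis. The answers and the
# synthesis stub are hypothetical, not the SMARTKOM repertoire.

CANNED_ANSWERS = {
    "1": "The castle is open daily from nine to five.",
    "2": "There are three museums in the old town.",
    "3": "I did not understand your request. Please rephrase it.",
}

def synthesize(text: str) -> None:
    # Stand-in for the TTS engine feeding the telephone line.
    print(f"[TTS -> phone] {text}")

def wizard_loop() -> None:
    while True:
        for key, answer in CANNED_ANSWERS.items():
            print(f"{key}: {answer}")
        choice = input("answer number (q to quit)> ").strip()
        if choice == "q":
            break
        if choice in CANNED_ANSWERS:
            synthesize(CANNED_ANSWERS[choice])

if __name__ == "__main__":
    wizard_loop()
```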
<Paragraph position="1"> Beyond that, it was possible for the wizard to communicate with the subjects directly over the telephone when acting as the human operator.</Paragraph>
<Paragraph position="2"> Figure 1: In Phase 1 the synthesized speech is fed out of the loudspeakers in the operator room (left) through the phone into the subject room (right); in Phase 2 the two humans communicate directly via the phone.</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 The Experiments </SectionTitle>
<Paragraph position="0"> The experiments were conducted with an English setup, subjects and assistants in the United States of America, and with a German setup, subjects and assistants in Germany. The two experiments were otherwise identical, and 22 sessions were recorded in each. At the beginning of the WOT, the test manager told the subjects that they were testing a novel telephone-based dialogue system that supplies tourist information on the city of Heidelberg. In order to avoid the usual paraphrases of tasks worded too specifically, the manager gave the subjects an overall list of 20 very general tourist activities, such as visit museum or eat out, from which each subject had to pick the six tasks to be solved in the experiment. The manager then removed the original list, dialed the system's number on the phone and exited the room after handing over the telephone receiver. The subject was always greeted by the system's standard opening: Welcome to the Heidelberger tourist information system. How may I help you? After three tasks were finished (some successfully, some not), the assistant simulated the system's breakdown, entered the line by saying Excuse me, something seems to have happened with our system, may I assist you from here on, and finished the remaining three tasks with the subjects.</Paragraph> </Section> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Results </SectionTitle>
<Paragraph position="0"> The PARADISE framework (Walker et al., 1997; Walker et al., 2000) proposes distinct measurements for dialogue quality, dialogue efficiency and task success. The remaining criterion, user satisfaction, is based on questionnaires and interviews with the subjects and cannot be extracted (semi-)automatically from logfiles. The analyses of the experiments described herein focus mainly on dialogue efficiency metrics in the sense of Walker et al. (2000). As we will show below, our findings strongly suggest that a felicitous dialogue is not only a function of dialogue quality, but critically hinges on a minimal threshold of efficiency and on overall dialogue management as well. While these criteria are orthogonal to the Walker et al. (2000) criteria for measuring dialogue quality, such as recognition rates and the like, we regard them as an integral part of an aggregate view on dialogue quality and efficiency, referred to herein as dialogue felicity. For examining dialogue felicity we provide detailed analyses of efficiency metrics per se, as well as additional metrics capturing the number and effect of pauses, the employment of feedback and turn-taking signals, and the amount of overlap.</Paragraph>
<Paragraph position="1"> The Data: The dialogues lasted on average 5 minutes for the German (G) and 6 minutes for the English (E) sessions. The subjects featured approximately proportional mixtures of gender (25m, 18f), age (12 to 71) and computer expertise. Table 1 shows the duration and turns per phase of the experiment.</Paragraph>
<Paragraph position="2"> Measurements: First of all, we apply the classic Walker et al. (2000) metric for measuring dialogue efficiency, calculating the number of turns over dialogue length. Figure 2 shows the discrepancy between the dialogue efficiency in Phase 1 (HCI) and Phase 2 (HHI). As this discrepancy might be accountable to latency times alone, we calculated the same metric with and without pauses. For these analyses, pauses are very conservatively defined as silences during the conversation that exceed one second. The German results are shown in Figure 4, and, as shown in Figure 5, the same patterns hold cross-linguistically in the English experiments. The overall comparison, given in Table 2, shows that - as one would expect - latency times severely decrease dialogue efficiency, but that they alone do not account for the difference in efficiency between human-human and human-computer interaction. This means that even if latency times were to vanish completely, yielding actual real-time performance, we would still observe less efficient dialogues in HCI.</Paragraph>
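As a hedged rendering of this efficiency metric, the sketch below computes turns per minute from a timestamped log, once over the raw dialogue length and once with silences longer than one second removed; the log format is the same illustrative assumption as in the earlier sketch.

```python
# Hedged sketch of the dialogue-efficiency metric in the sense of
# Walker et al. (2000): turns per minute, computed over the raw
# dialogue length and again with pauses (silences > 1 s) removed.

from typing import List, Tuple

Turn = Tuple[str, float, float]  # (speaker, start, end) in seconds

def efficiency(log: List[Turn], strip_pauses: bool = False) -> float:
    log = sorted(log, key=lambda t: t[1])
    length = log[-1][2] - log[0][1]  # total dialogue length
    if strip_pauses:
        for (_, _, end), (_, nxt_start, _) in zip(log, log[1:]):
            gap = nxt_start - end
            if gap > 1.0:  # conservative pause definition (> 1 s)
                length -= gap
    return len(log) / (length / 60.0)  # turns per minute

log = [("user", 0.0, 3.0), ("system", 7.0, 12.0), ("user", 13.0, 15.0)]
print(efficiency(log))                     # 12.0 turns/min
print(efficiency(log, strip_pauses=True))  # ~16.4 once the 4 s gap is removed
```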
<Paragraph position="3"> While it is obvious that the existing latency times increase the number and length of pauses in the computer's interactions as compared to the human operator's, there is no equally obvious reason why the number and length of pauses in the human subjects' interactions should differ between Phase 1 and Phase 2. However, as shown in Table 3, they do differ substantially.</Paragraph>
<Paragraph position="4"> Next to this pause effect, which contributes greatly to the dialogue efficiency metrics by increasing dialogue length, we have to take a closer look at the individual turns and their nature. While some turns carry propositional information and constitute utterances proper, a significant number consists solely of specific particles used to exchange signals between the communicative partners, or of combinations thereof. We differentiate between dialogue-structuring signals and feedback signals in the sense of Yngve (1970). Dialogue-structuring signals - such as hesitations like hmm or ah, as well as expressions like well, yes or so - mark the intent to begin or end an utterance, or to make corrections or insertions. Feedback signals - while sometimes phonetically alike, e.g., right, yes or hmm - do not express the intent to take over or give up the speaking role, but rather serve as a means of staying in contact with the speaker, which is why they are sometimes referred to as contact signals.</Paragraph>
<Paragraph position="5"> In order to differentiate between the two - for example, between an agreeing feedback yes and a dialogue-structuring one - all dialogues were annotated manually. Half of the data were annotated by separate annotators, yielding an inter-annotator agreement of [...]. The resulting counts for the user utterances in Phases 1 and 2 are shown in Table 4. Not shown in Table 4 are the particle counts for the computer, since they are zero, and those for the human operator in the HHI dialogues, as they resemble those of his human interlocutor.</Paragraph>
<Paragraph position="6"> Again, the findings for German and English are comparable. We find that feedback particles almost vanish from the human-computer dialogues - a finding that corresponds to those described in Section 2. This linguistic behavior, in turn, constitutes an adaptation to the respective interlocutor's employment of such particles. Striking, however, is that the human subjects still attempted to send dialogue-structuring signals to the computer, which - unfortunately - would have been ignored by today's &quot;conversational&quot; dialogue systems.3 (Footnote 3: In the English data the subjects' employment of dialogue-structuring particles in HCI even slightly surpassed that of HHI.)</Paragraph>
<Paragraph position="7"> Before turning towards an analysis of these data, we examine the overlaps that occurred throughout the dialogues. Most overlaps in human-human conversation occur during turn changes, with the remainder being feedback signals uttered during the other interlocutor's turn (Jefferson, 1983). The results of measuring the amount of overlap in our experiments are given in Table 5. Overall, the HHI dialogues featured significantly more overlap than the HCI ones, which is partly due to the respective presence and absence of feedback signals, and partly due to the fact that in HCI turn changes are accompanied by pauses rather than by immediate, overlapping hand-overs.</Paragraph>
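The amount of overlap can be measured directly from the same kind of timestamped log; the following sketch, again under our assumed log format, sums the time during which two turns run in parallel.

```python
# Hedged sketch: total overlapped speech in a timestamped log,
# i.e., the time during which two turns run in parallel (same
# assumed (speaker, start, end) format as in the earlier sketches).

from typing import List, Tuple

Turn = Tuple[str, float, float]

def overlap_seconds(log: List[Turn]) -> float:
    log = sorted(log, key=lambda t: t[1])
    total = 0.0
    for i, (_, s1, e1) in enumerate(log):
        for _, s2, e2 in log[i + 1:]:
            if s2 >= e1:
                break  # later turns start even later; no overlap left
            total += min(e1, e2) - s2  # length of the overlapping span
    return total

log = [("operator", 0.0, 4.0), ("user", 3.5, 6.0)]  # hand-over overlap
print(overlap_seconds(log))  # 0.5
```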
<Paragraph position="8"> Lastly, our experiments yielded negative findings concerning differences in type-token ratio (denoting the lexical variation of forms), speech production errors (false starts, repetitions, etc.) and syntax. This means that there was no statistically significant difference in linguistic behavior with respect to these factors. We regard this finding as strengthening our conclusion (see Section 6) that emulating human syntactic and semantic behavior does not suffice to guarantee effective, and therefore felicitous, human-computer interaction.</Paragraph> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 An Analysis of Ineffective Computer-Human Interaction </SectionTitle>
<Paragraph position="0"> The results presented above enable a closer look at dialogue efficiency as one of the key factors influencing overall dialogue felicity. As our experiments show, the difference between the efficiency of the human-human dialogues and that of the human-computer dialogues is not solely due to the computer's response times. There is a significant amount of white noise, for example, as users keep waiting after the computer has finished responding. We see these behaviors as the result of a mismanaged dialogue. In many cases users are simply unsure whether the system's turn has ended or not, and consequently wait much longer than necessary.</Paragraph>
<Paragraph position="1"> The situation is equally bad at the other end of the turn-taking spectrum: after a user has handed the turn over to the computer, there is no signal or acknowledgment that the computer has taken on the baton and is running along with it - regardless of whether the user's utterance was understood or not. Insecurities regarding the main question - whose turn is it anyway? - become very noticeable when users try to establish contact, e.g., by saying hello - pause - hello. This kind of behavior certainly does not happen in HHI, even where we find long silences.</Paragraph>
<Paragraph position="2"> Examining why silences in human-human interaction are unproblematic, we find that these silences have been announced, e.g., by the human operator employing linguistic signals such as just a moment please or well, I'll have to have a look in our database, in order to communicate that he is holding on to the turn and finishing his round.</Paragraph>
<Paragraph position="3"> To push the relay analogy even further, we can look at the differences in overlap as another indication of crucial dialogue inefficiency. Since most overlaps occur at turn boundaries and thus ensure a smooth (and fast) hand-over, their absence constitutes another indication of why we are far from having winning systems.</Paragraph> </Section> </Paper>