<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1624">
  <Title>An Experiment Setup for Collecting Data for Adaptive Output Planning in a Multimodal Dialogue System</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Experiment Setup
</SectionTitle>
    <Paragraph position="0"> We describe here some of the details of the experiment. The experimental setup is shown schematically in Figure 1. There are five people involved in each session of the experiment: an experiment leader, two transcribers, a user and a wizard.</Paragraph>
    <Paragraph position="1"> The wizards play the role of an MP3 player application and are given access to a database of information (but not actual music) on more than 150,000 music albums (almost 1 million songs), extracted from the FreeDB database.</Paragraph>
    <Paragraph position="2"> [Footnote 2: Severity describes the number of hypotheses indicated by the wizard: having no interpretation, an uncertain interpretation, or several ambiguous interpretations.] [Figure 2 caption, beginning lost in extraction: ...cation, as seen by the wizard. First level of choice: what to display.]</Paragraph>
    <Paragraph position="3"> Figure 2 shows an example screenshot of the music database as it is presented to the wizard. Subjects are given a set of predefined tasks and are told to accomplish them by using an MP3 player with a multimodal interface. Tasks include playing songs/albums and building playlists, where the subject is given varying amounts of information to help them find or decide on which song to play or add to the playlist. In part of the session the users also get a primary driving task, using a Lane Change driving simulator [Mattes, 2003]. This enabled us to test the viability of combining a primary and a secondary task in our experiment setup. We also aimed to gain initial insight regarding the difference in interaction flow under such conditions, particularly with regard to multimodality. The wizards can speak freely and display the search result or the playlist on the screen. The users can also speak as well as make selections on the screen.</Paragraph>
    <Paragraph position="4"> The user's utterances are immediately transcribed by a typist and also recorded. The transcription is then presented to the wizard. We did this for two reasons: (1) to deprive the wizards of information encoded in the intonation of utterances, because our system will not have access to it either; and (2) to be able to corrupt the user input in a controlled way, simulating understanding problems at the acoustic level. Unlike [Stuttle et al., 2004], who simulate automatic speech recognition errors using phone-confusion models, we used a tool that "deletes" parts of the transcribed utterances, replacing them by three dots. Word deletion was triggered by the experiment leader. The word deletion rate varied: 20% of the utterances were weakly corrupted and 20% strongly corrupted. In 60% of the cases the wizard saw the transcribed speech uncorrupted.</Paragraph>
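The corruption step described above can be illustrated with a minimal sketch. It is not the tool used in the experiment: in the study, deletion was triggered by the experiment leader, whereas this sketch draws the corruption level at random to reproduce the 60% / 20% / 20% split. The per-word deletion rates for "weak" and "strong", and the names LEVELS, pick_level and corrupt, are assumptions made for illustration only.

```python
import random

# (level name, share of utterances, per-word deletion rate).
# The shares come from the text; the per-word rates are hypothetical.
LEVELS = [("none", 0.6, 0.0), ("weak", 0.2, 0.2), ("strong", 0.2, 0.5)]


def pick_level(rng):
    """Draw a corruption level with probabilities 60% / 20% / 20%."""
    names = [name for name, _, _ in LEVELS]
    weights = [share for _, share, _ in LEVELS]
    return rng.choices(names, weights=weights, k=1)[0]


def corrupt(utterance, level, rng):
    """Replace words by three dots at the level's rate, merging adjacent gaps."""
    rate = {name: r for name, _, r in LEVELS}[level]
    out = []
    for word in utterance.split():
        token = "..." if rate > rng.random() else word
        if token == "..." and out and out[-1] == "...":
            continue  # keep a single "..." for a run of deleted words
        out.append(token)
    return " ".join(out)


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so a run can be repeated
    for text in ["show me the songs by Prince", "add the first one to my playlist"]:
        level = pick_level(rng)
        print(level, "->", corrupt(text, level, rng))
```

Passing the random generator explicitly keeps the corruption reproducible, which matters if corrupted transcripts are later compared with the logged originals.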
    <Paragraph position="5"> [Figure 3 caption, beginning lost in extraction: ...ing options for screen output to the wizard for second level of choice: what to display and how.]</Paragraph>
    <Paragraph position="6"> The wizard's utterances are also transcribed (and recorded) and presented to the user via a speech synthesizer. There are two reasons for doing this. One is to maintain the illusion for the subjects that they are actually interacting with a system, since it is known that there are differences between human-human and human-computer dialogue [Duran et al., 2001], and we want to elicit behavior in the latter condition. The other is that synthesized speech is imperfect and sometimes difficult to understand, and we wanted to reproduce this condition.</Paragraph>
    <Paragraph position="7"> The transcription is also supported by a typing and spelling correction module to minimize speech synthesis errors and thus help maintain the illusion of a working system.</Paragraph>
    <Paragraph position="8"> Since it would be impossible for the wizard to construct layouts for screen output on the fly, he gets support for his task from the WOZ system: when the wizard performs a database query, a graphical interface presents him with a first level of output alternatives, as shown in Figure 2. The choices are the found (i) albums, (ii) songs, or (iii) artists. For a second level of choice, the system automatically computes four possible screens, as shown in Figure 3. The wizard can choose one of the offered options to display to the user, or decide to clear the user's screen. Otherwise, the user's screen remains unchanged. It is therefore up to the wizard to decide whether to use speech only, display only, or to combine speech and display.</Paragraph>
    <Paragraph position="9"> The types of screen output are (i) a simple text message conveying how many results were found, (ii) a list of just the names (of albums, songs or artists) with the corresponding number of matches (for songs) or length (for albums), (iii) a table of the complete search results, and (iv) a table of the complete search results that displays only a subset of columns. For each screen output type, the system uses heuristics based on the search to decide, e.g., which columns should be displayed. These four screens are presented to the wizard in different quadrants on a monitor (cf. Figure 3), allowing for selection with a simple mouse click. The heuristics for deciding what to display implement preliminary strategies we designed for our system. We are aware that, due to the use of these heuristics, the wizard's output realization may not always be ideal. We have collected feedback from both the wizards and the users in order to evaluate whether the output options were satisfactory (cf. Section 4 for more details).</Paragraph>
    <Paragraph position="10"> Technical Setup. To keep our experimental system modular and flexible, we implemented it on the basis of the Open Agent Architecture (OAA) [Martin et al., 1999], which is a framework for integrating a community of software agents in a distributed environment. Each system module is encapsulated by an OAA wrapper to form an OAA agent, which is able to communicate with the OAA community. The experimental system consists of 12 agents, all of them written in Java. We made use of an OAA monitor agent, which comes with the current OAA distribution, to trace all communication events within the system for logging purposes.</Paragraph>
    <Paragraph position="11"> The setup ran distributed over six PCs running different versions of Windows and Linux.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Collected Data and Experience
</SectionTitle>
    <Paragraph position="0"> The SAMMIE-2 corpus (footnote 6) collected in this experiment contains data from 24 different subjects, each of whom participated in one session with one of our six wizards. Each subject worked on four tasks, the first two without driving and the last two with driving.</Paragraph>
    <Paragraph position="1"> The duration was restricted to two 15-minute periods. Tasks were of two types: searching for a title either in the database or in an existing playlist, and building a playlist satisfying a number of constraints. Each of the two sets for each subject contained one task of each type. The tasks also differed in how specific the provided information was. We aimed to keep the difficulty level constant across users. The interactions were carried out in German (footnote 7). The data for each session consists of a video and audio recording and a log file. Besides the transcriptions of the spoken utterances, a number of other features have been annotated automatically in the log files of the experiment, e.g., the wizard's database query and the number of results found, the type and form of the presentation screen chosen by the wizard, etc. The logging information gathered for a single experiment session consists of the communication events in chronological order, each marked by a timestamp. Based on this information, we can reconstruct the number of turns and the specific times that were necessary to accomplish a user task. We expect to use this data to analyze correlations between queries, numbers of results, and spoken and graphical presentation strategies.</Paragraph>
    <Paragraph position="2"> [Footnote 6, beginning lost in extraction: ...teraction Experiment. We have so far conducted two series of data-collection experiments: SAMMIE-1 involved only spoken interaction (cf. [Kruijff-Korbayová et al., 2005] for more details), SAMMIE-2 is the multimodal experiment described in this paper.]</Paragraph>
    <Paragraph position="3"> [Footnote 7: However, most of the titles and artist names in the music database are in English.]</Paragraph>
    <Paragraph position="4"> Whenever the wizard made a clarification request, the experiment leader invoked a questionnaire window on the screen, where the wizard had to classify his clarification request according to the primary source of the understanding problem. At the end of each task, users were asked to what extent they believed they accomplished their tasks and how satisfied they were with the results. Similar to methods used by [Skantze, 2003] and [Williams and Young, 2004], we plan to include subjective measures of task completion and correctness of results in our evaluation matrix, as task descriptions can be interpreted differently by different users. Each subject was interviewed immediately after the session. The wizards were interviewed once the whole experiment was over. The interviews were carried out verbally, following a prepared list of questions. We present below some of the points gathered through these interviews.</Paragraph>
    <Paragraph position="5"> Wizard Interviews. All six wizards rated the overall understanding as good, i.e., communication was completed successfully. However, they reported difficulties due to delays in utterance transmission in both directions, which caused unnecessary repetitions due to unintended turn overlap.</Paragraph>
    <Paragraph position="6"> There were differences in how different wizards rated and used the different screen output options. The table containing most of the information about the queried song(s) or album(s) was rated best and shown most often by some wizards, while others thought it contained too much information and would not be clear to the users at first glance, and hence used it less or never. The screen option containing the least information in tabular form, namely only a list of songs/albums with their length, received opposing judgments: some of the wizards found it useless because it contained too little information and thus did not use it, while others found it very useful because it would not confuse the user by presenting too much information and thus used it frequently. Finally, the screen containing a text message conveying only the number of matches, if any, was hardly used by the wizards. The differences in the wizards' opinions about what the users would find useful clearly indicate the need to evaluate the usefulness of the different screen output options in particular contexts from the users' viewpoint.</Paragraph>
    <Paragraph position="7"> When showing screen output, the most common pattern used by the wizards was to tell the user what was shown (e.g., I'll show you the songs by Prince), and to display the screen.</Paragraph>
    <Paragraph position="8"> Some wizards adapted to the user's requests: if asked to show something (e.g., Show me the songs by Prince), they would show it without verbal comments; but if asked a question (e.g., What songs by Prince are there? or What did you find?), they would show the screen output and answer in speech.</Paragraph>
    <Paragraph position="9"> Concerning the adaptation of multimodal presentation strategies w.r.t. whether the user was driving or not, four of the six wizards reported that they consciously used speech instead of screen output if possible when the user was driving.</Paragraph>
    <Paragraph position="10"> The remaining two wizards did not adapt their strategy.</Paragraph>
    <Paragraph position="11"> On the whole, interviewing the wizards brought valuable information on presentation strategies and the use of modalities, but we expect to gain even more insight after the annotation and evaluation of the collected data. Besides observations about the interaction with the users, the wizards also gave us various suggestions concerning the software used in the experiment, e.g., the database interface (e.g., the possibility to decide between strict search and search for partial matches, and fuzzy search looking for items with similar spelling when no hits are found), the screen options presenter (e.g., ordering of columns w.r.t. their order in the database interface, the possibility to highlight some of the listed items), and the speech synthesis system.</Paragraph>
    <Paragraph position="12"> Subject Interviews. In order to use the wizards' behavior as a model for interaction design, we need to evaluate the wizards' strategies. We used user satisfaction, task experience, and multi-modal feedback behavior as evaluation metrics.</Paragraph>
    <Paragraph position="13"> The 24 experimental subjects were all native speakers of German with good English skills. They were all students (equally spread across subject areas), half of them male and half female, and most of them were between 20 and 30 years old.</Paragraph>
    <Paragraph position="14"> In order to calculate user satisfaction, users were interviewed to evaluate the system's performance with a user satisfaction survey. The survey probed different aspects of the users' perception of their interaction with the system. We asked the users to evaluate a set of five core metrics on a 5-point Likert scale. We followed the definition of [Walker et al., 2002], which takes overall user satisfaction to be the sum of text-to-speech synthesis performance, task ease, user expertise, overall difficulty and future use. The mean user satisfaction across all dialogues was 15.0 (with a standard deviation of 2.9) (footnote 8). A one-way ANOVA for user satisfaction between wizards (df=5, F=1.52, p=0.05) shows no significant difference across wizards, meaning that the system performance was judged to be about equally good for all wizards.</Paragraph>
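As a worked illustration of the two computations in the paragraph above, the following sketch sums five Likert ratings into one satisfaction score and computes the F statistic of a one-way ANOVA over per-wizard score lists. It is a sketch under stated assumptions, not the analysis script used in the study: the metric keys and the function names satisfaction and one_way_anova are hypothetical, and the example numbers are invented, not taken from the corpus.

```python
from statistics import mean

# Hypothetical keys for the five core metrics named in the text.
METRICS = ("tts_performance", "task_ease", "user_expertise",
           "overall_difficulty", "future_use")


def satisfaction(ratings):
    """Sum the five 1-5 Likert ratings into one score between 5 and 25."""
    return sum(ratings[m] for m in METRICS)


def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of score lists, one per wizard."""
    scores = [s for g in groups for s in g]
    grand = mean(scores)
    # Variation of the group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    # Variation of the individual scores around their own group mean.
    ss_within = sum((s - mean(g)) ** 2 for g in groups for s in g)
    df_between = len(groups) - 1
    df_within = len(scores) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return f_value, df_between, df_within


if __name__ == "__main__":
    # Invented example data: three wizards with three sessions each.
    print(satisfaction({m: 3 for m in METRICS}))
    print(one_way_anova([[14, 15, 16], [13, 15, 17], [15, 16, 17]]))
```

With six wizards and 24 subjects, df_between is 5, which matches the df=5 reported above, and df_within is 18. Obtaining a p-value from F additionally requires the F distribution, which this sketch leaves out.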
    <Paragraph position="15"> To measure task experience we elicited data on perceived task success and satisfaction on a 5-point Likert scale after each task was completed. For all the subjects the final perceived task success was 4.4 and task satisfaction 3.9 across the 4 tasks each subject had to complete. For task success as well as for task satisfaction no significant variance across wizards was detected.</Paragraph>
    <Paragraph position="16"> Furthermore the subjects were asked about the employed multi-modal presentation and clarification strategies.</Paragraph>
    <Paragraph position="17"> The clarification strategies employed by the wizards seemed to be successful: from the subjects' point of view, mutual understanding was very good and the few misunderstandings could be easily resolved. Nevertheless, in the case of disambiguation requests and when grounding an utterance, subjects asked for more display feedback. It is interesting to note that subjects judged understanding difficulties on higher levels of interpretation (especially reference resolution problems and problems with interpreting the intention) to be more costly than problems on lower levels of understanding (like the acoustic understanding). For the clarification strategy this implies that the system should engage in clarification at the lowest level at which an error was detected (footnote 9). [Footnote 8: [Walker et al., 2002] reported an average user satisfaction of 16.2 for 9 Communicator systems.]</Paragraph>
    <Paragraph position="18"> Multi-modal presentation strategies were perceived to be helpful in general, having a mean of 3.1 on a 5-point Likert scale. However, the subjects reported that too much information was being displayed, especially for the tasks with driving. 85.7% of the subjects reported that the screen output was sometimes distracting them. 76.2% of the subjects would prefer more verbal feedback, especially while driving. On a 3-point Likert scale, subjects evaluated the amount of information presented verbally to be about right (mean of 1.8), whereas they found the information presented on the screen to be too much (mean of 2.3). Studies by [Bernsen and Dybkjaer, 2001] on the appropriateness of using verbal vs. graphical feedback for in-car dialogues indicate that the need for text output is very limited. Some subjects in that study, as well as subjects in our study, report that they would prefer not to have to use the display at all while driving. On the other hand, subjects in our study perceived the screen output to be very helpful in less stressful driving situations and when not driving (e.g., for memory assistance, clarifications etc.). Especially when they want to verify whether a complex task was finally completed (e.g., building a playlist), they ask for a displayed proof. For modality selection in in-car dialogues, the driver's mental workload on primary and secondary task has to be carefully evaluated with respect to a situation model.</Paragraph>
    <Paragraph position="19"> With respect to multi-modality, subjects also asked for more personalized data presentation. We therefore need to develop intelligent ways to reduce the amount of data being displayed. This could build on prior work on the generation of "tailored" responses in spoken dialogue according to a user model [Moore et al., 2004].</Paragraph>
    <Paragraph position="20"> The results for multi-modal feedback behavior showed no significant variations across wizards except for the general helpfulness of multi-modal strategies. An ANOVA Planned Comparison of the wizard with the lowest mean against the other wizards showed that his behavior was rated significantly worse. It is interesting to note that this wizard was using the display less than the others. We might consider excluding the 4 sessions with this wizard from our output generation model.</Paragraph>
    <Paragraph position="21"> We also tried to analyze in more detail how the wizards' presentation strategies influenced the results. The option chosen most often was to present a table with the search results (78.6%); presenting a list was chosen in only 17.5% of the cases and text only in 0.04%. The wizards' choices varied significantly only for the table option. The wizard who was rated lowest for multimodality used the table option less, indicating that this option should be used more often. This is also supported by the fact that the show-table option is the only presentation strategy which is positively correlated with how the user evaluated multimodality (Spearman's r = 0.436*). We also found a 2-tailed correlation between user satisfaction and multimodality judgment (Spearman's r = 0.658**). This indicates the importance of good multimodal presentation strategies for user satisfaction. [Footnote 9: Note that engaging at the lowest level just helps to save dialogue "costs". Other studies have shown that user satisfaction is higher for strategies that would "hide" the understanding error by asking questions on higher levels [Skantze, 2003], [Raux et al., 2005].]</Paragraph>
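For readers unfamiliar with the statistic used above, the following sketch computes Spearman's rank correlation as the Pearson correlation of rank vectors, with tied values receiving the average of the ranks they span. It is a minimal sketch, not the authors' analysis code: the function names average_ranks and spearman are hypothetical, the example ratings are invented, and no significance test is included.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of the ranks they span."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start != len(order):
        end = start
        # Extend the run while the next value ties with the current one.
        while end + 1 != len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for i in order[start:end + 1]:
            ranks[i] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def spearman(xs, ys):
    """Pearson correlation of the two rank vectors, i.e. Spearman's coefficient."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


if __name__ == "__main__":
    # Invented example: per-subject satisfaction sums and multimodality ratings.
    print(spearman([15, 12, 18, 14, 16], [3, 2, 5, 3, 4]))
```

Average ranks matter here because Likert ratings produce many ties. The function raises a ZeroDivisionError when one of the inputs is constant, since the coefficient is undefined in that case.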
    <Paragraph position="22"> Finally, the subjects were asked for their own comments. They liked being able to provide vague information, e.g., to ask for "an oldie", and were expecting collaborative suggestions. They also appreciated collaborative proposals based on inferences made from previous conversations.</Paragraph>
    <Paragraph position="23"> In sum, on the measures for user satisfaction, task experience, and multi-modal feedback strategies, the subjects' judgments show a positive trend. The dialogue strategies employed by most of the wizards seem to be a good starting point for building a baseline system. Furthermore, the results indicate that intelligent multi-modal generation needs to be adaptive to user and situation models.</Paragraph>
  </Section>
</Paper>