<?xml version="1.0" standalone="yes"?>
<Paper uid="W05-1624">
<Title>An Experiment Setup for Collecting Data for Adaptive Output Planning in a Multimodal Dialogue System</Title>
<Section position="5" start_page="0" end_page="0" type="concl">
<SectionTitle>5 Conclusions and Future Steps</SectionTitle>
<Paragraph position="0"> We have presented an experiment setup that enables us to gather multimodal interaction data aimed at studying not only the behavior of the users of the simulated system, but also that of the wizards. To simulate a dialogue system interaction, the wizards were shown only transcriptions of the user utterances, sometimes corrupted to simulate automatic speech recognition problems. The wizards' utterances were likewise transcribed and presented to the user through a speech synthesizer. To make it possible for the wizards to produce contextually varied screen output in real time, we included a screen output planning module that automatically calculated several screen output versions every time the wizard ran a database query. The wizards were free to speak and/or display screen output; the users were free to speak or select on the screen. During part of each session, the user was occupied by a primary driving task.</Paragraph>
<Paragraph position="1"> The main challenge for an experiment setup as described here is the considerable delay between user input and wizard response. This is due partly to the transcription and spelling correction step, and partly to the time it takes the wizard to decide on and enter a database query, then select a presentation and speak to the user in parallel. We have yet to analyze the exact distribution of time needed for these tasks. The process could be sped up in several ways. Transcription can be eliminated either by using speech recognition and dealing with its errors, or by applying signal processing software, e.g., to filter prosodic information out of the user utterance and/or to transform the wizard's utterance into synthetic-sounding speech (e.g., using a vocoder). Database search can also be sped up in a number of ways, ranging from allowing selection directly from the transcribed text to automatically preparing default searches by analyzing the user's utterance. Note, however, that the latter will most likely bias the wizard toward the proposed search.</Paragraph>
<Paragraph position="2"> We plan to annotate the corpus, most importantly with respect to wizard presentation strategies and the context features relevant for the choice between them. We also plan to compare these presentation strategies to the strategies in speech-only mode, for which we collected data in an earlier experiment (cf. [Kruijff-Korbayová et al., 2005]).</Paragraph>
<Paragraph position="5"> For clarification strategies, previous studies have already shown that the decision process needs to be highly dynamic, taking into account various features such as interpretation uncertainties and local utility [Paek and Horvitz, 2000]. We plan to use the wizard data to learn an initial multimodal clarification policy and later apply reinforcement learning methods to the problem in order to account for long-term dialogue goals, such as task success and user satisfaction.</Paragraph>
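As an editorial illustration only, the following minimal Python sketch shows one way such a policy could be learned with tabular Q-learning; the state features, action set, and reward weights are hypothetical and not part of the original experiment.

    # Hypothetical sketch: Q-learning over coarse dialogue states for choosing
    # among clarification/presentation actions. Not the authors' implementation.
    import random
    from collections import defaultdict

    ACTIONS = ["ask_clarification", "show_screen_list", "speak_summary"]

    # Q-values indexed by (state, action); a state is a coarse feature tuple,
    # e.g. (asr_confidence_bucket, n_db_matches_bucket, driving_flag).
    Q = defaultdict(float)
    ALPHA, GAMMA, EPSILON = 0.1, 0.9, 0.2

    def reward(task_success, user_satisfaction, turn_penalty=0.05):
        # Long-term dialogue goals: task success and user satisfaction,
        # with a small per-turn cost to discourage overly long dialogues.
        return task_success + user_satisfaction - turn_penalty

    def choose_action(state):
        # Epsilon-greedy: exploit the current estimate most of the time.
        if random.random() >= EPSILON:
            return max(ACTIONS, key=lambda a: Q[(state, a)])
        return random.choice(ACTIONS)

    def q_update(state, action, r, next_state, terminal=False):
        # Standard one-step Q-learning backup.
        best_next = 0.0 if terminal else max(Q[(next_state, a)] for a in ACTIONS)
        Q[(state, action)] += ALPHA * (r + GAMMA * best_next - Q[(state, action)])

An initial policy could be bootstrapped from the annotated wizard decisions (e.g., by biasing the Q-values toward the actions wizards actually chose in comparable states) before applying the update above to logged or simulated dialogues.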
<Paragraph position="6"> The screen output options used in the experiment will also be employed in the baseline system we are currently implementing. The challenges there are to decide (i) when to produce screen output, (ii) what (and how) to display, and (iii) what the corresponding speech output should be. We will analyze the corpus in order to determine suitable strategies.</Paragraph>
</Section>
</Paper>