<?xml version="1.0" standalone="yes"?>
<Paper uid="W01-1622">
  <Title>Adding extra input/output modalities to a spoken dialogue system</Title>
  <Section position="5" start_page="2" end_page="2" type="metho">
    <SectionTitle>
2 Shortcomings of speech-only interfaces
</SectionTitle>
    <Paragraph position="0"> One of the issues that all dialogue systems with spoken input have to cope with is the imperfection of the speech recogniser. Even in very limited domains and with a small vocabulary speech recognition is never 100% accurate, if only because people may use OoD (Out of Domain) or OoV (Out of Vocabulary) words. To ensure that the user does not end up with wrong information, all slot values entered by the user must be confirmed. This can be done either explicitly in a separate question or implicitly, i.e. incorporated in the next question. Explicit confirmation results in a lot of extra turns, which means that the dialogue becomes less efficient and is often perceived as tedious, especially if all user utterances are understood correctly. Implicit confirmation, by contrast, does not necessarily increase the number of turns. However, it appears that users have difficulty in grasping the concept of implicit confirmation [Sturm, 1999]. Things run smoothly as long as the information to be confirmed is correct. If the speech recognition result is incorrect and wrong input expressions are confirmed implicitly, users tend to get confused and fail to repair the mistake that was made by the speech recogniser.</Paragraph>
    <Paragraph position="1"> In order to reduce the need for confirmation, confidence measures may be used. A confidence score is an estimate of how certain one can be that the recognition result is indeed correct. Using confidence scores in combination with one or more thresholds, would for instance allow to decide upon 1) ignoring the recognition result (if the confidence is minimal), 2) confirming the recognition result or 3) accepting the recognition result without confirmation (if the confidence is maximal). Unfortunately, it is virtually impossible to define thresholds in such a way that no false accepts (a user utterance is actually misrecognised but has a confidence score that exceeds the threshold) and no false rejects (user input was recognised correctly but has a confidence score that falls below the threshold) are caused. False rejects are not very harmful, although they do cause superfluous confirmation questions, and thus reduce the efficiency of the dialogue. False accepts, however, may become disastrous for the dialogue, since they cause incorrect values to be accepted without any confirmation. As a consequence, this strategy does not seem very attractive for speech-only systems. null Another problem with speech-only information systems is the way in which the eventual information is presented to the user. Shadowing experiments with different railway information systems indicate that users have difficulties understanding and writing down a travel advice presented in spoken form, especially if one or more transfers are involved [Claassen, 2000].</Paragraph>
    <Paragraph position="2"> Last, and perhaps foremost, it appears that users have difficulty in building a correct mental model of the functionality and the status of a speech-only system. This lack of understanding explains problems with exceptions handling, and the user's uncertainty as to what one can (or perhaps must) say at any given moment.</Paragraph>
  </Section>
  <Section position="6" start_page="2" end_page="2" type="metho">
    <SectionTitle>
3 Multimodality in MATIS
</SectionTitle>
    <Paragraph position="0"> The first goal of the MATIS project is to investigate to what extent graphical output along with speech prompts can solve the problems that are due to the lack of a consistent mental model. If, for example, recognition results are not only confirmed (implicitly) in speech prompts for additional input, but also displayed in the corresponding field on the screen, detecting recognition errors may become easier. The same should hold for navigation through the list of possible connections that is returned after the input is complete and a database query can be performed. null If no keyboard is available speech is ideal for making selections from long implicit lists, such as the departure city. However, other fields in a form may offer only a small number of options, which can easily be displayed on a screen. In the railway information system this holds for the switch that identifies the time as departure or arrival time (and to a large extent also for entering the date, which usually is today or tomorrow).</Paragraph>
    <Paragraph position="1"> Selections from short lists are most easily made by means of point-and-click operations. Therefore, we decided to add this input mode to speech input.</Paragraph>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.1 System Overview
</SectionTitle>
      <Paragraph position="0"> Our multimodal railway information system is an extended version of the mixed-initiative speech-only railway information system (OVIS) developed in the NWO-TST programme  . This is a very different starting point from most other projects in multimodal human-machine interaction, that seem to add speech to what is basically a user-driven desktop application. The user interface consists of a telephone handset in combination with a screen and a mouse. The MATIS system inherited an architecture in which modules communicate with each other using TCP socket connections under the control of a central module (Phrisco) (cf. Figure 1). The grey shaded modules have been added or extended for  In the next sections we will focus on the modules that have been added or changed and how these modules help to solve some of the problems described in Section 2.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.2 Screen output
</SectionTitle>
      <Paragraph position="0"> At the start of a dialogue an empty form is shown on the screen. In the course of the dialogue the fields are filled with the values provided by the user, who can use speech to fill all five slots in the form in a mixed-initiative dialogue, or use the mouse to select text fields and to make list selections. Once all slots have been filled, a travel advice is retrieved from the data-base and presented to the user in spoken and in textual form.</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.3 Mouse input
</SectionTitle>
      <Paragraph position="0"> Experiments have been conducted using a Wizard of Oz simulation of the MATIS system, to establish to what extent subjects use the mouse in addition to speech and in what way mouse input is used in an interaction that is essentially the original mixed-initiative spoken dialogue [Terken, 2001]. It appeared that half of the subjects used mouse input as well as speech input and that mouse input was primarily used to make selections from short lists, and much less to select editable text fields. The latter was done mostly in the context of error correction.</Paragraph>
    </Section>
    <Section position="4" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.4 Confidence calculation
</SectionTitle>
      <Paragraph position="0"> Confidence measures (CM) for spoken input can be calculated in different ways. In the MATIS system the CM is based on an N-best list of sentence hypotheses that is generated by the speech recogniser [Ruber, 1997]. This N-best confidence score rests on the assumption that words that occur in more entries in the N-best list are more likely to be correct:</Paragraph>
      <Paragraph position="2"> ) is the likelihood score of sentence hypothesis i in the N-best list. In this manner a CM is calculated for each word in the utterance.</Paragraph>
      <Paragraph position="3"> The N-best CM may give rise to a specific problem: if the N-best list contains only one entry, (1) automatically yields a maximum confidence score for each word in the utterance. Off-line experiments have shown that 3% of all N-best lists consisting of only one sentence actually contained recognition errors. Consequently, even if we only trust words with a maximum CM score, the false accept rate will be at least 3%. Other off-line experiments have shown that some improvement may be expected from combining the N-best CM with another CM that does not have this artefact.</Paragraph>
      <Paragraph position="4"> When a user fills a specific slot in the form using speech (s)he has to indicate which slot needs to be filled and provide a value for this slot. To obtain a CM for the slot value, the CMs of all words that were used to specify this value have to be combined. In the current implementation this was done by taking their mean.</Paragraph>
    </Section>
    <Section position="5" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.5 Multimodal Input Event Handler
</SectionTitle>
      <Paragraph position="0"> The information coming from the NLP module (in response to a spoken prompt) and from the mouse (that is active all the time) must be properly combined. This task is taken care of by the multimodal input event handler. To combine the information streams correctly, a time stamp must be attached to the inputs, indicating the temporal interval in which the action took place. This time interval is needed to decide which events should be combined [Oviatt, 1997].</Paragraph>
      <Paragraph position="1"> Furthermore, speech and mouse input may contain complementary, redundant or unrelated information. Complementary information (e.g.</Paragraph>
      <Paragraph position="2"> clicking on the 'destination' field and saying 'Rotterdam') is unified before it is sent to the dialogue manager. Unrelated information (e.g.</Paragraph>
      <Paragraph position="3"> clicking to select departure time while saying one or more station names) is first merged and then sent to the dialogue manager. In the case of redundant information (e.g. clicking on 'tomorrow' while saying 'tomorrow'), the information coming from the mouse is used to adapt the CM score attached to the speech input. Due to speech recognition errors, 'redundant' information may be conflicting (if the recogniser returns 'tomorrow' in the same time slot where 'today' is clicked). To solve this problem the information with the highest CM score will be trusted.</Paragraph>
    </Section>
    <Section position="6" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.6 Dialogue management
</SectionTitle>
      <Paragraph position="0"> The dialogue manager of the unimodal system was adapted in order to be able to use the CMs to decide on the confirmation strategy. In the present prototype we use only one threshold to decide upon the strategy. Values with a CM score below the threshold are shown on the screen and confirmed explicitly in the spoken dialogue. Values with a CM score exceeding the threshold are only shown on the screen. In case all or most values have a high CM score, this strategy speeds up the dialogue considerably.</Paragraph>
      <Paragraph position="1"> Preliminary experiments suggest that providing feedback visually as well as orally helps the user</Paragraph>
      <Paragraph position="3"> to develop an adequate model of the system.</Paragraph>
      <Paragraph position="4"> Also, since the user knows exactly what the information status of the system is at each point in the dialogue, correcting errors should be easier, which in turn will result in more effective dialogues. We are convinced that an increase in effectiveness and efficiency can be achieved, especially if the visual output is combined with auditory prompts that are more concise than in the speech-only system.</Paragraph>
    </Section>
    <Section position="7" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
3.7 Multimodal Output Event Handler
</SectionTitle>
      <Paragraph position="0"> In a multimodal system a decision has to be made as to whether the feedback to the user must be presented orally, visually, or in both ways. This is the task of the multimodal output event handler. For the time being we have decided to send all the output from the dialogue manager to the natural language generation module and the screen.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>