
<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2136">
  <Title>Confirmation in Multimodal Systems</Title>
  <Section position="4" start_page="824" end_page="825" type="metho">
    <SectionTitle>
2 QUICKSET
</SectionTitle>
    <Paragraph position="0"> This section describes QuickSet, a suite of agents for multimodal human-computer communication [4, 5].</Paragraph>
    <Section position="1" start_page="824" end_page="824" type="sub_section">
      <SectionTitle>
2.1 A Multi-Agent Architecture
</SectionTitle>
      <Paragraph position="0"> Underneath the QuickSet suite of agents lies a distributed, blackboard-based, multi-agent architecture based on the Open Agent Architecture' [23]. The blackboard acts as a repository of shared information and facilitator. The agents rely on it for brokering, rre.ssage distribution, and notification.</Paragraph>
      <Paragraph position="1"> ' qlac Open Agent Architecture is a tmde~ of SRI International.</Paragraph>
    </Section>
    <Section position="2" start_page="824" end_page="825" type="sub_section">
      <SectionTitle>
2.2 The QuickSet Agents
</SectionTitle>
      <Paragraph position="0"> The following section briefly summarizes the responsibilities of each agent, their interaction, and the results of their computation.</Paragraph>
      <Paragraph position="1">  The user draws on and speaks to the interface (see Figure 2 for a snapshot of the interface) to place objects on the map, assign attributes and behaviors to them, and ask questions about them.</Paragraph>
      <Paragraph position="2">  The gesture recognition agent recognizes gestures from strokes drawn on the map. Along with coordinate values, each stroke from the user interface provides contextual information about objects touched or encircled by the stroke. Recognition results are an n-best list (top n-ranked) of interpretations. The interpretations are encoded as typed feature structures [5], which represent each of the potential semantic contributions of the gesture. This list is then passed to the multimodal integrator.</Paragraph>
      <Paragraph position="3">  The Whisper speech recognition engine from Microsoft Corp. [24] drives the speech recognition agent. It offers speaker-independent, continuous recognition in close to real time. QuickSet relies upon a context-free domain grammar, specifically designed for each application, to constrain the speech recognizer. The speech recognizer  agent's output is also an n-best list of hypotheses and their probability estimates. These results are passed on for natural language interpretation.</Paragraph>
      <Paragraph position="4">  The natural language interpretation agent parses the output of the speech recognizer attempting to provide meaningful semantic interpretations based on a domain-specific grammar. This process may introduce further ambiguity; that is, more hypotheses. The results of parsing are, again, in the form of an n-best list of typed feature structures. When complete, the results of natural language interpretation are passed to the integrator for multimodal integration.</Paragraph>
      <Paragraph position="5">  The multimodal integration agent accepts typed feature structures from the gesture and natural language interpretation agents, and unifies them \[5\]. The process of integration ensures that modes combine according to a multimodal language specification, and that they meet certain multimodal timing and command-specific constraints. These constraints place limits on when different input can occur, thus reducing errors \[7\]. If after unification and constraint satisfaction, there is more than one completely specified command, the agent then computes the joint probabilities for each and passes the feature structure with the highest to the bridge. If, on the other hand, no completely specified command exists, a rrr.ssage is sent to the user interface, asking it to inform the user of the non-understanding.</Paragraph>
      <Paragraph position="6">  The bridge agent acts as a single message-based interface to domain applications. When it receives a feature structure, it sends a message to the appropriate applications, requesting that they execute the command.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="825" end_page="826" type="metho">
    <SectionTitle>
3 CONFIRMATION STRATEGIES
</SectionTitle>
    <Paragraph position="0"> Quickset supports two modes of confmnation: early, which uses the speech recognition hypothesis; and late, which renders the confirmation act graphically using the entire integrated multimodal command. These two modes are detailed in the following subsections.</Paragraph>
    <Section position="1" start_page="825" end_page="825" type="sub_section">
      <SectionTitle>
3.1 Early Confirmation
</SectionTitle>
      <Paragraph position="0"> Under the early confirmation strategy (see Figure 3), speech and gesture are immediately passed to their respective recognizers (la and lb). Electronic ink is used for immediate visual feedback of the gesture input. The highest-scoring speech-recognition hypothesis is returned to the user interface and displayed for confirmation (2). Gesture recognition results are forwarded to the integrator after processing (4).</Paragraph>
      <Paragraph position="1">  After confirmation of the speech, Quickset passes the selected sentence to the parser (3) and the process of integration follows (4). If, during confirmation, the system fails to present the correct spoken interpretation, users are given the choice of selecting it from a pop-up menu or respeaking the command (see Figure 2).</Paragraph>
    </Section>
    <Section position="2" start_page="825" end_page="826" type="sub_section">
      <SectionTitle>
3.2 Late Confirmation
</SectionTitle>
      <Paragraph position="0"> In order to meet the user's expectations, it was proposed that confmmtions occur after integration of the  Figure 5 is a snapshot of QuickSet in late confirmation mode. The user is indicating the placement of checkpoints on the terrain. She has just touched the map with her pen, while saying &amp;quot;YELLOW&amp;quot; to name the next checkpoint. In response, QuickSet has combined the gesture with the speech and graphically presented the  To confu'm or disconfima an object in either mode, the user can push either the SEND (checkrnark) or the E~,S~. (eraser) buttons, respectively. Altematively, to confn-rn the command in late confirmation mode, the user can rely on implicit confirmation, wherein QuickSet treats non-contradiction as a confirrnation [25-27]. In other words, if the user proceeds to the next command, she implicitly confLrrns the previous command.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="826" end_page="827" type="metho">
    <SectionTitle>
4 EXPERIMENTAL METHOD
</SectionTitle>
    <Paragraph position="0"> This section describes this experiment, its design, and how data were collected and evaluated.</Paragraph>
    <Section position="1" start_page="826" end_page="826" type="sub_section">
      <SectionTitle>
4.1 Subjects, Tasks, and Procedure
</SectionTitle>
      <Paragraph position="0"> Eight subjects, 2 male and 6 female adults, half with a computer science background and half without, were recruited from the OGI campus and asked to spend one hour using a prototypical system for disaster rescue planning.</Paragraph>
      <Paragraph position="1"> During training, subjects received a set of written instructions that described how users could interact with the system. Before each task, subjects received oral instructions regarding how the system would request confirmations. The subjects were equipped with microphone and pen, and asked to perform 20 typical commands as practice prior to data collection. They performed these cornrnands in one of the two confLrmation modes. After they had completed either the flood or the f'Lre scenario, the other scenario was introduced and the remaining cortfirmation mode was explained. At this time, the subject was given a chance to practice commands in the new confirmation mode, and then conclude the experiment.</Paragraph>
    </Section>
    <Section position="2" start_page="826" end_page="826" type="sub_section">
      <SectionTitle>
4.2 Research Design and Data Capture
</SectionTitle>
      <Paragraph position="0"> The research design was within-subjects with a single factor, confirmation mode, and repeated measures. Each of the eight subjects completed one fire-fighting and one flood-control rescue task, composed of approximately the same number and types of commands, for a strict recipe of about 50 multimodal commands. We counterbalanced the order of confm'nation mode and task, resulting in four different task and confwmation mode orderings.</Paragraph>
    </Section>
    <Section position="3" start_page="826" end_page="827" type="sub_section">
      <SectionTitle>
4.3 Transcript Preparation and Coding
</SectionTitle>
      <Paragraph position="0"> The QuickSet user interface was videotaped and microphone input was recorded while each of the subjects interacted with the system. The following dependent measures were coded from the videotaped sessions: time to complete each task, and the number of commands and repairs.</Paragraph>
      <Paragraph position="1"> 4.3.1 7qme to complete task The total elapsed time in minutes and seconds taken to complete each task was rrr.asured: from the first contact of the pen on the interface until the task was complete.  The number of commands attempted for each task was tabulated. Some subjects skipped commands, and most tended to add commands to each task, typically to navigate on the map (e.g., &amp;quot;PAN&amp;quot; and &amp;quot;ZOOM&amp;quot;). If the system misunderstood, the subjects were asked to attempt a command up to three times (repair), then proceed to the next one. Completely unsuccessful commands and the time spent on them, including repairs, were factored out of this study (1% of all commands). The number of turns to complete each task is the sum of the total number of commands attempted and any repairs.</Paragraph>
      <Paragraph position="2">  Several treasures were derived from the dependent rrmasures. Turns per command (tpc) describes how many turns it takes to successfully complete a command. Turns per minute (tpm) measures the speed with which the user interacts. A multirnodal error rate was calculated based on how often repairs were  necessary. Commands per m/nute (cpm) represents the rate at which the subject is able to issue successful commands, estimating the collaborative effort.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="827" end_page="827" type="metho">
    <SectionTitle>
5 RESULTS
</SectionTitle>
    <Paragraph position="0"/>
    <Paragraph position="2"> These results show that when comparing late with early confirmation: 1) subjects complete commands in fewer turns (the error rate and tpc are reduced, resulting in a 30% error reduction); 2) they complete tums at a faster rate (tpm is increased by 21%); and 3) they complete more commands in less time (cpm is increased by 26%).</Paragraph>
    <Paragraph position="3"> These results confirm all of our predictions.</Paragraph>
  </Section>
class="xml-element"></Paper>