<?xml version="1.0" standalone="yes"?>
<Paper uid="P97-1036">
  <Title>Unification-based Multimodal Integration</Title>
  <Section position="4" start_page="0" end_page="281" type="metho">
    <SectionTitle>
J
</SectionTitle>
    <Paragraph position="0"> The multimodal command involves speech recognition of only a three word phrase, while the equivalent unimodal speech command involves recognition of a complex twenty four word expression. Furthermore, using unimodal speech to indicate more com- null plex spatial features such as routes and areas is practically infeasible if accuracy of shape is important. Another significant advantage of multimodal over unimodal speech is that it allows the user to switch modes when environmental noise or security concerns make speech an unacceptable input medium, or for avoiding and repairing recognition errors (Oviatt and Van Gent 1996). Multimodality also offers the potential for input modes to mutually compensate for each others' errors. We will demonstrate :~'~.,, in our system, multimodal integration allows speech input to compensate for errors in gesture recognition and vice versa.</Paragraph>
    <Paragraph position="1"> Systems capable of integration of speech and gesture have existed since the early 80's. One of the first such systems was the &amp;quot;Put-That-There&amp;quot; system (Bolt 1980). However, in the sixteen years since then, research on multimodal integration has not yielded a reusable scalable architecture for the construction of multimodal systems that integrate gesture and voice. There are four major limiting factors in previous approaches to multimodal integration: (1) The majority of approaches limit the bandwidth of the gestural mode to simple deictic pointing gestures made with a mouse (Neal and Shapiro 1991, Cohen 1991, Cohen 1992, Brison and Vigouroux (ms.), Wauchope 1994) or with the hand (Koons et al 19931).</Paragraph>
    <Paragraph position="2"> (ii) Most previous approaches have been primarily speech-driven ~ , treating gesture as a secondary dependent mode (Neal and Shapiro 1991, Cohen 1991, Cohen 1992, Brison and Vigouroux (ms.), Koons et al 1993, Wauchope 1994). In these systems, integration of gesture is triggered by the appearance of expressions in the speech stream whose reference needs to be resolved, such as definite and deictic noun phrases (e.g.</Paragraph>
    <Paragraph position="3"> 'this one', 'the red cube').</Paragraph>
    <Paragraph position="4"> (iii) None of the existing approaches provide a well-understood generally applicable common meaning representation for the different modes, or, (iv) A general and formally-welldefined mechanism for multimodal integration.</Paragraph>
    <Paragraph position="5"> I Koons et al 1993 describe two different systems. The first uses input from hand gestures and eye gaze in order to aid in determining the reference of noun phrases in the speech stream. The second allows users to manipulate objects in a blocks world using iconic and pantomimic gestures in addition to deictic gestures.</Paragraph>
    <Paragraph position="6"> ~More precisely, they are 'verbal language'-driven.</Paragraph>
    <Paragraph position="7"> Either spoken or typed linguistic expressions are the driving force of interpretation.</Paragraph>
    <Paragraph position="8"> We present an approach to multimodal integration which overcomes these limiting factors. A wide base of continuous gestural input is supported and integration may be driven by either mode. Typed feature structures (Carpenter 1992) are used to provide a clearly defined and well understood common meaning representation for the modes, and multi-modal integration is accomplished through unification. null</Paragraph>
  </Section>
  <Section position="5" start_page="281" end_page="282" type="metho">
    <SectionTitle>
2 QuickSet: A Multimodal Interface
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="281" end_page="282" type="sub_section">
      <SectionTitle>
for Distributed Interactive
Simulation
</SectionTitle>
      <Paragraph position="0"> The initial application of our multimodal interface architecture has been in the development of the QuickSet system, an interface for setting up and interacting with distributed interactive simulations.</Paragraph>
      <Paragraph position="1"> QuickSet provides a portal into LeatherNet 3, a simulation system used for the training of US Marine Corps platoon leaders. LeatherNet simulates training exercises using the ModSAF simulator (Courtemanche and Ceranowicz 1995) and supports 3D visualization of the simulated exercises using CommandVu (Clarkson and Yi 1996). SRI International's CommandTalk provides a unimodal spoken interface to LeatherNet (Moore et al 1997).</Paragraph>
      <Paragraph position="2"> QuickSet is a distributed system consisting of a collection of agents that communicate through the Open Agent Architecture 4 (Cohen et al 1994). It runs on both desktop and hand-held PCs under Windows 95, communicating over wired and wireless LANs (respectively), or modem links. The wireless hand-held unit is a 3-1b Fujitsu Stylistic 1000 (Figure 2). We have also developed a Java-based QuickSet agent that provides a portal to the simulation over the World Wide Web. The QuickSet user interface displays a map of the terrain on which the simulated military exercise is to take place (Figure 2). The user can gesture and draw directly on the map with the pen and simultaneously issue spoken commands. Units and objectives can be laid down on the map by speaking their name and gesturing on the desired location. The map can also be annotated with line features such as barbed wire and fortified lines, and area features such as minefields and landing zones. These are created by drawing the appropriate spatial feature on the map and speak- null ing its name. Units, objectives, and lines can also be generated using unimodal gestures by drawing their map symbols in the desired location. Orders can be assigned to units, for example, in Figure 2 an M1A1 platoon on the bottom left has been assigned a route to follow. This order is created multimodally by drawing the curved route and saying</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="282" end_page="283" type="metho">
    <SectionTitle>
'WHISKEY FOUR SIX FOLLOW THIS ROUTE'.
</SectionTitle>
    <Paragraph position="0"> As entities are created and assigned orders they are displayed on the UI and automatically instantiated in a simulation database maintained by the ModSAF simulator.</Paragraph>
    <Paragraph position="1"> Speech recognition operates in either a click-to-speak mode, in which the microphone is activated when the pen is placed on the screen, or open microphone mode. The speech recognition agent is built using a continuous speaker-independent recognizer commercially available from IBM.</Paragraph>
    <Paragraph position="2"> When the user draws or gestures on the map, the resulting electronic 'ink' is passed to a gesture recognition agent, which utilizes both a neural network and a set of hidden Markov models. The ink is sizenormalized, centered in a 2D image, and fed into the neural network as pixels, as well as being smoothed, resampled, converted to deltas, and fed to the HMM recognizer. The gesture recognizer currently recognizes a total of twenty six different gestures, some of which are illustrated in Figure 3. They include various military map symbols such as platoon, mortar, and fortified line, editing gestures such as deletion, and spatial features such as routes and areas.</Paragraph>
    <Paragraph position="3">  As with all recognition technologies, gesture recognition may result in errors. One of the factors  contributing to this is that routes and areas do not have signature shapes that can be used to identify them and are frequently confused (Figure 4).</Paragraph>
    <Paragraph position="5"> Another contributing factor is that users' pen input is often sloppy (Figure 5) and map symbols can be confused among themselves and with route and area gestures.</Paragraph>
    <Paragraph position="6"> mortar tank deletion mechanized platoon company  Given the potential for error, the gesture recognizer issues not just a single interpretation, but a series of potential interpretations ranked with respect to probability. The correct interpretation is frequently determined as a result of multimodal integration, as illustrated below 5.</Paragraph>
  </Section>
  <Section position="7" start_page="283" end_page="286" type="metho">
    <SectionTitle>
3 A Unification-based Architecture for Multimodal Integration
</SectionTitle>
    <Paragraph position="0"> for Multimodal Integration One the most significant challenges facing the development of effective multimodal interfaces concerns the integration of input from different modes. Input signals from each of the modes can be assigned meanings. The problem is to work out how to combine the meanings contribute d by each of the modes in order to determine what the user actually intends to communicate.</Paragraph>
    <Paragraph position="1"> To model this integration, we utilize a unification operation over typed feature structures (Carpenter 1990, 1992, Pollard and Sag 1987, Calder 1987, King SSee Wahlster 1991 for discussion of the role of dialog in resolving ambiguous gestures.</Paragraph>
    <Paragraph position="2"> 1989, Moshier 1988). Unification is an operation that determines the consistency of two pieces of partial information, and if they are consistent combines them into a single result. As such, it is ideally suited to the task at hand, in which we want to determine whether a given piece of gestural input is compatible with a given piece of spoken input, and if they are compatible, to combine the two inputs into a single result that can be interpreted by the system.</Paragraph>
    <Paragraph position="3"> The use of feature structures as a semantic representation framework facilitates the specification of partial meanings. Spoken or gestural input which partially specifies a command can be represented as an underspecified feature structure in which certain features are not instantiated. The adoption of typed feature structures facilitates the statement of constraints on integration. For example, if a given speech input can be integrated with a line gesture, it can be assigned a feature structure with an under-specified location feature whose value is required to be of type line.</Paragraph>
    <Paragraph position="4">  QuickSet system. Spoken and gestural input originates in the user interface client agent and it is passed on to the speech recognition and gesture recognition agents respectively. The natural language agent uses a parser implemented in Prolog to parse strings that originate from the speech recognition agent and assign typed feature structures to  them. The potential interpretations of gesture from the gesture recognition agent are also represented as typed feature structures. The multimodal integration agent determines and ranks potential unifications of spoken and gestural input and issues complete commands to the bridge agent. The bridge agent accepts commands in the form of typed feature structures and translates them into commands for whichever applications the system is providing an interface to.</Paragraph>
    <Paragraph position="5"> For example, if the user utters 'M1A1 PLA-TOON', the name of a particular type of tank platoon, the natural language agent assigns this phrase the feature structure in Figure 7. The type of each feature structure is indicated in italics at its bottom  right or left corner.</Paragraph>
    <Paragraph position="6"> object : echelon : platoon unit create_unit location : \] point  Since QuickSet is a task-based system directed toward setting up a scenario for simulation, this phrase is interpreted as a partially specified unit creation command. Before it can be executed, it needs a location feature indicating where to create the unit, which is provided by the user's gesturing on the screen. The user's ink is likely to be assigned a number of interpretations, for example, both a point interpretation and a line interpretation, which the gesture recognition agent assigns typed feature structures (see Figures 8 and 9). Interpretations of gestures as location features are assigned a general command type which unifies with all of commands taken by the system. \[ \[xcoord 9 30 \] \]  The task of the integrator agent is to field incoming typed feature structures representing interpretations of speech and of gesture, identify the best potential interpretation, multimodal or unimodal, and issue a typed feature structure representing the preferred interpretation to the bridge agent, which will execute the command. This involves parsing of the speech and gesture streams in order to determine potential multimodal integrations. Two factors guide this: tagging of speech and gesture as either complete or partial and examination of time stamps associated with speech and gesture.</Paragraph>
    <Paragraph position="7"> Speech or gesture input is marked as complete if it provides a full command specification and therefore does not need to be integrated with another mode.</Paragraph>
    <Paragraph position="8"> Speech or gesture marked as partial needs to be integrated with another mode in order to derive an executable command.</Paragraph>
    <Paragraph position="9"> Empirical study of the nature of multimodal interaction has shown that speech typically follows gesture within a window of a three to four seconds while gesture following speech is very uncommon (Oviatt et al 97). Therefore, in our multimodal architecture, the integrator temporally licenses integration of speech and gesture if their time intervals overlap, or if the onset of the speech signal is within a brief time window following the end of gesture. Speech and gesture are integrated appropriately even if the integrator agent receives them in a different order from their actual order of occurrence. If speech is temporally compatible with gesture, in this respect, then the integrator takes the sets of interpretations for both speech and gesture, and for each pairing in the product set attempts to unify the two feature structures. The probability of each multimodal interpretation in the resulting set licensed by unification is determined by multiplying the probabilities assigned to the speech and gesture interpretations.</Paragraph>
    <Paragraph position="10"> In the example case above, both speech and gesture have only partial interpretations, one for speech, and two for gesture. Since the speech interpretation (Figure 7) requires its location feature to be of type point, only unification with the point interpretation of the gesture will succeed and be passed on as a valid multimodal interpretation (Figure 10).</Paragraph>
    <Paragraph position="12"> The ambiguity of interpretation of the gesture was resolved by integration with speech which in this case required a location feature of type point. If the spoken command had instead been 'BARBED  WIRE' it would have been assigned the feature structure in Figure 11. This structure would only unify with the line interpretation of gesture resulting in the interpretation in Figure 12.</Paragraph>
    <Paragraph position="13">  Similarly, if the spoken command described an area, for example an 'ANTI TANK MINEFIELD' , it would only unify with an interpretation of gesture as an area designation. In each case the unification-based integration strategy compensates for errors in gesture recognition through type constraints on the values of features.</Paragraph>
    <Paragraph position="14"> Gesture also compensates for errors in speech recognition. In the open microphone mode, where the user does not have to gesture in order to speak, spurious speech recognition errors are more common than with click-to-speak, but are frequently rejected by the system because of the absence of a compatible gesture for integration. For example, if the system spuriously recognizes 'M1A1 PLATOON', but there is no overlapping or immediately preceding gesture to provide the location, the speech will be ignored. The architecture also supports selection among n-best speech recognition results on the basis of the preferred gesture recognition. In the future, n-best recognition results will be available from the recognizer, and we will further examine the potential for gesture to help select among speech recognition alternatives. null Since speech may follow gesture, and since even simultaneously produced speech and gesture are processed sequentially, the integrator cannot execute what appears to be a complete unimodal command on receiving it, in case it is immediately followed by input from the other mode suggesting a multimodal interpretation. If a given speech or gesture input has a set of interpretations including both partial and complete interpretations, the integrator agent waits for an incoming signal from the other mode. If no signal is forthcoming from the other mode within the time window, or if interpretations from the other mode do not integrate with any interpretations in the set, then the best of the complete unimodal interpretations from the original set is sent to the bridge agent.</Paragraph>
    <Paragraph position="15"> For example, the gesture in Figure 13 is used for unimodal specification of the location of a fortified line. If recognition is successful the gesture agent would assign the gesture an interpretation like that in Figure 14.</Paragraph>
    <Paragraph position="16">  However, it might also receive an additional potential interpretation as a location feature of a more general line type (Figure 15).</Paragraph>
    <Paragraph position="17">  On receiving this set of interpretations, the integrator cannot immediately execute the complete interpretation to create a fortified line, even if it is assigned the highest probability by the recognizer, since speech contradicting this may immediately follow. For example, if overlapping with or just after the gesture, the user said 'BARBED WIRE' then the line feature interpretation would be preferred. If speech does not follow within the three to four second window, or following speech does not integrate with the gesture, then the unimodal interpretation  is chosen. This approach embodies a preference for multimodal interpretations over unimodal ones, motivated by the possibility of unintended complete unimodal interpretations of gestures. After more detailed empirical investigation, this will be refined so that the possibility of integration weighs in favor of the multimodal interpretation, but it can still be beaten by a unimodal gestural interpretation with a significantly higher probability.</Paragraph>
  </Section>
</Paper>