<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0701"> <Title>Combining Semantic and Temporal Constraints for Multimodal Integration in Conversation Systems</Title> <Section position="2" start_page="0" end_page="1" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In a multimodal conversation, user referring patterns could be complex, involving multiple referring expressions from speech utterances and multiple gestures.</Paragraph> <Paragraph position="1"> To resolve those references, multimodal integration based on semantic constraints is insufficient. In this paper, we describe a graph-based probabilistic approach that simultaneously combines both semantic and temporal constraints to achieve a high performance.</Paragraph> <Paragraph position="2"> Introduction Multimodal conversation systems allow users to converse with systems through multiple modalities such as speech, gesture and gaze (Cohen et al., 1996; Wahlster, 1998). In such an environment, not only are more interaction modalities available, but also richer contexts are established during the interaction. Understanding user inputs, for example, what users refer to is important. Previous work on multimodal reference resolution includes the use of a focus space model (Neal et al., 1998), the centering framework (Zancanaro et al., 1997), context factors (Huls et al., 1995), and rules (Kehler 2000). These previous approaches focus on semantics constraints without fully addressing temporal constraints. In a user study , we found that the majority of user referring behavior involved one referring expression and one gesture (as in [S2, G2] in Table 1). The earlier approaches worked well for these types of references.</Paragraph> <Paragraph position="3"> However, we found that 14.1% of the inputs were complex, which involved multiple referring expressions from speech utterances and multiple gestures (S3 in Table 1). To resolve those complex references, we have to not only apply semantic constraints, but also apply temporal constraints at the same time.</Paragraph> <Paragraph position="4"> For example, Figure 1 shows three inputs where the number of referring expressions is the same and the number of gestures is the same. The speech utterances and gestures are aligned along the time axis. The first case (Figure 1a) and the second case (Figure 1b) have the same speech utterance but different temporal alignment between the gestures and the speech input. The second case and the third case (Figure 1c) have a similar alignment, but the third case provides an additional constraint on the number of referents (from the word &quot;two&quot;). Although all three cases are similar, but the objects they refer to are quite different in each case. In the first case, most likely &quot;this&quot; refers to the house selected by the first point gesture and &quot;these houses&quot; refers to two houses selected by the other two gestures. In the second case, &quot;this&quot; most likely refers to the highlighted house on the display and &quot;these houses&quot; refer to three houses selected by the gestures. In the third case, &quot;this&quot; most likely refers to the house selected by the first point gesture and &quot;these two houses&quot; refers to two houses selected by the other two point gestures.</Paragraph> <Paragraph position="5"> mul. gest.</Paragraph> <Paragraph position="6"> one gest We are developing a system that helps users find real estate properties. So here we use real estate as the testing domain. 
<Paragraph position="7"> Resolving these complex cases requires simultaneously satisfying the semantic constraints from the inputs and the interaction context, and the temporal constraints between speech and gesture.</Paragraph>
</Section>
</Paper>
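As a further illustration of the closing point, here is a minimal, hypothetical sketch of scoring a candidate referent by combining a semantic compatibility term with a temporal proximity term. The Gaussian form, the weights, and all names are assumptions made for illustration and are not taken from the paper.

```python
# Hypothetical sketch: combine semantic and temporal compatibility into one
# score for a (referring expression, candidate object) pair. Illustrative only.
import math

def semantic_score(expected_type, obj_type):
    """Full credit when the object's type matches what the expression
    expects (e.g. "these houses" vs. a house object), small credit otherwise."""
    return 1.0 if expected_type == obj_type else 0.1

def temporal_score(expr_time, gesture_time, sigma=0.5):
    """Gaussian-shaped closeness of the selecting gesture to the expression."""
    return math.exp(-((expr_time - gesture_time) ** 2) / (2 * sigma ** 2))

def joint_score(expr, obj, w_sem=0.5, w_temp=0.5):
    """Weighted combination of both constraint types."""
    return (w_sem * semantic_score(expr["type"], obj["type"])
            + w_temp * temporal_score(expr["time"], obj["gesture_time"]))

# "this" uttered at t = 0.2 s; two candidate houses selected by gestures.
expr = {"type": "house", "time": 0.2}
candidates = [{"id": "house_1", "type": "house", "gesture_time": 0.3},
              {"id": "house_2", "type": "house", "gesture_time": 1.9}]
best = max(candidates, key=lambda o: joint_score(expr, o))
print(best["id"])  # house_1: same type and temporally closest to "this"
```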