<?xml version="1.0" standalone="yes"?>
<Paper uid="N04-1005">
  <Title>Balancing Data-driven and Rule-based Approaches in the Context of a Multimodal Conversational System</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 The MATCH application
</SectionTitle>
    <Paragraph position="0"> MATCH (Multimodal Access To City Help) is a working city guide and navigation system that enables mobile users to access restaurant and subway information for New York City (NYC) (Johnston et al., 2002b; Johnston et al., 2002a). The user interacts with a graphical interface displaying restaurant listings and a dynamic map showing locations and street information. The inputs can be speech, drawing on the display with a stylus, or synchronous multimodal combinations of the two modes.</Paragraph>
    <Paragraph position="1"> The user can ask for the review, cuisine, phone number, address, or other information about restaurants and subway directions to locations. The system responds with graphical callouts on the display, synchronized with synthetic speech output. For example, if the user says phone numbers for these two restaurants and circles two restaurants as in Figure 1 [a], the system will draw a callout with the restaurant name and number and say, for example Time Cafe can be reached at 212-533-7000, for each restaurant in turn (Figure 1 [b]). If the immediate environment is too noisy or public, the same command can be given completely in pen by circling the restaurants and</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 MATCH Multimodal Architecture
</SectionTitle>
      <Paragraph position="0"> The underlying architecture that supports MATCH consists of a series of re-usable components which communicate over sockets through a facilitator (MCUBE) (Figure 2). Users interact with the system through a Multi-modal User Interface Client (MUI). Their speech and ink are processed by speech recognition (Sharp et al., 1997) (ASR) and handwriting/gesture recognition (GESTURE, HW RECO) components respectively. These recognition processes result in lattices of potential words and gestures. These are then combined and assigned a meaning representation using a multimodal finite-state device (MMFST) (Johnston and Bangalore, 2000; Johnston et al., 2002b). This provides as output a lattice encoding all of the potential meaning representations assigned to the user inputs. This lattice is flattened to an N-best list and passed to a multimodal dialog manager (MDM) (Johnston et al., 2002b), which re-ranks them in accordance with the current dialogue state. If additional information or confirmation is required, the MDM enters into a short information gathering dialogue with the user. Once a command or query is complete, it is passed to the multimodal generation component (MMGEN), which builds a multimodal score indicating a coordinated sequence of graphical actions and TTS prompts. This score is passed back to the Multimodal UI (MUI). The Multimodal UI coordinates presentation of graphical content with synthetic speech output using the AT&amp;T Natural Voices TTS engine (Beutnagel et al., 1999). The subway route constraint solver (SUBWAY) identifies the best route between any two points in New York City.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Multimodal Integration and Understanding
</SectionTitle>
      <Paragraph position="0"> Our approach to integrating and interpreting multimodal inputs (Johnston et al., 2002b; Johnston et al., 2002a) is an extension of the finite-state approach previously proposed (Bangalore and Johnston, 2000; Johnston and Bangalore, 2000). In this approach, a declarative multimodal grammar captures both the structure and the interpretation of multimodal and unimodal commands. The grammar consists of a set of context-free rules. The multi-modal aspects of the grammar become apparent in the terminals, each of which is a triple W:G:M, consisting of speech (words, W), gesture (gesture symbols, G), and meaning (meaning symbols, M). The multimodal grammar encodes not just multimodal integration patterns but also the syntax of speech and gesture, and the assignment of meaning. The meaning is represented in XML, facilitating parsing and logging by other system components.</Paragraph>
      <Paragraph position="1"> The symbol SEM is used to abstract over specific content such as the set of points delimiting an area or the identifiers of selected objects. In Figure 3, we present a small simplified fragment from the MATCH application capable of handling information seeking commands such as phone for these three restaurants. The epsilon symbol (a0 ) indicates that a stream is empty in a given terminal.</Paragraph>
      <Paragraph position="3"> In the example above where the user says phone for these two restaurants while circling two restaurants (Figure 1 [a]), assume the speech recognizer returns the lattice in Figure 4 (Speech). The gesture recognition component also returns a lattice (Figure 4, Gesture) indicating that the user's ink is either a selection of two restaurants or a geographical area. The multimodal grammar (Figure 3) expresses the relationship between what the user said, what they drew with the pen, and their combined meaning, in this case Figure 4 (Meaning). The meaning is generated by concatenating the meaning symbols and replacing SEM with the appropriate specific content: a2 cmda3a7a2 infoa3a8a2 typea3 phone a2 /typea3a7a2 obja3 a2 resta3 [r12,r15] a2 /resta3a9a2 /obja3a10a2 /infoa3a11a2 /cmda3 . For the purpose of evaluation of concept accuracy, we developed an approach similar to (Boros et al., 1996) in which computing concept accuracy is reduced to comparing strings representing core contentful concepts. We extract a sorted flat list of attribute value pairs that represents the core contentful concepts of each command from the XML output. The example above yields the following meaning representation for concept accuracy.</Paragraph>
      <Paragraph position="5"> The multimodal grammar can be used to create language models for ASR, align the speech and gesture results from the respective recognizers and transform the multimodal utterance to a meaning representation. All these operations are achieved using finite-state transducer operations (See (Bangalore and Johnston, 2000; Johnston and Bangalore, 2000) for details). However, this approach to recognition needs to be more robust to extra-grammaticality and language variation in user's utterances and the interpretation needs to be more robust to speech recognition errors. We address these issues in the rest of the paper.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>