File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/metho/97/w97-1403_metho.xml

<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1403">
  <Title>Towards Generation of Fluent Referring Action in Multimodal Situations</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
NTT Information and
Communication Systems Labs.
</SectionTitle>
    <Paragraph position="0"> Yokosuka, Kanagawa 239, JAPAN kato@nttnly.isl.ntt.co.jp</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="0" type="metho">
    <SectionTitle>
NTT Information and
Communication Systems Labs.
</SectionTitle>
    <Paragraph position="0"> Yokosuka, Kanagawa 239, JAPAN yukiko@nttnly.isl.ntt.co.jp</Paragraph>
  </Section>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Referring actions in multimodal situations can be thought of as linguistic expressions well coordinated with several physical actions. In this paper, we report, based on corpus examinations, what patterns of linguistic expressions are commonly used and how physical actions are temporally coordinated with them. In particular, by categorizing objects according to two features, visibility and membership, the schematic patterns of referring expressions are derived.</Paragraph>
    <Paragraph position="1"> The difference between the occurrence frequencies of those patterns in a multimodal situation and a spoken-mode situation explains the findings of our previous research.</Paragraph>
    <Paragraph position="2"> Implementation based on these results is ongoing.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> Many active studies have been conducted on the temporal coordination of natural language and visual information. The visual information considered includes pointing gestures (André &amp; Rist, 1996), facial expressions and iconic gestures (Cassell et al., 1994), and graphical effects such as highlighting and blinking (Dalal et al., 1996; Feiner et al., 1993).</Paragraph>
    <Paragraph position="1"> Among those we have been focusing on generating effective explanations by using natural language temporally coordinated with pictures and gestures.</Paragraph>
    <Paragraph position="2"> The experimental system we implemented is for explaining the installation and operation of a telephone with an answering machine feature, and simulates instruction dialogues performed by an expert in a face-to-face situation with a telephone in front of her (Kato et al., 1996). The system explains by using synthesized speech coordinated with pointing gestures from a caricatured agent and simulated operations implemented by the switching of figures. One of the important issues for enhancing this type of system is to shed light on what makes referring actions fluent in multimodal situations and to build a mechanism to generate such fluent actions.</Paragraph>
    <Paragraph position="3"> We also empirically investigated how communicative modes influence the content and style of referring actions made in dialogues (Kato &amp; Nakano, 1995). Experiments were conducted to obtain a corpus consisting of human-to-human instruction dialogues on telephone installation in two settings.</Paragraph>
    <Paragraph position="4"> One is a spoken-mode dialogue situation (SMD hereafter), in which explanations are given using just voice. The other is a multimodal dialogue situation (MMD hereafter), in which both voice and visual information, mainly the current state and outlook of the expert's telephone and her pointing gestures to it, can be communicated. Detailed analysis of the referring actions observed in that corpus revealed the following two properties.</Paragraph>
    <Paragraph position="5"> P1: The availability of pointing, i.e. communication through the visual channel, reduces the amount of information conveyed through the speech or linguistic channel. In initial identification, the usage of linguistic expressions on shape/size, characters/marks, and related objects decreases in MMD, while the usage of position information does not decrease.</Paragraph>
    <Paragraph position="6"> P2: In SMD, referring actions tend to be realized as an explicit goal and divided into a series of fine-grained steps. The participants try to achieve them step by step with many confirmations.</Paragraph>
    <Paragraph position="7"> Although our findings were very suggestive for analyzing the properties of referring actions in multimodal situations, they were still descriptive and not sufficient to allow their use in designing referring action generation mechanisms. As the next step, we have therefore been examining that corpus more closely and trying to derive some schemata of referring actions, which would be useful for the implementation of multimodal dialogue systems. This paper reports the results of these activities.</Paragraph>
    <Paragraph position="8"> Two short comments must be made to make our research standpoint clearer. First, our purpose is to generate referring actions that model human referring actions in mundane situations. Theoretically speaking, as Appelt pointed out, it is enough for referring to provide a description sufficient to distinguish one object from the other candidates (Appelt, 1985). For example, a pointing action to the object must be enough, or a description of the object's position, such as &amp;quot;the upper left button of the dial buttons&amp;quot;, also must be considered sufficient. However, we often observe referring actions that consist of a linguistic expression, &amp;quot;a small button with the mark of a handset above and to the left of the dial buttons&amp;quot;, accompanied by a pointing gesture. Such a referring action is familiar to us even though it is redundant from a theoretical viewpoint. Such familiar actions, which the recipient does not perceive as awkward, are called fluent in this paper. Our objective is to generate such fluent referring actions, and is rather different from those of (Appelt, 1985) and (Dale &amp; Haddock, 1991).</Paragraph>
    <Paragraph position="9"> Second, in our research, a referring action is considered as the entire sequence of actions needed for allowing the addressee to identify the intended object and incorporating its achievement into part of the participants' shared knowledge. In order to refer to an object in a box, an imperative sentence such as &amp;quot;Open the box, and look inside&amp;quot; may be used. Such a request shifts the addressee's attention, and to see it as a part of the referring action may be problematic. It is, however, reasonable to think that both the request for looking into the box and the assertion of the fact that an object is in the box come from different plans for achieving the same goal, identifying the object. As Cohen claimed that it is useful to understand referring expressions from the view-point of speech act planning (Cohen, 1984), it is not so ridiculous to go one step further and to consider the entire sequence of actions, including attention shifts, as an instance of a plan for object referring.</Paragraph>
    <Paragraph position="10"> Moreover, this approach better suits implementing a referring action generation mechanism as a planner.</Paragraph>
    <Paragraph position="11"> The next section describes what kinds of linguistic expression are used for referring actions in MMD and compares them with those in SMD. In particular, by categorizing objects according to two features: visibility and membership, schemata for object referring expressions of each category are derived. In the third section, how several kinds of actions such as pointing gestures are accompanied by such expressions is reported. In the fourth section, implementation of referring action generation is discussed based on our findings described thus far. Finally, in the last section, our findings are summarized and future work is discussed.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Linguistic expression in referring
actions
</SectionTitle>
    <Paragraph position="0"> Referring actions in multimodal situations can be thought of as linguistic expressions well coordinated with several physical actions. The linguistic expressions for referring to objects, referring expressions, are focused on in this section, and in the next section, how those expressions should be coordinated with actions is discussed.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Object categorization
</SectionTitle>
      <Paragraph position="0"> The top and left side views of the telephone used are shown in Fig. 1. Although the objects such as buttons can be identified by using several features such as position, color, shape, and size, the two features described below proved to be dominant in the referring expressions used.</Paragraph>
      <Paragraph position="1"> Visibility: Some objects are located on the side or back of the telephone, and can not be seen unless the body is turned over or looked into.</Paragraph>
      <Paragraph position="2"> Some objects lie underneath the cover, and opening that cover is needed in order to see them. Such objects are categorized into invisible ones and distinguished from visible ones, which are located on the top 1.</Paragraph>
      <Paragraph position="3"> Membership: Aligned buttons of the same shape and color are usually recognized as a group.</Paragraph>
      <Paragraph position="4"> Members of such a group are distinguished from isolated ones 2.</Paragraph>
      <Paragraph position="5"> In Fig. 1, socket 1 on the side is invisible and isolated, button 1 on the left of the top surface is visible and isolated, button 2 on the lower right of the top surface is a visible group member, and button 3 on the upper right is an invisible group member, as it is underneath a cassette cover that is usually closed.</Paragraph>
      <Paragraph position="6"> According to this categorization, we determined which patterns of referring expressions were frequently observed for each type of object. The patterns thus extracted can be expected to yield schemata for referring expression generation. (Footnote 1: As you have already realized, this feature is not intrinsic to the object, but depends on the state of the telephone when the object is referred to. Buttons underneath the cover are visible when the cover is open.)</Paragraph>
      <Paragraph position="7"> (Footnote 2: The recognition of a group may differ among people. In daily life, however, we believe an effective level of consensus can be attained.)</Paragraph>
      <Paragraph position="8"> Three explanations by each of five experts in two situations, MMD and SMD, i.e. fifteen explanations in each situation, were analyzed. The apprentices differed with each explanation. Every referring expression analyzed involved initial identification, which is used to make the first effective reference to an object and to introduce it into the explanation. All objects were referred to in a context in which the expert made the apprentice identify the object and then requested that some action be performed on it. All explanations were done in Japanese (footnote 3).</Paragraph>
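The two-feature categorization above maps directly onto the schema families reported in the following subsections. The sketch below is our own illustration (the class and function names are hypothetical, not from the paper), pairing each object category with the referring-expression schemata that sections 2.2 and 2.3 report for it.

```python
# Illustrative sketch only: the object categories of section 2.1 and the
# schema families reported for each of them in sections 2.2-2.3.
# All identifiers here (Obj, candidate_schemata) are ours, not the paper's.

from dataclasses import dataclass

@dataclass
class Obj:
    name: str
    visible: bool        # visibility: seen without turning the body or opening a cover?
    group_member: bool   # membership: part of a recognized group of aligned buttons?

def candidate_schemata(obj: Obj) -> list:
    if obj.visible and not obj.group_member:
        return ["RS1", "RS2", "RS11"]                      # visible isolated
    if not obj.visible and not obj.group_member:
        return ["RS1", "RS2", "RS11", "RS12"]              # invisible isolated
    if obj.visible and obj.group_member:
        return ["RS21", "RS22", "RS1'", "RS2'", 'RS1"', 'RS2"']  # visible group member
    # invisible group member: an imperative exhibiting the object
    # (as in RS11/RS12), then the visible-group-member patterns
    return ["exhibit request", "RS21", "RS22"]

print(candidate_schemata(Obj("socket 1", visible=False, group_member=False)))
```

The selection among the candidates for one category would then depend on the situation (MMD vs. SMD), as the occurrence counts in Tables 1-3 indicate.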
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Schemata for referring to visible
isolated objects
</SectionTitle>
      <Paragraph position="0"> Referring actions to visible isolated objects are rather simple and were basic for all cases. Two major patterns were observed and can be treated as the schemata. Table 1 shows these two schemata (footnote 4), called RS1 and RS2 hereafter. RS1 contains two sentences. The first asserts the existence of the object at a described position. The second is an imperative sentence requesting that an action be performed on the object identified. In the first sentence, a postpositional phrase describing the object position precedes the object description. The object description is a noun phrase that has modifiers describing features of the object, such as its color or size, followed by a head common noun describing the object category. That is, its structure is \[object description np\] \[feature description pp/comp\] * \[object class name n\]. In RS2, the imperative sentence requesting an action contains a description referring to the object. This description has the same structure as in RS1 shown above. In most cases, the first feature description is a description of the object position. In both schemata, the object position is conveyed first, other features second, and the requested action follows. This order of information flow seems natural for identifying the object and then acting on it. Examples of referring expressions (footnote 5) that fit these schemata are: (Footnote 3: Japanese is a head-final language. Complements and postpositional phrases on the left modify nouns or noun phrases on the right, and construct noun phrases. That is, a simplified version of Japanese grammar contains pp → np p, np → pp np, np → comp np, and np → n. Sentences are constructed by a rule, s → pp* v. The order of pps is almost free syntactically, being determined by pragmatic constraints. Older information precedes newer (Kuno, 1978).)</Paragraph>
      <Paragraph position="1"> (Footnote 4: Schemata are represented as sequences of terminal symbols and non-terminal symbols, each of which has the form \[semantic content syntactic category\], plus a schema ID. A slash appearing in a syntactic category means options rather than a slash feature.)</Paragraph>
      <Paragraph position="2"> (Footnote 5: All examples are basically extracted from the corpus examined. They were, however, slightly modified by ...) (1) daiarubotan no hidariue ni juwaki no dial-buttons upper-left LOC handset maaku ga tsuita chiisai botan mark SUBJ being-placed-on small button ga arimasu, sore wo oshi tekudasai.</Paragraph>
      <Paragraph position="3"> SUBJ exist, it OBJ push REQUEST 'On the upper left of the dial buttons, there is a small button with the mark of a handset. Please push it.' (2) daiarubotan no hidariue no juwaki no maaku dial-buttons upper-left handset mark ga tsuita chiisai botan wo oshi SUBJ being-placed-on small button OBJ push tekudasai.</Paragraph>
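Read as templates, the two schemata differ only in whether existence assertion and action request are split into two sentences. The following is a minimal sketch of ours (not the authors' generator), whose fillers reproduce the English glosses of examples (1) and (2):

```python
# Minimal sketch (ours, not the authors' implementation): RS1 and RS2 as
# string templates over a position description, an object description,
# and a requested action.

def rs1(position: str, description: str, action: str) -> str:
    # RS1: first assert existence at a position, then request the action.
    return f"On {position}, there is a {description}. Please {action} it."

def rs2(position: str, description: str, action: str) -> str:
    # RS2: a single imperative; the position appears as the first
    # feature description inside the object NP.
    return f"Please {action} the {description} on {position}."

pos = "the upper left of the dial buttons"
desc = "small button with the mark of a handset"
print(rs1(pos, desc, "push"))
print(rs2(pos, desc, "push"))
```

A real generator would of course build the Japanese NP structure given in footnote 3 rather than English strings; the sketch only shows the shared information order (position first, other features, then action).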
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
REQUEST
</SectionTitle>
    <Paragraph position="0"> 'Please push the small button with the mark of a handset on the upper left of the dial buttons.' In RS1, the achievement of identification is confirmed by making the first sentence a tag question or putting a phrase for confirmation after that sentence. Sometimes it is implicitly confirmed by asserting the existence of the object as the speaker's belief. In RS2, confirmation can be made by putting a phrase for confirmation after the noun phrase describing the object or by putting there a pause and a demonstrative pronoun appositively.</Paragraph>
    <Paragraph position="1"> Another pattern was observed in which RS1 was preceded by an utterance referring to a landmark used in the position description. This is also shown in Table 1 as RS11. In RS11, reference to the landmark is realized by an imperative sentence that directs attention to the landmark or a tag question that confirms its existence. Examples are</Paragraph>
    <Paragraph position="3"> hontai no hidariue wo mi tekudasai, soko body upper-left oaa look REQUEST there ni chiisai botan ga arimasu.</Paragraph>
    <Paragraph position="4"> LOC small button SUBJ exist 'Please look at the upper left of the body. There is a small button there.' daiarubotan no 1 arimasu yone. sono dial-button 1 exist CONFIRM its hidariue ni chiisai botan ga arimasu.</Paragraph>
    <Paragraph position="5"> upper-left LOC small button suBJ exist 'There is dial button 1, isn't there? On its upper left, there is a small button.' Table 1 shows the numbers of occurrences of each pattern in MMD and SMD. The total occurrence number was 30, as two objects fell under this category. RS11 and RS1 frequently occur in SMD. removing non-fluencies and the diversities caused by the factors mentioned in section 2.4 below.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.3 Schemata for referring to invisible
objects and group members
</SectionTitle>
      <Paragraph position="0"> Five objects fell into the category of invisible isolated objects. The two schemata described in the previous subsection, RS1 and RS2, were used for referring to these objects, conveying which surface the object was located on as the position description. For example, (5) hontai no hidarigawa ni sashikomiguchi ga body left-side LOC socket SUBJ arimasu, soko ni sore wo ire tekudasai.</Paragraph>
      <Paragraph position="1"> exist there LOC it OBJ put REQUEST 'There is a socket on the left side of the body. Please put it there.' (6) sore wo hontai hidarigawa no sashikomiguchi</Paragraph>
      <Paragraph position="3"> 'Please put it into the socket on the left side of the body.' In addition, RS11 and its RS2 counterpart, RS12, were used frequently. In these patterns, the surface on which the object is located is referred to in advance. This is achieved by an imperative sentence that directs attention to the surface or asks that the body be turned, or by a description of the side followed by a confirmation. Examples are (7) hontai hidarigawa no sokumen wo mi body left side OBJ look tekudasai, soko ni ...</Paragraph>
      <Paragraph position="4"> REQUEST there LOC ...</Paragraph>
      <Paragraph position="5"> 'Please look at the left side of the body. On that side, ...' (8) hontai no hidari sokumen desu ne.</Paragraph>
      <Paragraph position="6"> body left side COPULA CONFIRM soko no mannakani ...</Paragraph>
      <Paragraph position="7">  there center LOC...</Paragraph>
      <Paragraph position="8"> 'The left side of the body, you see? On the center of that side, ...' Table 2 shows the schemata based on these patterns and their numbers of occurrence; the total is 75. RS2 is frequently used in MMD, while RS11 is frequently used in SMD.</Paragraph>
      <Paragraph position="9"> For referring to a visible group member, patterns are observed in which the group the object belongs to is referred to as a whole, in advance, and then the object is referred to as a member of that group. The first sentence of RS1 is mainly used for referring to the group as a whole. For example, (9) daiarubotan no shita ni onaji iro no dial-buttons below LOC SAME color botan ga itsutsu narande imasu.</Paragraph>
      <Paragraph position="10"> buttons SUBJ five aligned be 'Below the dial buttons, there are five buttons of the same color.' After this, RS1 or RS2 follows. These patterns, hereafter called RS21 and RS22, respectively, are shown in Table 3. In each pattern, the relative position of the object in the group is used as the position information conveyed later. In RS21, the following sentence, for example, follows the above.</Paragraph>
      <Paragraph position="11"> (10) sono migihashi ni supiika no maaku ga those right-most LOC speaker mark SUBJ tsuita botan ga arimasu being-placed-on button SUBJ exist 'On the right most of those, there is a button with the mark of a speaker.' RS1 and RS2, in which a referring expression to a group does not constitute an utterance by itself, are also observed, such as (11) ichiban-shita no botan no retsu no migihashi bottom buttons line right-most ni supiika no maaku ga tsuita LOC speaker mark SUBJ being-placed-on botan ga arimasu.</Paragraph>
      <Paragraph position="12"> button SUBJ exist 'On the right most of the line of buttons on the bottom, there is a button with a mark of a speaker.' In the above, although the expression referring to the group is part of the expression referring to the member, the information that the object is a member of a specific group is conveyed, and the position relative to the group is used for describing the object's position. There are other patterns which do not contain such descriptions of groups at all. For example, (12) hontai migishita no supiika botan wo oshi body right-lower speaker button OBJ push tekudasai.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
REQUEST
</SectionTitle>
    <Paragraph position="0"> &amp;quot;Push the speaker button on the lower right of the body.' According to this characteristic, RS1 and RS2 are divided into two patterns. RS1 and RS2 with descriptions of a group are called RSI' and RS2' respectively, and RS1 and RS2 without descriptions of a group are called RSI&amp;quot; and RS2&amp;quot;. Table 3 shows the numbers of occurrence. The total number is 60, as four objects fell into this category 6. RSI&amp;quot; and RS2&amp;quot; are frequently observed in MMD, while RS21 and RS22 are frequently observed in SMD.</Paragraph>
    <Paragraph position="1"> Just one object was an invisible group member in our corpus. It was the button underneath the cassette cover. All referring expressions in both MMD and SMD contain an imperative sentence requesting that the cassette cover be opened. It is considered that this imperative sentence corresponds to the imperative sentences in RS11 and RS12 that direct attention to the side of the body or ask that the body be turned. Subsequent referring expressions follow the same patterns as for visible group members. The distribution of the patterns is also similar. That is, the schemata for referring to invisible group members are obtained as combinations of those for invisible objects and group members.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.4 Factors that complicate referring
expressions
</SectionTitle>
      <Paragraph position="0"> The previous two subsections derived the schemata for referring expressions in line with the objects' categorization based on two features. The schemata are just skeletons, and referring expressions with more diverse forms appear in the collected corpus. (Footnote 6: ...suggests that referring expressions should be affected by the history of the group as well as of the object itself.) The most important origin of this diversity is that explanation dialogue is a cooperative process (Clark &amp; Wilkes-Gibbs, 1990). First, several stages of a referring action can trigger confirmation. Those confirmations are realized by using various linguistic devices such as interrogative sentences, tag questions, and specific intonations. Second, related to incremental elaboration, appositive and supplemental expressions are observed. For example, (13) rusu botan arimasu ne, gamen no OUT button exist CONFIRM display shita, &amp;quot;rusu&amp;quot; to kakareta shiroi botan.</Paragraph>
      <Paragraph position="1"> under &amp;quot;OUT&amp;quot; with being-labeled white button 'There is an OUT button, under the display, a white button labeled &amp;quot;OUT.&amp;quot;' These inherent dialogue features complicate referring expressions. Moreover, it is difficult to derive patterns from exchanges in which the apprentice plays a more active role such as talking about or checking her idea on the procedure in advance.</Paragraph>
      <Paragraph position="2"> The second origin of diversity relates to the fact that experts sometimes try to achieve multiple goals at the same time. Labeling an object with a proper name is sometimes achieved simultaneously with identifying it. This phenomenon, however, could be schematized to some extent. Two patterns are observed. One is to put a labeling sentence such as &amp;quot;This is called the speaker button&amp;quot; after the first sentence in RS1 or after the noun phrase describing the object in RS2. The other is to use a proper name as the head of the noun phrase describing the object.</Paragraph>
      <Paragraph position="3"> An example is &amp;quot;the speaker button with the mark of a speaker&amp;quot;.</Paragraph>
      <Paragraph position="4"> The third origin is the effect of the dialogue context which is determined external to the referring expressions. For example, almost half of the referring expressions categorized into Others in the above tables fit one of the following two patterns, called RS3 hereafter.</Paragraph>
      <Paragraph position="5"> \[object function pp/comp\] \[object np\] ga(SUBJ) \[position np\] ni(LOC) arimasu(exist).</Paragraph>
      <Paragraph position="6"> \[description of the features of the object s\] * (Table 3: RS21 = \[referring to the group s\], RS1; RS22 = \[referring to the group s\], RS2)</Paragraph>
      <Paragraph position="8"> \[object function pp/comp\] \[object np\] ga(SUBJ) \[position pp/comp\] \[object description np\] desu(COPULA).</Paragraph>
      <Paragraph position="9"> Both patterns, which assert the features of the object including its position, handle the availability of the object as old information. Examples of RS3 are (14) onryou wo chousetsusuru botan ga volume OBJ control button SUBJ daiarubotan no hidariue ni arimasu.</Paragraph>
      <Paragraph position="10"> dial-buttons upper-left LOC exist 'The button for controlling the volume is located to the upper left of the dial buttons.' (15) sonotame no botan ga daiarubotan no for-it button suBJ dial-buttons hidariue ni aru chiisai botan desu.</Paragraph>
      <Paragraph position="11"> upper-left LOC exist small button COPULA 'The button for it is the small button to the upper left of the dial buttons.' These patterns are used when the existence of a specific function, or of an object used for such a function, was previously asserted. In those cases, as such information is old, RS3 is appropriate, while all the other schemata described above are not. Although it should be possible to classify pattern RS3 into smaller classes and to discuss their occurrence frequencies and the situations in which they occur, the small numbers involved prevented further investigation.</Paragraph>
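The two RS3 variants can likewise be read as templates in which the object's availability is old information and its position is the new, asserted part. The sketch below is our own rendering (function names are hypothetical), filled so as to reproduce the English glosses of examples (14) and (15):

```python
# Sketch (ours, not the authors' generator): RS3 treats the object's
# availability as old information and asserts its features, notably
# its position, as in examples (14)-(15).

def rs3_assert(function: str, position: str) -> str:
    # RS3, existential variant: "[object function] [object] SUBJ [position] LOC exist"
    return f"The button for {function} is located {position}."

def rs3_copula(function: str, position: str, description: str) -> str:
    # RS3, copular variant: "[object function] [object] SUBJ [position] [description] COPULA"
    return f"The button for {function} is the {description} {position}."

print(rs3_assert("controlling the volume", "to the upper left of the dial buttons"))
print(rs3_copula("it", "to the upper left of the dial buttons", "small button"))
```

The choice of RS3 over RS1/RS2 would be driven by the dialogue context: it applies only when the function or object was previously asserted, i.e. when its availability is already shared knowledge.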
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.5 Relation to previous research
</SectionTitle>
      <Paragraph position="0"> The occurrence frequency of each schema listed above supports the findings of our previous research, summarized as P1 and P2 in the introduction. In RS1 and RS2, which are the basis of all schemata, the object position is almost always conveyed under the guidance of the schemata themselves. In particular, it is mandatory in RS1. So the amount of information conveyed for identifying objects (how much is needed depends, as a matter of course, on the modes available) is controlled by feature descriptions other than position information. This causes P1, the property that the usage of position information does not decrease in MMD, while other kinds of information do decrease. In addition, this property is seen even more strongly in MMD: RS1&amp;quot; and RS2&amp;quot; are used frequently, wherein a group member is referred to directly and the group is not mentioned.</Paragraph>
      <Paragraph position="1"> In SMD, RS11/RS12 and RS21/RS22 are used more frequently than in MMD. This means that references to the surface where the object is located and to the group it belongs to tend to be made in an utterance different from the one referring to the object itself. In addition, RS1 also appears more frequently in SMD than in MMD. This means an identification request and an action request are made separately.</Paragraph>
      <Paragraph position="2"> These are indications of P2, the property that actions tend to be realized as an explicit goal and divided into a series of fine-grained steps in SMD.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Actions coordinated with
reference expressions
</SectionTitle>
    <Paragraph position="0"> In MMD, several kinds of physical actions accompany referring expressions. Proper coordination between such physical actions and linguistic expressions makes the referring actions fluent. In addition, referring expressions in MMD frequently use demonstratives such as &amp;quot;kore (this)&amp;quot; and &amp;quot;koko (here)&amp;quot; in relation to these actions. Investigating the constraints or patterns of this coordination, and applying them to the schemata of referring expressions, makes it possible to generate fluent action statements.</Paragraph>
    <Paragraph position="1"> Physical actions in referring actions in MMD are divided into the following three categories.</Paragraph>
    <Paragraph position="2"> Exhibit actions: Actions for making an object visible, such as turning the body or opening a cassette cover (footnote 7).</Paragraph>
    <Paragraph position="3"> (Footnote 7: Exhibit actions contain both general actions, like turning the body, and machine-specific actions, like opening the cassette cover. There may be some differences between these two types of actions. For example, in referring expressions, the latter is usually requested directly by an imperative sentence, while the former is requested indirectly by directing attention to a specific side or implicitly by mentioning that side.)</Paragraph>
    <Paragraph position="4"> (Figure 2 example: &amp;quot;honntai hidarigawa no ... ire tekudasai, soshitara mou ittan wo koko no sashikomiguchi ni&amp;quot;, body left-side ... put REQUEST, then the other end OBJ here socket LOC: 'Then, put the other end into this socket on the left side of the body.') \[...\] and picking up a handset. In instruction dialogues, experts sometimes just simulate these actions without actual execution.</Paragraph>
    <Paragraph position="5"> This section reports the patterns of temporal coordination of these actions with linguistic expressions, based on the observation of the corpus. Videotapes of just 48 referring actions (4 experts referred to 12 objects once each) were examined. As the amount of data is so small, we provide only a qualitative discussion.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Exhibit actions
</SectionTitle>
      <Paragraph position="0"> Only invisible objects need exhibit actions when they are referred to. When those objects are referred to, whichever schema listed above is used, the information about the position or surface where the object is located is conveyed ahead of the other features of the object. That is, letting the linguistic expression just before the referring expression be Lbfr, the position description be Lpos, and the object description be Lobj, the temporal relation among them can be summarized as follows using Allen's temporal logic (Allen, 1984).</Paragraph>
      <Paragraph position="1"> Lbfr before Lpos before Lobj. Accompanying these expressions, the exhibit action Ae and the pointing gesture Ap have the following relations: Lobj starts Ap; Lbfr before Ae before Lobj; Lpos overlaps | overlaps^-1 | during | during^-1 Ae. The pointing gesture to the object begins at the same time as the object description. The exhibit action is done between the end of the utterance just before the referring action and the beginning of the object description. The exhibit action and the position description are related only loosely: there may be a temporal overlap between them, or one may be done during the other. A more precise relation than this could not be concluded. In order to keep these relations, pauses of a proper length are put before and/or after the position description if needed.</Paragraph>
      <Paragraph position="2"> Fig. 2 shows a schematic depiction of the above relations and an example taken from the corpus.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Pointing gestures and simulated
operations
</SectionTitle>
      <Paragraph position="0"> Pointing gestures are faithfully synchronized with linguistic expressions. During action statements, almost all descriptions of objects or positions are accompanied by pointing gestures. Landmarks and object groups are also pointed to. When a pointing gesture is made, it is made to the currently described object; pointing gestures to objects other than the currently described one never occur. One exception to this constraint is scheme RS3: when the subject part of RS3, which is an object description, is uttered, no pointing gesture is provided, and a pointing gesture begins only as the position description begins.</Paragraph>
      <Paragraph position="1"> The linguistic description of an object, Lobj, and a pointing gesture to it, Ap, basically satisfy the temporal relation Lobj starts Ap. That is, Lobj and Ap begin at the same time, but Ap lasts longer. However, the constraint mentioned above, that pointing gestures to objects other than the currently described one never occur, overrides this relation. As a result, the pointing gesture to an object generally begins only after references to other objects have finished. Since other objects are usually mentioned as landmarks for describing the object's position, the pointing gesture to the object begins midway through the position description.</Paragraph>
      <Paragraph position="2"> Ap usually lasts beyond the end of Lobj. In particular, a pointing gesture to the main object of a referring expression lasts until the utterance ends and the addressee acknowledges it. Thus, in the case of RS1, the pointing gesture lasts until the end of the sentence that asserts the object's existence.</Paragraph>
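The gesture-timing rule described in the last two paragraphs can be sketched as a small planning function. This is our illustration under stated assumptions, not the authors' implementation: times are in seconds, and `plan_pointing` is a hypothetical helper name.

```python
# Sketch of the pointing-gesture timing rule: the gesture basically
# satisfies "Lobj starts Ap", but it may not begin while other objects
# (landmarks) are still being referred to, and it lasts until the whole
# utterance ends.
def plan_pointing(obj_start, utterance_end, landmark_refs):
    """landmark_refs: list of (start, end) times of references to other objects."""
    if landmark_refs:
        # override: begin only after the last landmark reference finishes,
        # which typically falls midway through the position description
        gesture_start = max(end for _start, end in landmark_refs)
    else:
        gesture_start = obj_start  # plain case: Lobj starts Ap
    return (gesture_start, utterance_end)

# Landmarks mentioned inside the position description delay the gesture:
assert plan_pointing(3.0, 6.0, [(1.0, 2.0), (2.2, 2.6)]) == (2.6, 6.0)
# With no landmarks, the gesture starts with the object description:
assert plan_pointing(3.0, 6.0, []) == (3.0, 6.0)
```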
      <Paragraph position="3"> When two or more noun phrases or postpositional phrases describing the same object are uttered in succession, as in appositive expressions, the pointing gesture is suspended at the end of one phrase and resumed at the beginning of the next. This is especially prominent when the next phrase begins with a demonstrative such as &amp;quot;this&amp;quot;.</Paragraph>
      <Paragraph position="4"> Simulated operations are synchronized with the verb describing the corresponding operation, and this synchronization is more precise than that of exhibit actions. Since a simulated operation such as pushing a button looks similar to a pointing gesture, a suspension and resumption like the one mentioned above is probably performed to distinguish the two.</Paragraph>
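The suspend-and-resume behaviour can be contrasted with a continuous gesture in a toy sketch. The function names and intervals are our own hypothetical stand-ins, not from the paper.

```python
# Toy sketch of the suspend/resume pattern for appositive expressions:
# one gesture segment per co-referent phrase, with a visible gap between
# segments, versus a single continuous gesture.
def suspended_gesture(phrase_intervals):
    # one (start, end) gesture segment per phrase; the gaps between
    # phrases are left gesture-free so the resumption is noticeable
    return list(phrase_intervals)

def continuous_gesture(phrase_intervals):
    # a single segment spanning all co-referent phrases
    return [(phrase_intervals[0][0], phrase_intervals[-1][1])]

# e.g. "the socket ..." followed by the appositive "this one here"
appositive = [(0.0, 1.0), (1.3, 2.1)]
assert suspended_gesture(appositive) == [(0.0, 1.0), (1.3, 2.1)]
assert continuous_gesture(appositive) == [(0.0, 2.1)]
```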
      <Paragraph position="5"> Fig. 3 shows an example taken from the corpus. In this example, it is not clear whether the last action is a pointing gesture or a simulated operation.</Paragraph>
    </Section>
  </Section>
  <Section position="9" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Discussion on implementation
</SectionTitle>
    <Paragraph position="0"> We have begun to implement a referring action generation mechanism using the schemata derived and the coordination patterns described so far. The experimental system we are now developing shows a GIF picture of the telephone and draws a caricatured agent over it. Pointing gestures are realized by redrawing the agent. Since every picture is annotated with the positions of the objects it contains, generating a pointing gesture and interpreting the user's pointing gestures are both possible and easy. Other actions, such as turning the body and opening the cassette cover, are realized by playing a corresponding digital movie at exactly the same place as the displayed GIF picture. The first frame of the digital movie is the same as the GIF picture shown at that point in time, and while the movie is being played, the background picture is switched to the one equal to the last frame of the movie. Fig. 4 depicts this mechanism. These actions take considerable time, as they do for human experts.</Paragraph>
    <Paragraph position="1"> This is in contrast to our previous system, which implemented such actions by switching pictures, so the time taken was negligible.</Paragraph>
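The picture/movie mechanism of Fig. 4 can be sketched as follows. All names here (`Display`, `play_action_movie`, the file names) are our own hypothetical stand-ins, not the system's actual API; the point is only the invariant that a movie starts from the current picture and leaves the background equal to its last frame.

```python
# Schematic sketch of the Fig. 4 mechanism: a movie is modelled as a
# (first_frame, last_frame) pair of picture names.
class Display:
    def __init__(self, picture):
        self.picture = picture   # currently shown background GIF picture
        self.annotations = {}    # object name to position, per picture

    def play_action_movie(self, movie):
        # The movie's first frame equals the currently shown picture, and
        # while it plays, the background is switched to the picture equal
        # to its last frame, so the transition is seamless.
        first_frame, last_frame = movie
        assert first_frame == self.picture, "movie must start from current picture"
        self.picture = last_frame
        return self.picture

d = Display("cover_closed.gif")
d.play_action_movie(("cover_closed.gif", "cover_open.gif"))
assert d.picture == "cover_open.gif"
```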
    <Paragraph position="2"> The framework adopted for coordinating utterances and actions is synchronization by reference points (Blakowski, 1992). The beginning and end points of intonational phrases must be eligible as reference points. It is still under consideration whether simple waiting, in which pauses are inserted after whichever of the action or the utterance finishes earlier, is enough, or whether more sophisticated operations such as acceleration and deceleration, i.e. changing the utterance speed, are needed. The need for dynamic planning, as used in PPP (Andre &amp; Rist, 1996), should also be examined.</Paragraph>
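The "simple waiting" option can be sketched in a few lines. This is our illustration of reference-point synchronization by pausing, under assumed inputs (durations in seconds), not the system's code.

```python
# Sketch of synchronization by waiting: given the durations of a speech
# segment and an action that must meet at the next reference point,
# insert a pause after whichever finishes earlier.
def pause_lengths(speech_duration, action_duration):
    diff = abs(speech_duration - action_duration)
    if speech_duration > action_duration:
        # action finishes earlier: hold the action until speech catches up
        return {"pause_after_action": diff, "pause_after_speech": 0.0}
    # speech finishes earlier (or they tie): pause the speech instead
    return {"pause_after_action": 0.0, "pause_after_speech": diff}

# speech takes 2.0 s, the movie 3.5 s: wait 1.5 s after speaking
assert pause_lengths(2.0, 3.5) == {"pause_after_action": 0.0,
                                   "pause_after_speech": 1.5}
```

Acceleration and deceleration would instead scale the durations toward each other before any residual pause is inserted.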
  </Section>
</Paper>