<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1413"> <Title>Constraints on the Use of Language, Gesture and Speech for Multimodal Dialogues</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> 1. Introduction </SectionTitle> <Paragraph position="0"> In the domain of natural language understanding, and more precisely man-machine dialogue design, there are usually two trends of research which seem to be rather distinct. On the one hand, many studies have tackled the problem of interpreting spatial references expressed in verbal utterances, focusing in particular on the different geometric or functional constraints which are bound to the existence of a &quot;source&quot; (or site) element in relation to which a &quot;target&quot; is being situated. Such studies are usually based upon fine-grained linguistic descriptions for different languages (Vandeloise, 1986). On the other hand, the problem raised by the integration of a gestural mode within classical NL interfaces has yielded some specific research about the association of demonstrative or deictic NPs together with designations, as initiated by Bolt some two decades ago (cf. Thorisson et alii, 1992; Bellalem and Romary, 1995). Our aim in this paper is to show that the different phenomena described in the context of spatial reference or multimodal interaction should not necessarily be considered as two independent issues, but should rather be analysed in a unified way to account for the fact that they are both based on linguistic and perceptual data.</Paragraph> <Paragraph position="1"> As a matter of fact, if we consider a situation of man-machine dialogue where the user is presented with a graphical representation of his task, it is clear that, given a certain informational content he wants to convey, he will essentially choose the referring mode which seems most relevant in the current communicative situation. 
For example, if we consider a graphical situation such as that described in figure 1.1, he may use either the black triangle, this triangle (+ pointing gesture), or the leftmost triangle to refer to the leftmost object, and it would be quite annoying to consider these different expressions as [...] In this context, we will try to show how language, gesture and perception can be seen in a uniform way from the perspective of referential analysis, even if in doing so we will have to look at the specific constraints which underlie the speaker's choice of a given expression. To this end, we will first quickly situate the relative importance of speech and gesture in man-machine communication. Then, we will concentrate upon the specific effects resulting from the combination of verbal, gestural and perceptual information, showing on the one hand that the three provide structural constraints on the objects which are being referred to, and on the other hand that any referring operation, whatever its origin, has to be interpreted within a localized frame, with some consequences upon dialogue management.</Paragraph> <Paragraph position="2"> 2. Several means to make a referring act When designating a given object within a visual environment, it seems at first sight that natural language provides incomparable means to do so as opposed to gesture. Beyond the different determiners which are present in most natural languages either explicitly or implicitly (indefinite, definite or demonstrative), nominal categories allow one to set the proper level of granularity corresponding to the intended object. Indeed, in a situation where a gesture would be ambiguous and point to the overall scene (a set of geometrical shapes), a specific object (a triangle) or any of its parts (a segment, a point, etc.), the sole phrase the triangle may directly designate what is intended. 1 In particular, Gricean maxims as well as relevance theory (Sperber and Wilson, 1986) would tend towards an analysis which compares the different referring expressions in terms of cognitive cost.</Paragraph> <Paragraph position="3"> Another important aspect is that pointing gestures 2, when used in the general framework of an oral dialogue, can seldom appear in isolation, whereas a definite description such as the blue triangle can clearly be expressed independently of any gesture. The reason for this is, as we said, that the intrinsic ambiguity of gesture implies that it should be complemented with a categorizing expression, but also that a gesture cannot very easily express an action to be performed upon the designated object, and thus also has to be complemented by a predicative utterance. In this latter case, it is hard to imagine that any combination will be possible between linguistic chunks and gestural acts. In particular, gesture can hardly fill a role which is mandatory for a given predicate, since it would lead to odd utterances such as ?give me the color of [pointing].</Paragraph> <Paragraph position="5"> [...] by means of formulae (at step b) such as: [...] Although these two referential expressions, if ever used, are unambiguous, we would certainly prefer such examples as: Figure 2.1.b: the leftmost triangles and [...]</Paragraph> <Paragraph position="6"> 3. 
Reference and contrast The schematic algorithm used in most dialogue systems to deal with referential NPs (when they get their reference within a visually presented context) can be expressed as: a) get all the indices from the expression; b) deduce from these indices some constraints which must be true in the visual representation; c) filter the referent(s) by means of these constraints.</Paragraph> <Paragraph position="7"> In such a framework, what would be expected as the system's perceptual abilities boils down to an ability to build the set of objects appearing on the screen. Such an approach would compute the &quot;correct&quot; referents in such [...] referring expression. The referring expression itself is not sufficient to establish such a contrast. If the user intends to refer to the first two triangles of figure 2.1.a, he would probably prefer an expression such as the two leftmost triangles or these triangles together with a &quot;peripheral&quot; designation, as we will describe later. Our claim about the necessity of a perceivable discrimination seems in accordance with what Robert Dale (Reiter and Dale 1992, Dale 1995) observes about referential expression generation. Just as we do, he argues that the relevance of a referring expression relies not only on its ability to filter a unique referent but also on its ability to establish a contrast in a contextual set of objects.</Paragraph> <Paragraph position="8"> Examples 2.1.b and 2.2.b rely on an already accessible discrimination based upon spatial cohesion. In such a case, the definite referring expression (the leftmost triangle) directly maps the spatial discrimination. Such examples show that any perceptually based discrimination (we could as well have a contrast in size, texture or whatever) is sufficient to justify the leftmost triangles (these triangles + gesture respectively) 3. 
If a dialogue system has to understand such expressions as those we have mentioned so far, it should therefore perceive its environment on a more &quot;user compatible&quot; basis. We suggest at least that perceptual contrasts should be taken into account in order to structure the set of visible objects.</Paragraph> <Paragraph position="9"> When no contrast pre-exists which would directly support an intended reference, we mentioned the possibility of building a group on the basis of individuals by such an explicit expression as &quot;the two leftmost triangles&quot;. A corresponding solution in terms of demonstrative use would be something like this triangle and this one (or these two triangles) together with two pointing gestures. Another solution consists in building the contrast by means of a &quot;peripheral designation&quot;, which justifies our claim about considering perception and designation on a unified contrastive basis. In order to argue for that claim, we will now reconsider gestures almost independently of the referential expressions they accompany.</Paragraph> <Paragraph position="10"> 4. Gesture and contrast Our analysis of demonstrative and definite NPs (when referring within a perceptual environment) relies on perceptually founded contrasts. The required precision of a designation gesture therefore depends upon these perceptual contrasts. In such an example as: [figure] the gesture only has to separate the two triangles along the horizontal, since the perceptual contrast relies upon a separation of the objects in that direction. No strict inclusion of the pointing into the left triangle is required. The situation depicted below [figure] relies on the same kind of horizontal discrimination: the only difference with the preceding example is that we refer to a cohesive group. The horizontal discrimination here identifies two groups, from which gesture only has to select one. However, such situations as: [figure] do not provide any perceptual grouping of something which would correspond to &quot;the upper and the rightmost triangles in the left group of three&quot;. If the user intends to refer to these two triangles, he has to build a discrimination into the group. A possible gesture to do that is depicted below: [figure] Such &quot;peripheral&quot; designations make up for the absence of a shared perceptual feature (such as colour), as they both gather up the two objects and put them into focus. The analysis of the whole intervention (the gesture plus the NP &quot;these triangles&quot;) is then of the same kind as its equivalent in 2.3.b. As such, we clearly see here that gesture, instead of just being another mode of communication, pertains to the same domain as perceptual information. Our analysis so far can thus be summarized as follows: * a contrast based on the category has to match a perceptual contrast in the case of simple definite NPs, thus meaning that perceivable triangles should be considered when analyzing the triangle(s) * a contrast based on salience has to match a perceptual contrast in the case of demonstrative NPs. As we only considered demonstratives plus gestures in this paper, the required salience is yielded by gesture itself * a spatial contrast has to match a perceptual contrast (not necessarily spatial) in the case of spatial definite NPs. 3 In some cases, the speaker has the possibility to make the contrastive feature explicit, e.g. the black triangles (fig. 2.3.b).</Paragraph> <Paragraph position="11"> The remaining problem is now to limit the context in which we consider these contrasts. There are expressions in dialogue corpora that cannot be properly understood if we do not take into account focusing phenomena as well as attentional contexts and visual capabilities. 
Moreover, as we will justify, in such reduced contexts, the functionality associated with the objects considered may introduce specific orientations.</Paragraph> <Paragraph position="12"> 5. Localizing spatial references Having shown that any spatial reference (in the broad sense we want to advocate) is based upon a structural organisation of a set of elements, we will now see how this very set plays a real role in contextualizing the referring process, with some consequences upon dialogue management. Indeed, all our examples so far were simple enough to imply that there was only one context in which to find the intended referent. On the contrary, if we consider a more complex situation taken from a Wizard of Oz simulation in the domain of interior furnishing (Dauchy et alii, 1993; Mignot et alii, 1993), we will see that our analysis should actually be taken a step further. Figure 4.1 exemplifies a typical situation that was presented to the user during the experiment, with an empty drawing room to be furnished using the presented elements.</Paragraph> <Paragraph position="13"> Following the observation that there should be a prior structure shared by both the speaker and the hearer for a spatial reference to be understood, we can quickly see that this structure can only be inferred within a localized context which first limits its extension, but also subsumes its general characteristics such as the categories of objects, their perceptual or functional properties, etc. Paradoxically, we could say that it is difficult to contrast objects having little or nothing in common, as there would be no reason for a speaker to compare them in any way. Besides, such contexts seem to have a certain amount of stability during a dialogue, as can be seen in the following example associated with figure 4.2:</Paragraph> <Paragraph position="15"> furnishing scenario.</Paragraph> <Paragraph position="16"> Here, it appears that the spatial reference in the second utterance is not computed globally on the visualized scene but upon a sub-space resulting from the interpretation of the first utterance and thus centered on the sofa. Such a sub-space is characterized by its spatial inclusion within that of the drawing room, but also by the different characteristics (especially functional ones) of the objects it contains.</Paragraph> <Paragraph position="17"> At this stage we can thus characterize a spatial referring operation as a double system of vertical and horizontal relationships within a context which encompasses the object being referred to, but also the set of alternatives, which are stated either explicitly or implicitly during the current referring act or the rest of the dialogue. Figure 4.3 summarizes these different constraints for a reference to object O1 within a context C, the alternatives being reduced here to a single object O2. [Running head, p. 98: B. Gaiffe and L. Romary]</Paragraph> <Paragraph position="19"> In this schema, f is the functional link (FL) which unites the different objects to their current context, and which has to be shared by the whole set of alternatives. R is the contrasting relation which allows the interpreter to isolate the referent of the expression from the other members of the set of alternatives. The very existence of both C and f is supposed to be implied by the nature of the referring expression. 
Similarly, the relation R is constrained by the type of the expression and can be further specified as follows:</Paragraph> <Paragraph position="21"> In this last case, the referential expression is usually explicit (e.g. the leftmost armchair) about the actual contrasting relation and thus makes the presence of alternatives all the more obvious. For the two other cases, it is usually through the following utterances that, as we have seen, we can justify the presence of the set of alternatives. As a matter of fact, one consequence of constraining a referential operation to a localized frame is that such frames have a certain amount of stability from utterance to utterance.</Paragraph> </Section></Paper>