<?xml version="1.0" standalone="yes"?> <Paper uid="W05-1608"> <Title>Incremental Generation of Multimodal Deixis Referring to Objects</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Incremental Multimodal Content Selection </SectionTitle> <Paragraph position="0"> We integrate in the incremental algorithm by Dale and Reiter an evaluation of the spatial property location, either to be uttered absolutely by a pointing gesture or to be expressed verbally in relation to other objects in speaker-intrinsic coordinates. null Before presenting the algorithm we first have to clarify the terminology used. Analogous to [Dale and Reiter, 1995], we define the context set C to be the set of entities (physical objects in our scenario) that the hearer is currently assumed to be attending to. We also define the set of distractors D to be the set of entities from which the referent r has to be distinguished further on. At the beginning of the content selection process the distractor set D will be the context set C except the referent r; at the end D will be empty if content selection was successful. R represents the set of restricting properties found, each composed of an attribute-value pair.</Paragraph> <Paragraph position="1"> P represents the ordered list of properties which the algorithm gets as additional input. Based on observations in our data we assume that referring to objects by pointing is the first choice in face-to-face dialogues, while expressing relative location is only used after basic properties like type or colour. Therefore, we get absolut location, type, colour, size, and relative location to be the list of properties which have to be evaluated concerning their discriminatory power.</Paragraph> <Paragraph position="2"> The incremental content-selection in our algorithm (see Alg. 4.1) is organised in two main steps: First, see part (i), disambiguation of the referent by pointing is checked if the referent is visible for both participants. The decision, which kind of pointing, object-pointing or region-pointing, is appropriate is based on an evaluation of their discriminatory power. Object-pointing can only be used if the gesture is able to indicate the referent in an unambiguous manner.</Paragraph> <Paragraph position="3"> This is tested by generating a pointing cone with an apex angle of 12 degrees anchored in an approximated hand-position (covered in the functions GENERATEPOINTINGRAY(r) and GETPOINTINGMAP((vectorh,vectorr),C,a) with the apex angle a). If only the intended referent r is found inside this cone, the algorithm terminates and referring can be done by objectpointing. Otherwise, region-pointing is evaluated using the same functions to narrow down the distractor set D to the objects found in the cone, now with the wider apex angle b.</Paragraph> <Paragraph position="4"> For determining additional discriminating properties (see part (ii)) we use an adapted version of the incremental algorithm of Dale and Reiter described above. Each property p in P is evaluated concerning its discriminatory power. If it rules out some objects in D, these objects are deleted in D and p and its value v are added to R.</Paragraph> <Paragraph position="5"> On the one hand we extend the original algorithm accounting for properties which are expressed in relation to other objects in the scene. 
<Paragraph position="5"> We modify the original algorithm in two respects. On the one hand, we extend it to account for properties which are expressed in relation to other objects in the scene. On the other hand, our algorithm is simplified insofar as in our prototypical implementation the FINDBESTVALUE function defined by Dale and Reiter is replaced by the cheaper function GETVALUE. We realise the search for an appropriate value on a specialisation hierarchy only for the special case type (&quot;screw&quot; is used instead of &quot;pan head slotted screw&quot;). If an appropriate value for type does not exist (as is the case for some aggregates under construction in our domain), type is expressed in an unspecific manner like &quot;this part&quot;; the value v for the property type is then set to object, the most general value in the specialisation hierarchy. Analogously to [Dale and Reiter, 1995], type is added to R even if it has no discriminatory power. This complies with the most frequent kind of over-specification found in our empirical data.</Paragraph>
<Paragraph position="6"> For the other properties like colour we do not need such a sophisticated search on a specialisation hierarchy in our domain. We operate in a highly simplified domain with objects characterised by properties that have only a few, well-distinguished values perceivable by both dialogue participants.</Paragraph>
<Paragraph position="7"> For the property colour, e.g., only the values red, green, blue, yellow, purple, orange, and brown exist.</Paragraph>
<Paragraph position="8"> In the following we describe the realisation of the essential modifications proposed in our approach in greater detail: the evaluation of the discriminatory power of pointing and the consideration of relational properties.</Paragraph>
<Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Considering the Spatial Context: Object-pointing vs. Region-pointing </SectionTitle>
<Paragraph position="0"> If we assume that the spatial context of the interaction determines the discriminatory power of pointing, as described in Section 2, we have to anchor multimodal content selection in this context. The central concept for this task is the pointing cone. It models the region which is indicated by the pointing gesture. The objects inside the cone cannot be distinguished without further information.</Paragraph>
<Paragraph position="1"> In the course of our multimodal content-selection algorithm the generation of the pointing cone and the identification of the objects lying inside it are realised using the following functions: * REACHABLE?(r): Tests whether the referent r is visually available to both dialogue participants.</Paragraph>
<Paragraph position="2"> * GENERATEPOINTINGRAY(r): This function takes the referent r and computes a pointing ray, represented by two vectors: its origin vector h located in the demonstrating hand and its direction vector r determined by the referent r.</Paragraph>
<Paragraph position="3"> * GETPOINTINGMAP((h, r), C, a): This function (for details see Alg. 4.2) takes the pointing ray (h, r), a set of objects C, and an apex angle a, and returns a sorted list of the objects located inside the cone defined by (h, r) and a. The decision criterion is the apex angle a: if the vector originating in h and directed to o ∈ C spans an angle of less than a with the pointing ray, o is said to be located inside the cone, otherwise not.</Paragraph>
<Paragraph position="4"> * GETPOSITION(o, h): Computes the position of object o w.r.t.
the position represented by h, in this case the hand position.</Paragraph>
<Paragraph position="5"> * GETANGLE(x, y): Computes the angle between the vectors x and y.</Paragraph>
<Paragraph position="6"> * INSERT(o, M, a): Inserts the object o into the map M in increasing order w.r.t. the angle a.</Paragraph>
<Paragraph position="8"> In the course of evaluating pointing, it is first tested whether the referent is reachable for both participants. In our application domain this means that r is a visible object lying on the table, the construction area. If this is the case, pointing in general is appropriate, and the property location with the value ↘, indicating a pointing gesture, is added to the list of restricting properties R.</Paragraph>
<Paragraph position="9"> To decide whether object-pointing or region-pointing is appropriate, the pointing cones for these two kinds of pointing have to be generated. This is achieved by first generating the pointing ray using the function GENERATEPOINTINGRAY. To determine the origin of the pointing ray without synthesising a pointing gesture at this early point in time, an approximated hand position is computed, located at a typical distance in front of the body on a straight line between a point midway between the shoulders of the demonstrating agent and the referent r.</Paragraph>
<Paragraph position="10"> The pointing ray is used as input for the function GETPOINTINGMAP, which stores all objects inside the cone in a sorted map. First, this is done for a cone with the apex angle a, the cone for object-pointing. If this map contains at least one object besides the referent r, disambiguation based only on a pointing gesture is not possible. Region-pointing is then chosen to narrow down the set of distractors. Again the function GETPOINTINGMAP is used to determine the set of objects which are indicated by pointing, now by region-pointing. The wider apex angle b for the pointing cone of region-pointing is used to ensure robust reference. A sketch of the ray construction and the cone test is given below.</Paragraph>
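The following self-contained Python sketch illustrates one way to realise the ray construction and the cone test of Alg. 4.2. Only the geometry follows the description above; the 0.4 m hand offset and the representation of the scene as a name-to-position dictionary are our assumptions.

```python
import math

def sub(u, v):
    """Component-wise difference of two 3D points/vectors."""
    return tuple(a - b for a, b in zip(u, v))

def angle_deg(u, v):
    """GETANGLE: angle between two 3D vectors in degrees."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.dist(u, (0.0, 0.0, 0.0))
    norm_v = math.dist(v, (0.0, 0.0, 0.0))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (norm_u * norm_v)))))

def generate_pointing_ray(referent_pos, shoulder_pos, hand_offset=0.4):
    """GENERATEPOINTINGRAY: approximate the hand position at a typical distance
    in front of the body on the line from a point between the shoulders towards
    the referent (offset value assumed), and return (origin, direction)."""
    towards = sub(referent_pos, shoulder_pos)
    length = math.dist(referent_pos, shoulder_pos)
    unit = tuple(c / length for c in towards)
    h = tuple(s + hand_offset * c for s, c in zip(shoulder_pos, unit))
    return h, sub(referent_pos, h)

def get_pointing_map(ray, scene, apex):
    """GETPOINTINGMAP: return the objects inside the pointing cone, sorted by
    their angular distance from the ray. `scene` maps object names to positions."""
    h, direction = ray
    inside = []
    for name, pos in scene.items():
        a = angle_deg(sub(pos, h), direction)   # GETPOSITION followed by GETANGLE
        if a < apex:
            inside.append((a, name))            # INSERT keeps the map sorted by angle
    return [name for _, name in sorted(inside)]
```

Calling get_pointing_map(ray, scene, 12.0) corresponds to the object-pointing test; the same call with the wider apex angle yields the candidate set for region-pointing.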
</Section>
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Relational Object Properties </SectionTitle>
<Paragraph position="0"> In our corpus we often found properties which are typically expressed in relation to other objects. The most frequent examples concern the properties size and location, leading to descriptions like &quot;the big object&quot; or &quot;the left object&quot;. The function RELATIONALPROPERTY?(p) tests for each property p whether it can be expressed relationally. To evaluate these properties we use the function GETRELATIVEVALUE. This function (see Alg. 4.3) compares the absolute value of the referent's property p with the corresponding values of the objects in D. If the referent r holds the maximum or minimum of these values, the function returns the corresponding max or min value, e.g., big or small if the property is size. To do so, GETRELATIVEVALUE needs a partial order for each property. In our system this is implemented for size and relative location.</Paragraph>
<Paragraph position="1"> In the case of size we relate the property to the shape of the objects under discussion. Shape is a property often used on its own if the type of an object is unknown, but it is difficult to handle in generation because the description of shape, especially for complex shapes, is highly ambiguous and subjective. However, in our corpus data aspects of shape can often be found as part of descriptions of size. This occurs when the shape of an object is characterised by one or two designated dimensions. For these objects size is substituted by, e.g., length or thickness (&quot;long screw&quot; is used instead of &quot;big screw&quot;).</Paragraph>
<Paragraph position="2"> In the case of relative location we use a similar kind of substitution. The relative location is evaluated along the axes defining the subjective coordinate systems of the dialogue participants (left-right, ahead-behind, and top-down). E.g., GETRELATIVEVALUE returns left if the referent r is the leftmost object in D ∪ {r}.</Paragraph>
<Paragraph position="3"> The function GETVALUE(o, p) returns the absolute value v of the property p of the object o, fetched from the knowledge base. The search for an appropriate value on a specialisation hierarchy for the property type, as described above, is realised within this function. A sketch of the relational evaluation is given below.</Paragraph>
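The following Python sketch illustrates GETRELATIVEVALUE together with the substitution of size by a designated dimension. The term table, the designated-dimension argument, the left-right axis convention, and the get_value callback are stand-ins for the system's knowledge base, not its actual interfaces.

```python
RELATIONAL_TERMS = {                 # (term if r is maximal, term if r is minimal)
    "size":      ("big", "small"),
    "length":    ("long", "short"),
    "thickness": ("thick", "thin"),
    "x":         ("right", "left"),  # assuming x grows to the speaker's right
}

def get_relative_value(r, p, D, get_value, designated_dim=None):
    """Return a relational value for property p if the referent r is the
    unique maximum or minimum among D and r, otherwise None (cf. Alg. 4.3)."""
    if p == "size" and designated_dim:
        p = designated_dim            # e.g. bars have one designated dimension,
                                      # so "long bar" is preferred over "big bar"
    values = {o: get_value(o, p) for o in set(D) | {r}}
    term_max, term_min = RELATIONAL_TERMS[p]
    if all(values[r] > values[o] for o in D):
        return term_max               # e.g. "long", or "right"
    if all(values[r] < values[o] for o in D):
        return term_min               # e.g. "short", or "left"
    return None
```

With designated_dim="length" this returns "long" for the referent of the example in the next section, provided it is the longest object among the remaining candidates.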
</Section> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Example </SectionTitle>
<Paragraph position="0"> The following example illustrates the process of content selection as it is realised by the described algorithm (Fig. 2): The starting point is a query concerning the reference to a specific object with the technical name five-hole-bar-0 (Fig. 2a). This object lies on the table and is visible to both dialogue participants, therefore pointing is appropriate and the property location with the value ↘, indicating a pointing gesture, is added to R. Now it has to be decided which kind of pointing is appropriate (Alg. 4.1, part (i)), i.e., whether pointing alone (object-pointing) yields the referent in an unambiguous manner. To do so, the pointing cone for object-pointing is generated. In this example the object density is high and more than one object is found inside this cone. Therefore, pointing alone does not yield the referent and region-pointing is evaluated next. This is illustrated schematically in Fig. 2b: The two ellipses mark the intersections of the pointing cones with the table, the smaller ellipse for object-pointing, the bigger one for region-pointing. The smaller ellipse covers two objects, which means that pointing alone cannot distinguish between them; an additional definite description is needed. Region-pointing is used to narrow down the set of distractors C for the construction of the definite description.</Paragraph>
<Paragraph position="2"> To make the multimodal reference consisting of pointing and definite description more robust (in analogy to the empirical findings), a wider apex angle is now used, resulting in the bigger ellipse. The objects inside this bigger ellipse, the two bars five-hole-bar-0 and three-hole-bar-0, a block, a screw, and a disc, constitute the distractor set.</Paragraph>
<Paragraph position="3"> The second part of the algorithm determines the properties needed for the definite description. It starts by testing the property type. The type five-hole-bar is too specific, so the super-type bar is chosen. This property rules out all objects except the two bars (now C = {five-hole-bar-0, three-hole-bar-0}) and type with the value bar is added to R. The property colour is tested next; it has no discriminatory power concerning the two bars. But the following property size discriminates the two objects. The shape of bars is characterised by one designated dimension. Therefore, size is substituted by length. In our case the referent r has the maximum length of all objects in C, so the property length with the value long is added to R. Now C contains only r; the algorithm terminates and returns R = {(location, ↘), (type, bar), (length, long)} (Fig. 2b).</Paragraph>
<Paragraph position="4"> Based on R, a pointing gesture directed to r is specified, the noun phrase &quot;die lange Leiste&quot; (the long bar) is generated, and both are inserted into an utterance template (see Fig. 2c). The complete utterance is synthesised and uttered by the agent Max (Fig. 2d).</Paragraph>
<Paragraph position="5"> 6 Application in the Context of Human-Computer Interaction in VR
As explained in the introduction, the described approach was developed in the context of research on interfaces for natural interaction with an anthropomorphic agent in VR. The embodied agent Max should be enabled to produce believable deictic references to virtual objects in real-time interaction. Following [Dale and Reiter, 2000], the generation of natural language can be divided into three main steps, namely macroplanning (document planning), microplanning, and surface realisation. Extending this, we add synthesis as a fourth step, including motor planning and visualisation for gestural utterances and text-to-speech synthesis for verbal utterances.</Paragraph>
<Paragraph position="6"> Content selection for complex demonstrations is part of microplanning. The starting point is a logical representation of the performative of a planned utterance (as illustrated in the example above, see Fig. 2a), which in future work will be provided as a result of the reasoning processes of the agent.</Paragraph>
<Paragraph position="7"> The results of the content selection, represented as a list of attribute-value pairs, are fed into a surface realisation module that generates a syntactically correct noun phrase. This noun phrase is combined with a gesture specification, and both are inserted into a template of a multimodal utterance fetched from a database and described in MURML [Kranstedt et al., 2002] (see Fig. 2c for illustration). MURML enables the specification of arbitrary co-verbal gestures. Cross-modal synchrony is established by appending the gesture stroke to the affiliated word or sub-phrase in the co-expressive speech.</Paragraph>
<Paragraph position="8"> [Figure 2: ...demonstration in four steps: a) a query concerning the object five-hole-bar-0 constitutes the starting point; b) pointing cones for object-pointing and region-pointing are generated, the latter specifying the distractor set for further property evaluation; c) the pointing gesture and the noun phrase are inserted into an utterance description template in MURML; d) an appropriate animation (German speech, here with the visualised pointing cone) is synthesised.]</Paragraph>
<Paragraph position="9"> Based on these descriptions, an utterance generator synthesises continuous speech and gesture in a synchronised manner (for details see [Kopp and Wachsmuth, 2004]).</Paragraph>
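The following Python fragment schematically shows how the selected content of the example is combined with an utterance template and how the gesture is tied to its affiliated phrase. It only illustrates the data flow described above; it is not MURML, and the template text and field names are invented for this illustration.

```python
# Content selected for the example of Section 5.
R = [("location", "↘"), ("type", "bar"), ("length", "long")]

def build_utterance(R, referent_id, noun_phrase, template="Nimm {np}!"):
    """Insert the realised noun phrase and, if selected, a pointing gesture into
    a (hypothetical) utterance template. The gesture's affiliate names the phrase
    with which the stroke has to be synchronised."""
    gesture = None
    if any(attribute == "location" for attribute, _ in R):
        gesture = {"type": "deictic", "target": referent_id, "affiliate": noun_phrase}
    return {"speech": template.format(np=noun_phrase), "gesture": gesture}

spec = build_utterance(R, "five-hole-bar-0", "die lange Leiste")
# -> German speech "Nimm die lange Leiste!" ("Take the long bar!") with a pointing
#    gesture whose stroke is synchronised with the affiliated noun phrase.
```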
<Paragraph position="10"> The VR environment in which the interaction takes place is realised using the framework Avango [Tramberend, 1999], which is based on the common scenegraph representation of virtual worlds. With PrOSA (Patterns On Sequences of Attributes) [Latoschik, 2001], this framework was extended for interacting in immersive virtual reality by means of speech and gesture. The scenegraph is not only used to model the environment; it also constitutes the agent's knowledge base about its environment. Each object represented in the scenegraph can be correlated with a so-called semantic entity [Latoschik and Schilling, 2003], which provides arbitrary semantic properties associated with this entity. During content selection, the property values of the objects under discussion are fetched from these semantic entities (a schematic lookup is sketched at the end of this section).</Paragraph>
<Paragraph position="11"> The vocabulary used is geared to the ontology of the toy kit, called Baufix, that we use in our setting. It consists of a small number of generic parts like bars, screws, blocks, discs, etc. (twelve different types, some of them in different sizes and colours). All the parts and the values of their properties can be named. Therefore, all possible descriptions in this small domain can be generated. Currently, deictic expressions can be generated as part of different types of speech acts, especially query, request, and inform. Only a small number of verb phrases can be used. In sum, the vocabulary currently available is very small. However, the focus of this work is not to generate a large amount of speech output but to investigate the correlation between speech and gesture in the generation of multimodal reference.</Paragraph>
<Paragraph position="12"> Up to now we can generate, in the course of deictic expressions, pointing gestures synchronised with speech for all objects reachable for the agent without moving. In most cases moving is not necessary, or is more costly than generating a definite description. But we know that this is not adequate in all cases. The integration of moving into content selection will be an issue for future work.</Paragraph>
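As an illustration of the lookup mentioned above, the following sketch shows how GETVALUE might fetch property values from semantic entities attached to scene-graph nodes, including the specialisation-hierarchy fallback for type described in Section 4. Class names, fields, and the hierarchy table are hypothetical and do not reflect the actual Avango/PrOSA interfaces.

```python
class SemanticEntity:
    """Arbitrary semantic properties associated with a scene-graph object."""
    def __init__(self, properties):
        self.properties = dict(properties)   # e.g. {"type": "five-hole-bar", "colour": "yellow"}

class SceneNode:
    def __init__(self, name, position, entity):
        self.name = name                     # technical name, e.g. "five-hole-bar-0"
        self.position = position             # used for the pointing-cone test
        self.entity = entity

# Assumed fragment of the specialisation hierarchy for the property type.
SUPER_TYPE = {"five-hole-bar": "bar", "three-hole-bar": "bar",
              "pan head slotted screw": "screw"}

def get_value(node, p):
    """GETVALUE: return the absolute value of property p; for type, look up an
    appropriate (more generic) value in the specialisation hierarchy and fall
    back to the most general value `object` if no type value exists."""
    v = node.entity.properties.get(p)
    if p == "type":
        return SUPER_TYPE.get(v, v) if v is not None else "object"
    return v
```

For the bar of the example, get_value(node, "type") yields bar, the super-type actually used in the generated description.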
</Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Conclusion </SectionTitle>
<Paragraph position="0"> In this paper an approach was presented which enables the generation of multimodal deictic expressions consisting of a pointing gesture indicating the location of an object and a definite noun phrase describing the object by its properties.</Paragraph>
<Paragraph position="1"> Taking account of the inherent imprecision of pointing gestures, two referential functions of pointing are distinguished: object-pointing and region-pointing. With increasing distance between the demonstrating agent and the referent, the discriminatory power of the gesture decreases and more additional properties are needed to identify the referent. A pointing cone for each referential function of pointing gestures was defined to model the distance dependency of pointing. An algorithm was presented that integrates pointing and definite descriptions by using the objects highlighted by the gesture as the distractor set for the construction of the definite description.</Paragraph>
<Paragraph position="2"> By drawing attention to a spatial region and the objects lying inside it, region-pointing ensures that these objects are in the focus of attention of the addressee ([Dale and Reiter, 1995] speak in this context of a navigational function of the expression).</Paragraph>
<Paragraph position="3"> Dale and Reiter emphasise that their content-selection algorithm is defined domain-independently, while the property list P and the functions MORESPECIFICVALUE, BASICLEVELVALUE, and USERKNOWS define the interface to the domain of application, especially to the knowledge about this domain shared by the interlocutors. Analogously, the functions REACHABLE?, GENERATEPOINTINGRAY, and GETPOINTINGMAP in our approach can be seen as a link between the content-selection algorithm and the spatial context in which the interaction takes place. Implementing the concept of the pointing cone, they provide an interface between the geometrical aspects of pointing gestures and their referential semantics.</Paragraph>
<Paragraph position="4"> The quality of the generation results obtained with the described approach depends on the precision of the topology of the pointing cones and on the knowledge about the parameters influencing this topology. We have started to conduct empirical studies using tracking technology to collect analytical data on the pointing behaviour of human subjects in varying pointing domains [Kranstedt et al., 2005].</Paragraph>
<Paragraph position="5"> Up to now, we do not have a comprehensive evaluation of our approach. But if we compare the generation results with the empirical data collected in the demonstration games mentioned in Sec. 1 and with other corpora of instructor-constructor dialogues in the Baufix world [Sagerer et al., 1994], we notice a good correspondence with the empirical findings. A critical point we found in these comparisons is that the perceivable resolution of pointing in the real world is not exactly the same as in VR; in the latter it depends heavily on the kind and quality of the display technology used. Therefore, mechanisms which adapt the pointing cone's size and shape to the constraints of the interaction environment seem to be useful.</Paragraph>
</Section> </Paper>