<?xml version="1.0" standalone="yes"?> <Paper uid="C88-2122"> <Title>Generating Multimodal Output - Conditions, Advantages and Problems</Title> <Section position="4" start_page="584" end_page="587" type="metho"> <SectionTitle> 2. Deixis in natural communication situations </SectionTitle> <Paragraph position="0"> Deictic reference occurs in dialogs as well as in texts. In both situations, the objects referred to can be linguistic entities (sentences, chapters etc.) or non-linguistic objects (cats, tables etc.). For the following considerations, only those types of deixis are relevant which specify non-linguistic entities. They can be performed by combining linguistic expressions with extralinguistic devices.</Paragraph> <Paragraph position="1"> Dialogs are characterized by the possibility of turn-taking. If both participants are present, they can specify elements of their common visual world by combining deictic expressions and body movements, mainly pointing gestures. If a speaker can point to objects, s/he can use shorter, simpler and even referentially insufficient descriptions. In particular, pointing facilitates reference if the speaker doesn't know how to describe the object in question. One example is the utterance 'THIS [☞] is broken.' while pointing at some part of the engine of one's car 2).</Paragraph> <Paragraph position="3"> Successful reference by pointing has some preconditions, for instance the receiver's visual attention. S/he has to face the speaker in order to notice his/her gesture and then has to follow this gesture with his/her gaze. The first step can fail through visual inattentiveness, the latter through a wrong direction of gaze. Feedback is received by speakers via two channels. On the one hand, a speaker monitors the nonverbal reaction of the receiver and can therefore immediately request attentiveness or correct a wrong direction of gaze. On the other hand, s/he gets delayed feedback by the verbal reaction.</Paragraph> <Paragraph position="4"> Communication by text normally implies a spatial and temporal dissociation of sender (= writer) and receiver (= reader). Therefore, the sender can deictically refer only to non-linguistic entities which are also visible to the receiver. This condition is fulfilled if the text is combined with non-linguistic representations (pictures, diagrams, maps etc.). In these cases, the sender can refer to elements of this 'visual context' by combining linguistic expressions and extralinguistic means (arrows, indices etc.). The latter represent a functional equivalent to pointing gestures within dialogs and have the same advantages. But, like the text itself, they don't require attentiveness on the reader's side during the period of their production.</Paragraph> <Paragraph position="5"> 3. Deixis in NL dialog systems
The type of dialog considered here is a consultation dialog: the system (= expert) assists the user (= non-specialist) in filling out his/her tax form. The system not only has more expert knowledge about the domain, but also more knowledge concerning the content and structure of the graphics displayed on the screen.</Paragraph> <Paragraph position="6"> Due to these differences in knowledge, the analysis component has to deal with shortcomings in the user's input. His/her pointing gestures may be imprecise because s/he doesn't know the structure of the presented graphics. Ignorance of technical terms results in inadequate descriptions.
In these cases, additional knowledge sources are needed for referent identification, e.g. case frame analysis and the dialog memory (/Allgayer, Reddig 86/, /Allgayer et al. 88/). In contrast, the generation component can always produce precise pointing gestures as well as exact descriptions. But the latter capability may be in conflict with the task of generating system reactions which are communicatively adequate. If the user doesn't know certain technical terms, then the combination of an underspecified description and a precise gesture is more comprehensible than a totally specified description.</Paragraph> <Paragraph position="7"> 2) Pointing gestures are represented by the sign '[☞]'. Capital letters highlight the correlated phrase.</Paragraph> <Paragraph position="8"> Another problem is the different perceptual capabilities of user and system. Humans are 'multichannel systems' which receive information about objects through a great variety of channels. In contrast, the perceptible world of all systems developed to date is only a small subset of the user's world. Normally, systems with more general application domains are only able to process textual and graphical input. In particular, these systems cannot &quot;see&quot; the user's nonverbal behavior and therefore cannot request attention if necessary. Also, wrong user reactions cannot serve as an indication of his/her visual inattentiveness, because they can be caused by several other factors.</Paragraph> <Paragraph position="9"> For example, it might be the case that the user has correctly identified the field in question but enters a wrong amount because s/he has confused some technical terms. During natural pointing, the sound which occurs when the speaker touches the form may cause the hearer to pay attention to his/her gestures. But in the case of simulated pointing, the generation of a specific audible signal in parallel to each pointing gesture implies a rather &quot;unnatural&quot; situation.</Paragraph> <Paragraph position="10"> The design of multimodal interfaces is one central topic of recent research. It has to be emphasized that the term 'multimodal input/output' covers a great variety of heterogeneous phenomena, from the manipulation of simulated objects within an &quot;artificial reality&quot; (e.g. the DataGlove, see /Zimmerman et al. 87/) to the use of different pointing devices.</Paragraph> <Paragraph position="11"> The goal of 'multimodal referent specification' can be achieved by various strategies. If one wants to simulate natural pointing, the pointing device should correspond to natural gestures. A touch-sensitive screen allows highly natural gestures, but pointing by means of a so-called 'mouse cursor' can also simulate some aspects of natural pointing. The latter strategy is chosen in the XTRA system. If, in contrast, one wants to offer functional equivalents, there exists a great variety of devices. It is possible to adapt the extralinguistic deictic means which occur in texts, e.g. arrows and indices. Furthermore, the computer offers several specific devices which have no model in natural pointing, such as framing, highlighting or inverting the referent. The choice depends on several factors, for example which types of objects are to be referred to.</Paragraph>
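To make the interplay of precise gestures and underspecified descriptions concrete, the following is a minimal sketch of a referent identification step that combines both information sources. It is an illustration only: the Region class, the proximity measure and the function names are invented here, and XTRA's actual analysis additionally draws on case frame analysis and the dialog memory (/Allgayer, Reddig 86/).

from dataclasses import dataclass

@dataclass
class Region:
    name: str   # e.g. "donations"
    kind: str   # e.g. "field", "entry", "row"
    x: float
    y: float
    w: float
    h: float

def distance(r, px, py):
    # distance from the pointing position to the region's center
    cx = r.x + r.w / 2
    cy = r.y + r.h / 2
    return ((cx - px) ** 2 + (cy - py) ** 2) ** 0.5

def identify_referent(regions, px, py, described_kind=None):
    # keep only regions compatible with the (possibly underspecified)
    # description, then let the gesture choose among them by proximity
    candidates = [r for r in regions
                  if described_kind is None or r.kind == described_kind]
    return min(candidates, key=lambda r: distance(r, px, py), default=None)

Under this scheme a precise gesture compensates for a vague description ('THIS [☞] is broken'), while an exact description can compensate for an imprecise gesture.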
<Paragraph position="12"> 4. Form deixis in XTRA
The given visual context of the XTRA system is the form displayed on the screen. In order to specify its elements, several types of pointing actions occur (cf. /Allgayer 86/, /Schmauks 86a, 86b, 87/):
* Punctual pointing indicates one single point on the form and can be produced in order to specify primitive objects, i.e. individual regions and individual entries. Another possibility is the reference to a complex region by pointing to a part of it (pars-pro-toto deixis).</Paragraph> <Paragraph position="14"> * During non-punctual pointing, the pointing device performs a complex motion, e.g. underlines an entry or traces the borders of a larger region.</Paragraph> <Paragraph position="15"> * Multiple pointing means that one utterance is accompanied by more than one pointing gesture. These complex pointing actions specify elements of sets, for example several instances of one concept.</Paragraph> <Paragraph position="16"> One aim of XTRA is the use of multimodal referent specification techniques in input as well as in output. Multimodal input is performed by combining typed NL descriptions and simulated pointing gestures.</Paragraph> <Paragraph position="17"> The latter are currently realized by means of a mouse cursor. They simulate natural pointing with regard to two aspects: the user can select the accuracy of the gesture, and the relation between the gesture and the object referred to depends on context /Allgayer 86/. For example, if the user points at a region which is already filled out, descriptor analysis is necessary in order to decide whether s/he refers to the region itself or to its actual entry.</Paragraph> <Paragraph position="18"> The generation component has to reckon with different problems concerning pointing actions. If it also realizes gestures by movements of a mouse cursor, their perception may be hampered by the user's visual inattentiveness. In the case of multiple pointing, for example, s/he might fail to notice one of the pointing gestures and consequently may not identify the referent. This causes the whole utterance (e.g.</Paragraph> <Section position="1" start_page="584" end_page="584" type="sub_section"> <SectionTitle> 5.1 Architecture of POPEL </SectionTitle> <Paragraph position="0"> The task of POPEL, the natural language generation component of XTRA, is to select and verbalize those parts of the conceptual knowledge base that are to be uttered. The structure of the component follows the well-known division into a &quot;what-to-say&quot; and a &quot;how-to-say&quot; part /McKeown 85/: POPEL-WHAT, which selects the content, and POPEL-HOW, which verbalizes it (cf. /Reithinger 87b/). Contrary to most other systems, the information flow between these two sub-modules is not unidirectional from the selection part to the verbalization part. Rather, both parts communicate while processing the output of the system (cf. /How 87/).</Paragraph> <Paragraph position="1"> A second essential feature of POPEL's architecture is the parallel processing approach to generation: the different stages of selecting and realizing the output proceed in a parallel cascade. In this way, it is possible to go ahead with the selection processes inside POPEL-WHAT while a previously selected part of the utterance is already being verbalized in POPEL-HOW. As one consequence, restrictions on the selection arising out of the verbalization process can be taken into account.</Paragraph>
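As a rough illustration of this bidirectional information flow, the following sketch lets two threads stand in for POPEL-WHAT and POPEL-HOW, exchanging increments and constraints over two queues. All message names are invented for the example; the real system is a multi-stage cascade implemented in Zetalisp, not two Python threads.

import queue
import threading

to_how = queue.Queue()   # selected content flowing "down" to verbalization
to_what = queue.Queue()  # verbalization constraints flowing back "up"

def popel_what(content_plan):
    # selection part: hands over one increment at a time and reacts
    # to constraints reported by the verbalization part
    for item in content_plan:
        to_how.put(item)
        if to_what.get() == "no-pronoun-possible":
            to_how.put("full-description:" + item)
            to_what.get()  # consume the feedback for the repair, too
    to_how.put(None)       # end of utterance

def popel_how():
    # verbalization part: realizes increments as they arrive and
    # reports restrictions back to the selection part
    while (item := to_how.get()) is not None:
        print("verbalized:", item)
        to_what.put("no-pronoun-possible" if item.startswith("pronoun")
                    else "ok")

how_thread = threading.Thread(target=popel_how)
how_thread.start()
popel_what(["greeting", "pronoun:amount", "target-field"])
how_thread.join()

Here the selection part learns, while it is still selecting, that a pronominal realization is impossible (compare the German word-order constraint cited in section 5.2.1) and can repair its choice immediately.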
<Paragraph position="2"> Currently, a first prototype of POPEL is under development. The processor for the parallel cascade has already been implemented. The emphasis was placed on information propagation both upwards and downwards and on the definition of the syntax and semantics of the transition rules. The next step will be the encoding of knowledge within this framework. POPEL is implemented on a Symbolics 3640 Lisp machine running Zetalisp.</Paragraph> </Section> <Section position="2" start_page="584" end_page="587" type="sub_section"> <SectionTitle> 5.2 Pointing gestures as special cases of descriptions
5.2.1 Selection of descriptions </SectionTitle> <Paragraph position="0"> Selection of descriptions is one essential interaction point between the two components. Decisions which concern POPEL-WHAT are:
* &quot;Givenness&quot; of an object: the description of an object depends on whether that object is known in the (implicit or explicit) context of the user. In general, POPEL-HOW selects definite phrases for known objects and indefinite phrases for unknown objects, but the required knowledge as to &quot;givenness&quot; is stored in the user model, which is accessed by POPEL-WHAT.</Paragraph> <Paragraph position="1"> * &quot;Pointability&quot; of an object: the so-called 'form hierarchy' represents the structure of the form. It links the regions of the form to the respective representations in the conceptual knowledge. If an object is selected for verbalization, the link from the concept of the object to the form hierarchy provides the information that a pointing gesture can be generated.</Paragraph> <Paragraph position="3"> * Situation-dependency of a description: the contextual knowledge bases contain the structure and content of the previous dialog. They allow the determination of differently detailed descriptions, depending on the current context. If necessary, meta-communicative or text-deictic attributes can be added.</Paragraph> <Paragraph position="4"> POPEL-HOW makes the following decisions:
* Generation of a description: whether an object in the conceptual knowledge base is to be realized as a description depends on the language-related structure that has already been determined.</Paragraph> <Paragraph position="5"> * Language-dependent constraints: the possible surface structures remaining for a description depend on the extent to which the sentence has already been verbalized. In German, for instance, it is hardly possible to generate a pronominal NP if there is already a lexical NP or PP after the finite verb and the pronominal NP is to follow this phrase (cf. /Engel 82/).</Paragraph> <Paragraph position="6"> The sequence of these decisions is intertwined. For example, the inquiry of POPEL-WHAT as to whether an object is available in the context makes sense only after POPEL-HOW has decided to generate a description at all (for an outline see /Reithinger 87a/).</Paragraph>
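A minimal sketch of how the POPEL-WHAT decisions above could be combined; the sets and dicts below are invented stand-ins for the user model, the form hierarchy and the contextual knowledge bases:

def plan_description(obj, known_objects, form_regions, recent_mentions):
    # "givenness": known objects get definite phrases
    determiner = "definite" if obj in known_objects else "indefinite"
    # "pointability": a link into the form hierarchy licenses a gesture
    region = form_regions.get(obj)
    pointable = region is not None
    # situation-dependency: detail level depends on the dialog context
    detail = "short" if obj in recent_mentions else "full"
    return {"determiner": determiner, "pointable": pointable,
            "region": region, "detail": detail}

print(plan_description("donations",
                       known_objects={"donations"},
                       form_regions={"donations": (120, 340)},
                       recent_mentions=set()))

The sketch deliberately omits the POPEL-HOW side: as the text notes, these inquiries only make sense once the verbalization part has decided to generate a description at all.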
<Paragraph position="7"> From the viewpoint of an NL dialog system, pointing actions are descriptions which are accompanied by a pointing gesture. They focus the user's visual attention and can therefore localize visible objects. In the XTRA domain, pointing actions can refer to three types of objects:
* A form region, e.g. 'You can enter your donations HERE [☞].'
* An entry, e.g. 'THESE 350 DM [☞] are travel expenses.'
* A correlated concept, e.g. 'Can I deduct SUCH DONATIONS [☞]?'
All elements of the form are in the shared visual context; therefore, they can be referred to by definite descriptions. No serious problems arise if an utterance is accompanied by only one pointing gesture. In contrast, the simulation of multiple pointing requires further considerations (cf. section 4) and is therefore not treated in this paper. If the system's reaction contains more than one description which allows pointing, only one possibility will be realized. The others are reduced to purely verbal descriptions. Sentence (1), for example, allows the reductions (1a) and (1b):
(1) THIS AMOUNT [☞], you have to enter HERE [☞].
(1a) The donations of 15 DM, you have to enter HERE [☞].
(1b) THIS AMOUNT [☞], you have to enter in the line &quot;donations&quot;.</Paragraph> <Paragraph position="8"> Because sentence generation is performed incrementally, POPEL-WHAT doesn't know the whole content of the utterance at the moment it has to decide whether to use a pointing gesture or not. Therefore, the decisions have to be based on heuristics and may be &quot;suboptimal&quot;. One of these heuristics is: do not use a pointing gesture if the object in question can also be specified by a short referential expression, for example a pro-word. Then the pointing gesture remains available to reduce a complex description if one follows in the same utterance. Following the simulation-oriented strategy of XTRA, pointing gestures are realized by positioning a mouse cursor on the screen. This is a close approximation of the type of movements a human performs when pointing with his/her finger. Furthermore, different degrees of accuracy are simulated by different shapes of the cursor. POPEL performs the pointing gesture in parallel with verbalizing the correlated phrase and presenting it on the screen.</Paragraph> <Paragraph position="9"> 5.2.3.1 Punctual pointing gestures
During a punctual pointing gesture, the cursor doesn't move on the form. This type of gesture is used both for the localization of primitive objects and for pars-pro-toto deixis. Because a gesture can refer either to a field of the form or to its content (i.e. a string in our domain), the linguistic information (e.g. 'this field' vs. 'this amount of money') has to disambiguate between these possibilities. A hand which holds a pencil is used as the symbol for this type of gesture (see figure 1/symbol A). The exact position depends on the type of the object. The default strategy is as follows: if the pointing action refers to a field, the pencil is in the middle of the field; if it refers to an entry, the pencil is below the entry, so that the symbol doesn't cover it. Additionally, the user model takes effect: if the user has repeatedly requested another position of the gesture (e.g. 'Take away the finger, I cannot read that!'), the pointing strategy has to be changed.</Paragraph> <Paragraph position="10"> Each time the speaker-hearer roles are reversed, the current pointing symbol changes to a neutral symbol (i.e. the standard mouse cursor). In this way, the user's visual attention doesn't remain fixed on the location of the last pointing gesture. If the system generates a new pointing gesture, it first changes the neutral symbol into the chosen pointing symbol. Then it moves the symbol to the new pointing location. This method mimics the function of the hand movements during natural pointing, which already direct the hearer's visual attention to the target location.</Paragraph>
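The placement and symbol-switching behavior just described can be summarized in a short sketch; the dictionary representation of regions and cursor state, the pixel offset and the symbol names are assumptions made for the example (the actual symbols are those of figure 1):

def pointing_position(region, referent_type):
    # default strategy: center of a field, just below an entry so that
    # the pencil symbol does not cover the entry itself
    x = region["x"] + region["w"] / 2
    if referent_type == "field":
        y = region["y"] + region["h"] / 2
    else:  # "entry"
        y = region["y"] + region["h"] + 4  # small offset below the entry
    return x, y

def on_turn_change(cursor):
    # reverting to the neutral symbol keeps the user's attention from
    # staying fixed on the last pointing location
    cursor["symbol"] = "standard"

def point_at(cursor, region, referent_type, user_model=None):
    # the user model may override the default placement, e.g. after
    # 'Take away the finger, I cannot read that!'
    if user_model and user_model.get("wants-offset-position"):
        referent_type = "entry"
    cursor["symbol"] = "pencil"  # first change the symbol...
    cursor["pos"] = pointing_position(region, referent_type)  # ...then move

cursor = {"symbol": "standard", "pos": (0, 0)}
point_at(cursor, {"x": 100, "y": 200, "w": 80, "h": 20}, "field")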
<Paragraph position="11"> Furthermore, punctual pointing gestures are used to realize pars-pro-toto deixis, which refers to greater parts of the form. In this case too, the ambiguity of the gesture has to be compensated by linguistic information. In our domain, unambiguous descriptions are 'row' and 'column'. Ambiguous expressions like 'region' can be disambiguated by additionally naming the referent, e.g. 'the region of DEDUCTIBLES'. Delayed perception of a punctual pointing gesture doesn't hamper referent identification. The pointing symbol changes only when the user takes the initiative in the dialog again. Until then, the information of the gesture remains visually available. There exists an equivalent in natural pointing: it might happen that a speaker leaves his/her forefinger extended until the dialog partner recognizes the gesture.</Paragraph> <Paragraph position="12"> 5.2.3.2 Non-punctual pointing gestures
Non-punctual pointing, for example the encircling of a whole area, poses much greater problems. After the movement of the cursor has ceased, the actual cursor position indicates only the final point of the gesture. If the user was inattentive, s/he cannot reconstruct the course of the movement. This loss of information can be partially avoided by providing exact descriptions.</Paragraph> <Paragraph position="13"> Standard candidates for non-punctual pointing actions are composite objects, for example rows, columns or larger regions. However, a non-punctual pointing gesture that has not been noticed does not deliver any more information than the combination of punctual pars-pro-toto deixis and an exact linguistic description.</Paragraph> <Paragraph position="14"> Non-punctual pointing gestures can be realized by various means. In a first release of POPEL, the gesture is performed with another symbol (a hand with stretched-out forefinger, see figure 1/symbol B). The movement should be both &quot;natural&quot; and relatively precise. Further research has to evaluate POPEL's current strategy with respect to various features, for example the efficiency of the pointing strategy and its acceptance by the user.</Paragraph> <Paragraph position="15"> 6. Alternative concepts for multimodal input/output and future requirements
In the case of non-punctual and multiple pointing actions, the possible inattentiveness of the user and the current &quot;blindness&quot; of the system may lead to a loss of information. This danger increases with the temporal complexity of the gestures. The usage of &quot;lasting&quot; pointing techniques would be one possibility of dealing with this problem. One strategy is to &quot;freeze&quot; the track of non-punctual pointing gestures, which is similar to underlining or encircling with a pencil. The track remains visible on the form until the next change in dialog control. One can imagine two variants of this strategy (sketched below): the first is the successive drawing of the line, which is similar to a human-made gesture; the drawing speed could also be adopted from natural drawing. The second variant is to produce the whole line simultaneously. But this freezing method has the essential shortcoming that the additional lines muddle the screen. Therefore, the functionally similar but &quot;unnatural&quot; means of referent specification (framing, underlaying, blinking, inverting etc.) seem to be more advantageous. They preserve the form's structure, since it is not blurred by additional lines.</Paragraph> <Paragraph position="16"> Furthermore, these methods specify form regions, i.e. rectangular objects, more exactly than circular lines. On the other hand, however, this framing approach cannot simulate the context-dependency of natural pointing.</Paragraph>
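The two freezing variants might be sketched roughly as follows; the rendering stub, the timing and the data structures are assumptions made for illustration, not part of any implemented system:

import time

def render(track):
    # stand-in for the actual form display; here we just report the stroke
    print("drawing track with", len(track), "points")

frozen_tracks = []  # gesture tracks currently "frozen" onto the form

def freeze_gesture(track, successive=True, delay=0.02):
    # variant 1: draw the line point by point, like a human stroke,
    # at something approximating natural drawing speed;
    # variant 2 (successive=False): produce the whole line at once
    if successive:
        for i in range(1, len(track) + 1):
            render(track[:i])
            time.sleep(delay)
    else:
        render(track)
    frozen_tracks.append(track)  # the track stays visible...

def on_dialog_control_change():
    frozen_tracks.clear()        # ...until the next change in dialog control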
<Paragraph position="17"> One unsolved problem remains to be emphasized: all the aforementioned methods alone cannot solve the problems of multiple pointing. If the sequence of the gestures must be known in order to understand the utterance, the frames etc. have to be combined with additional means. One solution could be the adaptation of methods used in texts to refer to elements of graphics (e.g. indices, cf. section 2). A highly user-adapted generation of pointing actions would require the storage of information about pointing in the user model. On the one hand, this comprises facts about the user's pointing behavior, including the frequency and accuracy of gestures and possible systematic deviations (e.g. pointing consistently beside or below the intended referent). On the other hand, the generation component has to take into account the user's reactions to the system's pointing actions. If s/he repeatedly misunderstands such an action, the system has to modify its pointing strategy and switch to the fixation method or to the framing approach, for example.</Paragraph>
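Such an adaptation rule could be sketched as follows; the counter, the threshold and the strategy names are assumptions made for this example, not part of the described system:

class PointingUserModel:
    # hypothetical user-model fragment for the system's pointing strategy
    def __init__(self, threshold=2):
        self.misunderstood = 0           # consecutive failed references
        self.threshold = threshold
        self.strategy = "simulated-gesture"

    def record_reaction(self, understood):
        if understood:
            self.misunderstood = 0       # a successful reference resets the count
        else:
            self.misunderstood += 1
            if self.misunderstood >= self.threshold:
                # repeated misunderstandings: fall back to a lasting,
                # "unnatural" technique such as framing
                self.strategy = "framing"

</Section> </Section> </Paper>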