<?xml version="1.0" standalone="yes"?> <Paper uid="C94-2133"> <Title>Multi-Modal Definite Clause Grammar</Title> <Section position="3" start_page="0" end_page="832" type="metho"> <SectionTitle> 2 Multi-Modal Input Processing </SectionTitle> <Paragraph position="0"> Consider a query example to a multi-modal interface with a screen image like Figure 1. A user states &quot;Can this, attach this,&quot; pointing at a picture on the screen and clicking the mouse during the first &quot;this&quot; and then choosing an item from a menu during the second. The system must realize that the first point is to a specific automobile and the second is to the menu item &quot;CD player&quot;. After integrating the two mouse pointing events into the two &quot;this&quot; in the utterance, the system must create an internal representation of this query that conforms to SQL specifications. In this example, even if the order of the two mouse clicking events is opposite, the system must generate the same SQL specification, but the interpretation will be more difficult. In order to interpret such complex combinations of multi-modal inputs, the following requirements exist: (1) Modes should be interpreted equally and independently. In conventional multi-modal systems, natural language mode plays a major role, and other modes such as mouse input mode are auxiliary. Inputs of auxiliary modes are merged into corresponding natural language expressions at a surface level, and the merged natural language query is interpreted by conventional natural language parsers. Therefore, the variety of accepted multi-modal expressions is very limited.</Paragraph> <Paragraph position="1"> However, if each mode is treated in the same manner as the natural language mode, the syntax and semantics of inputs of each mode are defined with a grammar formulation. 
Thus, complex expressions can be defined declaratively and more easily. (2) Mode interpretation should be referred to one another. Inputs of each mode should be interpreted independently. However, the interpretation of such inputs should be referred to by other mode interpretations. There are ambiguities which are solved only by integrating partial interpretations of related modes. For example, if a user states &quot;this car&quot;, pointing at an object which is overlapped on the car object, the ambiguity of the object pointing must be solved by comparing the two mode interpretations.</Paragraph> <Paragraph position="2"> (3) Mode interpretation should handle temporal information. Temporal information of inputs, such as input arrival time and the interval between two inputs, plays an important role in interpreting multi-modal inputs. Consider an example in which a user states &quot;How much is this car&quot;, and points at a car picture a little after the utterance. If the interval is three seconds, the pointing event should be integrated with &quot;this car&quot; in the utterance. However, if the interval is three minutes, the event should not be integrated.</Paragraph> </Section> <Section position="4" start_page="832" end_page="832" type="metho"> <SectionTitle> 3 MM-DCG Design Decisions </SectionTitle> <Paragraph position="0"> This section describes major design decisions made in developing MM-DCG. Because MM-DCG is a superset of DCG, everything possible in DCG is also possible in MM-DCG. However, two major extensions are provided.</Paragraph> <Section position="1" start_page="832" end_page="832" type="sub_section"> <SectionTitle> 3.1 Receiving Multiple Input Streams </SectionTitle> <Paragraph position="0"> MM-DCG can receive arbitrary numbers of different input streams, while DCG receives only one. Each mode is assigned an individual stream. 
Therefore, a single grammar rule in MM-DCG can allow the coexistence of grammatical categories in different modes, thus allowing for their integration. In addition, context sensitive information can be interchanged among categories of different modes in a single rule. Figure 2 illustrates a multi-modal input processing module which accepts three independent streams.</Paragraph> </Section> <Section position="2" start_page="832" end_page="832" type="sub_section"> <SectionTitle> 3.2 Calculating the Instantiated Time of Grammatical Categories </SectionTitle> <Paragraph position="0"> Inputs of a single mode invariably have ordering relations among them. A parser like DCG uses such order relations to analyze syntax, semantics, and pragmatics.</Paragraph> <Paragraph position="1"> Inputs of different modes, however, have no inherent ordering relations. Therefore, MM-DCG requires the attachment of both the beginning time and the end time to each individual piece of input data. MM-DCG automatically calculates the beginning time and the end time of any level of grammatical categories generated during parsing.</Paragraph> <Paragraph position="2"> The MM-DCG translator automatically generates the code which calculates the beginning and end times of any body goal in a grammar rule. The translator adds two extra arguments, storing the beginning time and the end time, to each head and body goal in MM-DCG rules. The beginning time argument of the head is unified with the beginning time argument of the first body goal. The end time argument of the head is unified with the end time argument of the last body goal. 
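The unification of time arguments just described can be pictured outside Prolog as a small span computation. The following Python fragment is only a sketch under assumed names (Span, category_span, and the sample timestamps are hypothetical and not part of MM-DCG): a category takes its beginning time from its first body goal and its end time from its last body goal.

```python
from typing import List, Tuple

# Hypothetical representation: each parsed constituent carries a
# (begin, end) time pair, in seconds.
Span = Tuple[float, float]


def category_span(children: List[Span]) -> Span:
    """Return the span of a category built from its body goals, in order."""
    if not children:
        raise ValueError("a category needs at least one body goal")
    begin = children[0][0]   # beginning time of the first body goal
    end = children[-1][1]    # end time of the last body goal
    return (begin, end)


# Illustrative timestamps for three consecutive words of one utterance.
words = [(0.0, 0.2), (0.3, 0.6), (0.7, 1.1)]
noun_phrase = category_span(words)
```

With these illustrative timestamps, noun_phrase evaluates to (0.0, 1.1): the pauses between the individual words do not affect the span of the enclosing category.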
Figure 3 shows the argument organization of the noun_phrase rule.</Paragraph> <Paragraph position="3"> Thus, for example, if a noun phrase category is instantiated by parsing &quot;the blue car&quot;, the beginning time of the instantiated category becomes equal to the beginning time of &quot;the&quot;, and the end time of the category is equal to the end time of &quot;car&quot;.</Paragraph> <Paragraph position="4"> MM-DCG requires any input from every mode to have beginning and end times. Thus, each item in an input sequence will have the following structure: input(beginning-time, end-time, &lt;actual input&gt;) which means that the actual input began at beginning-time and was completed at end-time. Adding this time information is easy for any of the sorts of input modes we are considering (i.e. speech recognition, keyboard inputs, mouse pointing, etc.).</Paragraph> <Paragraph position="5"> One other important item of notation: If a variable is explicitly bound within a goal, the variable returns the beginning and end times of the goal in the form of a functor. Thus, Time^goal means that &quot;if goal succeeds, the beginning time and end time of the goal are returned in the variable Time.&quot; Using the time information of instantiated categories, rule writers can define chronological constraints among categories. For example, the following description expresses a constraint that the pronoun category and the pointing category must both be instantiated within a five sec-</Paragraph> </Section> <Section position="3" start_page="832" end_page="832" type="sub_section"> <SectionTitle> 3.3 Defining Timeout in Rules </SectionTitle> <Paragraph position="0"> Timeout is a constraint on the interval between an input and its succeeding input in a stream (see Figure 4). 
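As a rough illustration of the timeout idea, the Python sketch below splits one stream of timestamped inputs at the first gap that exceeds a threshold. It is a sketch under stated assumptions, not the MM-DCG mechanism itself: the name take_until_timeout, the tuple layout, and the choice to measure the gap from the end of one input to the beginning of the next are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical item layout: (begin_time, end_time, payload), times in seconds.
Item = Tuple[float, float, str]


def take_until_timeout(stream: List[Item],
                       threshold: float) -> Tuple[List[Item], List[Item]]:
    """Split a stream at the first gap that exceeds the threshold."""
    if not stream:
        return [], []
    taken = [stream[0]]
    for item in stream[1:]:
        # Assumed gap definition: next begin time minus previous end time.
        gap = item[0] - taken[-1][1]
        if gap > threshold:
            break  # timeout: the remaining inputs are not visible for now
        taken.append(item)
    return taken, stream[len(taken):]


# Illustrative mouse clicks; the third one arrives after a long pause.
clicks = [(0.0, 0.1, "click1"), (2.0, 2.1, "click2"), (9.0, 9.1, "click3")]
first_burst, remaining = take_until_timeout(clicks, 5.0)
```

Here the first two clicks are returned together, while the third stays in the stream for a later goal because it arrives almost seven seconds after the second.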
If an interval between inputs of a stream becomes larger than a threshold defined in grammar rules, the timeout occurs, and the stream is regarded as temporarily empty although there still exist inputs in it.</Paragraph> <Paragraph position="1"> The following points rule means that &quot;Receive mouse clicking inputs while the interval between two inputs is less than 5 seconds or until a stream becomes null, then return the list of the inputs&quot;</Paragraph> <Paragraph position="3"/> </Section> </Section> <Section position="5" start_page="832" end_page="834" type="metho"> <SectionTitle> 4 Rules Written in MM-DCG </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="832" end_page="832" type="sub_section"> <SectionTitle> 4.1 Syntax </SectionTitle> <Paragraph position="0"> MM-DCG syntax extends DCG in the following ways: * A body goal may or may not specify its consuming stream: If a body goal consumes inputs from specific streams, the goal must be accompanied by the stream names. 
For example, the following rule noun_phrase --> keyboard:pronoun.</Paragraph> <Paragraph position="1"> means that &quot;if the pronoun category is found which is generated by inputs from the keyboard stream, noun_phrase is found.&quot; If a body goal is not accompanied by any stream name, the goal is regarded as consuming some amount of inputs from all modes.</Paragraph> <Paragraph position="2"> For example, the following rule noun_phrase --> noun.</Paragraph> <Paragraph position="3"> means that &quot;if the noun category is found which is generated by inputs from certain streams, noun_phrase is found.&quot; * A terminal symbol should always be accompanied by a specific stream name: For example, the following rule pointing --> mouse:[button(left, loc(X, Y))].</Paragraph> <Paragraph position="4"> means that &quot;if a functor button(left, loc(X, Y)) is found at the mouse stream, pointing is found&quot;.</Paragraph> </Section> <Section position="2" start_page="832" end_page="834" type="sub_section"> <SectionTitle> 4.2 Rule Example </SectionTitle> <Paragraph position="0"> To demonstrate how MM-DCG rules are written, this section describes a simple grammar needed to handle &quot;object&quot; with multi-modal inputs.</Paragraph> <Paragraph position="1"> Figure 5 shows the definition of &quot;object&quot;. A rule writer declares the existing streams explicitly using a unit clause, active_stream/1. An &quot;object&quot; is specified by using either one of the above modes or their combinations.</Paragraph> <Paragraph position="2"> The first object/1 definition interprets natural language specifications such as &quot;the blue car&quot;. The second object/1 interprets a mouse click which points at a</Paragraph> <Paragraph position="3"> specific graphical object on the display. 
The third object/1 definition interprets a combination of a natural language utterance and a mouse pointing, such as stating &quot;the blue car&quot; while pointing at a graphical object on the display. A natural language utterance is interpreted at the noun_phrase body goal, and the identified object is bound to Obj1. A mouse pointing event is interpreted at the pointing body goal, and the identified object is bound to Obj2.</Paragraph> <Paragraph position="4"> Then, the values of Obj1 and Obj2 are compared in a Prolog predicate enclosed inside curly brackets { and }. Both variables should be equal. If they are not, backtracking occurs, because the interpretation of noun_phrase or pointing must be wrong.</Paragraph> <Paragraph position="5"> As seen above, a single grammar rule in MM-DCG can allow the coexistence of grammatical categories in different modes, thus allowing for their integration. In addition, temporal and context sensitive information can be interchanged among categories of different modes in a single rule.</Paragraph> <Paragraph position="6"> % stream definition active_stream(speech, mouse, keyboard).</Paragraph> <Paragraph position="7"> % For natural language mode object(Obj) --> noun_phrase(Obj).</Paragraph> <Paragraph position="8"> noun_phrase(Obj) --> article, adjective(Attr, A_value), noun(Noun), {attribute(type, Noun, Obj), attribute(Attr, A_value, Obj)}. article --> (speech or keyboard):[the].</Paragraph> <Paragraph position="9"> adjective(color, blue) --> (speech or keyboard):[blue]. 
noun(automobile) --> (speech or keyboard):[car].</Paragraph> <Paragraph position="10"> % 
For mouse mode object(Obj) --> pointing(Obj).</Paragraph> <Paragraph position="11"> pointing(Obj) --> mouse:[button(left, loc(X, Y))], {attribute(location, (X, Y), Obj)}. % For combinations of modes object(Obj1) --> noun_phrase(Obj1), pointing(Obj2), {Obj1 == Obj2}.</Paragraph> </Section> </Section> <Section position="6" start_page="834" end_page="835" type="metho"> <SectionTitle> 5 Translating MM-DCG into Prolog </SectionTitle> <Paragraph position="0"> This section describes techniques for translating MM-DCG rules into Prolog predicates. First, we explain the translation method of MM-DCG rules with a single stream. Even in the single stream case, the MM-DCG translation method is different from that of DCG. Then, the translation technique with multiple streams is explained.</Paragraph> <Section position="1" start_page="834" end_page="834" type="sub_section"> <SectionTitle> 5.1 MM-DCG Translation for a Single Stream </SectionTitle> <Paragraph position="0"> The head and body goals in a grammar rule are translated into predicates with four extra arguments: two for the beginning time and the end time, and two for expressing a consumed input stream. The latter two arguments are the same as the arguments generated when DCG rules are translated into Prolog predicates.</Paragraph> <Paragraph position="1"> The beginning time argument of the head is unified with the beginning time argument of the first body goal, and the end time argument of the head is unified with the end time of the last body goal. 
For example, the 
following MM-DCG rule (for a single stream): noun_phrase --> article, adj, noun.</Paragraph> <Paragraph position="2"> is translated into: noun_phrase(T0, T, N0, N) :- article(T0, T1, N0, N1), adjective(T2, T3, N1, N2), noun(T4, T, N2, N).</Paragraph> <Paragraph position="3"> or, in English, there is a noun phrase between N0 and N if there is an article between N0 and N1, and if there is an adjective between N1 and N2, and if there is a noun between N2 and N. The noun phrase starts at T0 and ends at T. The article starts at T0 and ends at T1. The adjective starts at T2 and ends at T3. The noun starts at T4 and ends at T.</Paragraph> <Paragraph position="4"> A rule with a terminal symbol is translated into a unit clause. For example, noun --> keyboard:[window].</Paragraph> <Paragraph position="5"> translates into: noun(Ts, Te, [input(Ts, Te, &quot;window&quot;)|N], N). A functor input/3 is inserted into the third argument, forming the input stream of the predicate. The third argument of the functor input/3 is the actual input item, the &quot;window&quot; string in this example.</Paragraph> <Paragraph position="6"> The first and second arguments of input/3 are unified with the first and second arguments of this unit clause respectively. Therefore, if a string &quot;window&quot; is input via the keyboard stream, the noun category is instantiated, and the beginning and end times of the noun category are the same as the start and end times attached to the &quot;window&quot; input.</Paragraph> </Section> <Section position="2" start_page="834" end_page="834" type="sub_section"> <SectionTitle> 5.2 Extension to Arbitrary Number of Streams </SectionTitle> <Paragraph position="0"> Extension from a single stream to multiple streams is easy. 
Each stream needs four extra arguments - two for timing information and two for expressing a consumed input stream. Thus, if there are n modes, 4n arguments are added to the head and body goal arguments.</Paragraph> <Paragraph position="1"> For example, if there are two streams, the noun_phrase definition in the previous section is translated into the following Prolog predicates with eight (2 x 4) extra arguments: noun_phrase(Tx0, Tx, Nx0, Nx, Ty0, Ty, Ny0, Ny) :- article(Tx0, Tx1, Nx0, Nx1, Ty0, Ty1, Ny0, Ny1), adjective(Tx2, Tx3, Nx1, Nx2, Ty2, Ty3, Ny1, Ny2), noun(Tx4, Tx, Nx2, Nx, Ty4, Ty, Ny2, Ny).</Paragraph> </Section> <Section position="3" start_page="834" end_page="835" type="sub_section"> <SectionTitle> 5.3 Extractions of Temporal Information </SectionTitle> <Paragraph position="0"> If there is a variable binding within a goal like Time^goal, the goal is translated into a conjunction of two body goals (for a single mode): (goal(T0, T1, R0, R), Time = (T0, T1)). If there exist n streams, the variable Time is bound to a list of n time pairs, such as for two modes: (goal(Tx0, Tx1, Nx0, Nx1, Ty0, Ty1, Ny0, Ny1), Time = [(Tx0, Tx1), (Ty0, Ty1)])</Paragraph> </Section> </Section> <Section position="7" start_page="835" end_page="835" type="metho"> <SectionTitle> 6 Related work </SectionTitle> <Paragraph position="0"> The idea of understanding multi-modal inputs in conjunction with each other, as presented in this paper, is not particularly new. The idea of a multi-modal input combining motions and pointing has been explored in a number of contexts. The classic 1980 paper &quot;Put-That-There&quot; [Bolt, 1980] describes an early system that procedurally combined voice and gesture inputs. 
This idea was further explored in terms of integrating natural language and pointing by [Hayes, 1988], who related multi-modal inputs to anaphoric reference in natural language processing, particularly to the work of [Grosz, 1977] and [Sidner, 1979]. Recent work in the design of direct manipulation interfaces has also explored the notion of integrating a set of diverse inputs. Other papers exploring multimodal interfaces include [Allgayer et al., 1989; Cohen et al., 1989; Cohen, 1991; Kobsa et al., 1986; Wahlster, 1989]. Most of this work, however, has focused on the application of the ideas, and not on the principles for integrating the different inputs.</Paragraph> </Section> </Paper>