<?xml version="1.0" standalone="yes"?>
<Paper uid="J95-1003">
  <Title>Automatic Referent Resolution of Deictic and Anaphoric Expressions</Title>
  <Section position="3" start_page="60" end_page="61" type="metho">
    <SectionTitle>
2. Overview of EDWARD
</SectionTitle>
    <Paragraph position="0"> EDWARD is implemented in Allegro Common Lisp and runs on DECstations. Figure 1 presents a schematic overview of EDWARD's system architecture. The arrows represent the information flow between the main components. EDWARD accepts input from two devices: keyboard and mouse. The output is directed to two devices on the screen: an NL output text window and a graphics display, and, optionally, to a speech synthesizer. The dialogue manager coordinates input and output expressions and controls the linguistic and graphics processes. It maintains the Context Model, the knowledge base, and the lexicon; in addition, it decides which individual instances stored in the knowledge base must be represented on the graphics display, and it makes sure that the display is always up to date. The language interpreter and the language generator consult the Context Model, the knowledge base, and the lexicon.</Paragraph>
    <Paragraph position="1"> Both the interpreter and the generator operate in an incremental fashion. Figure 2 illustrates how the user can interact with EDWARD.</Paragraph>
    <Paragraph position="2"> The area occupying most of the screen is the graphics display: a window called Modelwereld (Model World). The tree shown in Figure 2 represents a hierarchy of directories (depicted as bookcases) and files (e.g., reports, papers, e-mail messages, and books). The viewport shows only part of the Model World window, which in principle extends indefinitely. In the bottom-left corner of the viewport, a garbage container and a copier are displayed. The bear icon, at the bottom in the middle, represents the system itself (i.e., EDWARD). Using a mouse, the user can manipulate the graphical representation of the domain objects by pointing, clicking, and dragging. At the bottom of the Model World window, a mouse documentation bar is presented (the Dutch word Linkerknop means 'left button,' versleep means 'to drag,' and Rechterknop means 'right button').</Paragraph>
    <Paragraph position="4"> Figure 2. A screen dump of EDWARD. The user is entering the command: Kopieer alle rapporten behalve dit. (Copy all reports except for this one.) after selection of the file icon labeled donald_report.</Paragraph>
    <Paragraph position="5"> In the bottom-left area of the screen is the NL interaction window labeled Dialoog (Dialogue). Here the user can enter NL commands, questions, or assertions. Depending on the number of words and ambiguities in a linguistic expression, interpretation takes between 0.5 and 1.5 seconds when running on a Personal DECstation 5000. In Figure 2, the user has requested the system to copy all reports except for this one. At the bottom right, the trace window Context displays the salience values of some of the discourse referents. Referents are presented by the name of the concept class they belong to, followed by the number sign (#) and a unique number enclosed in angle brackets, e.g., &lt;directory#4001&gt; and &lt;spin-report#4929&gt; (spin-reports are a special kind of project reports).</Paragraph>
  </Section>
  <Section position="4" start_page="61" end_page="76" type="metho">
    <SectionTitle>
3. Knowledge Sources
</SectionTitle>
    <Paragraph position="0"> To be able to interpret referring expressions, EDWARD uses three knowledge sources: a knowledge base, a Context Model, and a lexicon. The knowledge base stores the permanent generic and specific world knowledge of the system, whereas the Context Model temporarily &quot;memorizes&quot; which individual instances from the knowledge base have been referred to in the dialogue. The lexicon specifies morphophonological and syntactic features of words and contains links between words and the knowledge base that represents lexical meaning. In this section we will describe the knowledge base and Context Model. [Carla Huls et al. Deixis and Anaphora]</Paragraph>
    <Section position="1" start_page="62" end_page="62" type="sub_section">
      <SectionTitle>
3.1 The Knowledge Base
</SectionTitle>
      <Paragraph position="0"> The knowledge base is a semantic network implemented in CommonORBIT (De Smedt 1987), a frame-based language somewhat similar to KL-ONE (Brachman and Schmolze 1985). The nodes in the network represent classes and instances of entities and relations. For example, the class &lt;person&gt; contains two subordinate classes, &lt;man&gt; and &lt;woman&gt;, and the concept of sending an object to someone is represented by a generic relation called &lt;send&gt;. Individual objects in the domain are represented by instances; e.g., an individual who is a man might be represented as &lt;man#24&gt;. If he sends a message, a relation instance is created; e.g., &lt;send#89&gt;. In contrast to KL-ONE, relations have a time interval associated with them, which represents the period of time during which the relation is assumed to hold. A time interval has a start value and an end value. The end value may be *NOW*, which is a dynamic value representing open-endedness in a time interval. Much as in KL-ONE, relations contain role-filler class restrictions and role-set restrictions. For example, with the generic relation &lt;send&gt;, three semantic (case) roles are associated, called &lt;agent&gt;, &lt;goal&gt;, and &lt;recipient&gt;.</Paragraph>
      <Paragraph position="1"> The role-filler class restrictions then specify, for example, that the fillers of the &lt;agent&gt; and &lt;recipient&gt; roles must be either persons or institutions and that the filler of the &lt;goal&gt; role (the object that is sent) must be concrete and must not be a person. This information is used by the interpretation component to restrict the referent sets of the role fillers of a relation. The role-set restrictions specify, for example, that the filler of the &lt;recipient&gt; role in a &lt;send&gt; relation may not, at least in our current domain, be identical to the filler of the &lt;agent&gt; role. The interpreter could use these restrictions to exclude certain referents from the set of potential referents.</Paragraph>
      <Paragraph position="2"> Depending on the domain EDWARD is being applied to, a filter is defined to determine which concepts of the knowledge base should be visually represented on the screen. The file system domain filter, for instance, allows instances of particular file system classes, such as directories, e-mail messages, reports, and books. The instances passing the filter are represented by icons that depict their class. The only relation instances passing the file system domain filter are &lt;contain&gt; relations and &lt;name&gt; relations. A &lt;contain&gt; relation is represented graphically by a straight line linking the icon that represents the container and the icon representing the object contained.</Paragraph>
      <Paragraph position="3"> &lt;Name&gt; relations (if present) are represented by a label underneath the icon of the named object (see Figure 2).</Paragraph>
    </Section>
    <Section position="2" start_page="62" end_page="66" type="sub_section">
      <SectionTitle>
3.2 The Context Model
</SectionTitle>
      <Paragraph position="0"> The second knowledge source EDWARD uses to analyze referring expressions is the Context Model. The central notion in this model is salience. The intuitive notion of salience has two important characteristics. In the first place, the salience of an instance at a given moment is determined by a diversity of factors of varying importance. In written language, recency of mention is known to be an important factor, as are syntactic and semantic parallelism, the markedness of expressions and constructions, and so on. Spoken language adds intonation, and when the situational context gets involved, various perceptual factors like visibility join in. The second important characteristic of salience is its gradedness. An individual instance may be more or less salient, may gradually become less salient, etc.
Table 1 (fragment). Scope of each CF, followed by its successive significance weights:
... and modifier
Referents of the subject phrase: [2, 1, 0]
Referents of noun phrase modifiers (e.g., prepositional phrase, relative clause): [1, 0]
Relations expressed by subject, prepositional phrase, and relative clause: [3, 2, 1, 0]</Paragraph>
      <Paragraph position="1"> Alshawi (1987) provides a general framework for modeling salience that does justice to both characteristics mentioned above. The central construct in this framework is that of context factor (CF). A CF is defined by a scope, which is a collection of individual instances; a significance weight, represented by an integer; and a decay function, which indicates by what amount the CF's significance weight is to be decreased at the next update. In EDWARD we have adopted Alshawi's notion of CFs and elaborated it. Table 1 presents an overview of the CFs EDWARD uses.</Paragraph>
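      <Paragraph position="2"> The salience value (SV) of an individual instance (inst) at any given moment is obtained simply by adding the current significance weights of the CFs which have that instance in their scope: $$SV(inst) = \sum_{i \,:\, inst \in scope(CF_i)} w_i$$ where $w_i$ denotes the current significance weight of $CF_i$.</Paragraph>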
      <Paragraph position="4"> Henceforward, we will say that an individual instance is in context if its SV is more than 0. The elegance of this particular notion of salience is that it allows for a unified measure of salience, which is determined by an indefinite number of independent factors that can be monitored separately. This architecture differs from the architectures of related work on multimodal interfaces described in the introduction, which all adopt Grosz and Sidner's approach to modeling referents in context. In Section 5, we will compare their approach with ours.</Paragraph>
      <Paragraph position="5"> In EDWARD we presently use seven CFs (see Table 1): four serve to model linguistic context effects and three to model perceptual context effects. The linguistic CFs are major-constituent referent CF, subject referent CF, nested-term referent CF, and relation CF. Major-constituent referents are the referents of the subject, the direct object, the indirect object, and the main modifiers of a sentence. They are the role fillers of the relation expressed by the main clause. A major-constituent referent CF has an initial significance weight of 3. (All significance weights have been determined by trial and error and, as will be shown in Section 5, work fine.) Subject referent CFs model the observation that referents of subject noun phrases (NPs) are more salient than referents of the other major clause constituents. Their initial significance weight is 2. Nested-term referents are the referents expressed by NP modifiers. These referents are mentioned in the sentence, but they are less prominent than the subject referents or major referents. Nested-term referent CFs have an initial significance weight of 1. Relation CFs are created for all the relations expressed by a sentence, e.g., by the main clause, or by NPs modifying prepositional phrases. Their purpose is to make references to actions expressed in a sentence possible, as in, for example, do it again. Their initial significance weight is 3. The decay function of the linguistic CFs subtracts 1 from a CF's weight at each successive update. If a CF's weight equals 0, the CF is discarded. Table 2 shows how the salience of some individual instances changes in the course of a short dialogue. The three rightmost columns present the SVs after the interpretation of the utterance in the left column. These values are used for the interpretation of the next sentence. After each sentence, the existing CFs are updated by calling their decay functions, and new CFs are created.</Paragraph>
      <Paragraph position="6"> Table 2. Example of salience value calculation.
Utterance | SV of Koen | SV of Ria | SV of the article
(before the dialogue) | 0 | 0 | 0
Koen is de echtgenoot van Ria. (Koen is the husband of Ria.) | 3 + 2 = 5 (subject + major) | 1 (nested) | 0
Hij schrijft een artikel. (He writes an article.) | (3-1+2-1)+3+2 = 8 ((existing) + subject + major) | 1-1 = 0 (existing) | 3 (major)
Het artikel gaat over zijn vrouw. (The article is about his wife.) | (3-1-1+2-1-1+3-1+2-1)+1 = 5 ((existing) + nested) | 3 (major) | (3-1)+3+2 = 7 ((existing) + subject + major)</Paragraph>
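      <Paragraph><![CDATA[The update cycle for the linguistic CFs can be illustrated with a minimal Python sketch. It is not EDWARD's CommonORBIT implementation: the class names, the string labels for instances, and the assignment of referents to CF kinds per sentence are assumptions chosen to replay Table 2. Relation CFs are left out of the replay because Table 2 tracks only the three entity referents.

```python
from dataclasses import dataclass

# Initial significance weights of the linguistic CFs, as stated in the text.
INITIAL_WEIGHT = {"major": 3, "subject": 2, "nested": 1, "relation": 3}


@dataclass
class ContextFactor:
    kind: str          # "major", "subject", "nested", or "relation"
    scope: frozenset   # instances that receive salience from this CF
    weight: int        # current significance weight

    def decay(self):
        # Linguistic CFs lose 1 at each successive update.
        self.weight -= 1


class ContextModel:
    def __init__(self):
        self.factors = []

    def update(self, new_factors):
        """Decay existing CFs, discard those that reach 0, then add new CFs."""
        for cf in self.factors:
            cf.decay()
        self.factors = [cf for cf in self.factors if cf.weight > 0]
        for kind, instances in new_factors:
            self.factors.append(
                ContextFactor(kind, frozenset(instances), INITIAL_WEIGHT[kind]))

    def salience(self, inst):
        # SV = sum of the weights of all CFs that have inst in their scope.
        return sum(cf.weight for cf in self.factors if inst in cf.scope)


# Hypothetical encoding of the three utterances of Table 2.
dialogue = [
    # Koen is de echtgenoot van Ria.
    [("major", ["koen"]), ("subject", ["koen"]), ("nested", ["ria"])],
    # Hij schrijft een artikel.
    [("major", ["koen"]), ("subject", ["koen"]), ("major", ["article"])],
    # Het artikel gaat over zijn vrouw.
    [("major", ["article"]), ("subject", ["article"]),
     ("major", ["ria"]), ("nested", ["koen"])],
]

model = ContextModel()
history = []
for new_factors in dialogue:
    model.update(new_factors)
    history.append(tuple(model.salience(x) for x in ("koen", "ria", "article")))

print(history)  # [(5, 1, 0), (8, 0, 3), (5, 3, 7)]
```

Keeping each CF as a separate object, rather than storing one number per instance, is what lets every factor decay independently while the SV remains a plain sum.]]></Paragraph>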
      <Paragraph position="7"> The perceptual CFs are as follows: visible referent CF, selected referent CF, and indicated referent CF. Visible referent CFs cause referents that are visible to have a higher SV than referents that are not visible. A visible referent CF has an initial significance weight of 1, so a referent that is visible will be a little more salient than a referent that is not. As soon as the graphical representations (icons) of the referents in the scope of a visible referent CF become invisible (e.g., as a result of a scroll action), the weight drops to 0 and the CF will be discarded. Selected referent CFs cause selected referents to be more salient than referents that are merely visible. A selected referent CF is created when an icon has been selected by the user (by moving the mouse to the icon and clicking the left mouse button), or when the user has requested the system (in natural or formal language) to select icons. Its significance weight is initially 2, and it remains 2 for as long as the icon remains selected. As soon as the icon is deselected, the weight drops to 0 and the CF will be discarded. An indicated referent CF, finally, causes a referent that is indicated by either the system or the user to be very salient for a short time. Indication by the system is done by means of a simulated pointing gesture: a thick, animated, growing arrow pointing to a particular icon (for instance, generated upon the question &quot;Which e-mail message is about parsing?&quot;). An indicated referent CF has an initial significance weight of 30 to make sure that the referent in its scope will be the most salient one immediately after the pointing has occurred. After the first update, its significance weight drops to 1, and at the next update, it becomes 0. Notice the difference between selection and indication. Selection is an action only the user can initiate; if the selection is done with a pointing action, both a selected referent CF and an indicated referent CF are created (e.g., for donald_report in Figure 2); otherwise only a selected referent CF is created. However, both the user and EDWARD can point, creating indicated referent CFs; pointing has a more temporary effect than selection.
[Computational Linguistics Volume 21, Number 1]
Table 3. The four types of referring expressions. Anaphoric expressions are only possible in the NL mode. EDWARD is able to deal with all four types.</Paragraph>
      <Paragraph position="8"> 4. Interpreting Deictic and Anaphoric Expressions in EDWARD
EDWARD is able to interpret the two kinds of referring expressions distinguished in the introduction, viz., deictic and anaphoric expressions. When combined with the three categories of interaction modes--unimodal graphical, unimodal linguistic, and multimodal--this results in the four types of referring expressions listed in Table 3 (see footnote 4). The basic principle that EDWARD uses to resolve referring expressions is the same for all four types shown in Table 3. Both EDWARD's graphics processes and its syntactic, semantic, and pragmatic interpretation processes operate on line (i.e., interpretation starts directly and goes on while the user enters the remainder of the utterance), incrementally (i.e., the interpretation is built up piece by piece from left to right), and in parallel (i.e., more than one interpretation process can be handled at every moment).</Paragraph>
      <Paragraph position="9"> To determine the referent of a phrase, first all individual instances satisfying the semantic restrictions of the phrase are listed. The one with the highest SV, being the most likely referent, is put at the front. Next, after completion of the phrase, the salience of each referent is retrieved by adding the significance weights of all CFs that have this individual instance in their scope (see footnote 5). The most salient individual instance is taken to be the referent of the phrase. In the final sentence of Table 2, for example, the referent of the phrase het artikel (the article) is the most salient individual instance belonging to the class &lt;article&gt; or to any of its subordinate classes. This approach implies that if a particular individual instance has the highest SV, the user need not be very specific and can use, for example, het (it), die (that one), die file (that file), or dat ding (that thing). If the highest SV is shared by several instances (a tie), EDWARD will ask the user to indicate which of the candidates is intended (e.g., &quot;Do you mean donald_report?&quot;). The following three subsections describe how EDWARD deals with the specifics of the four types of referring expressions in turn.
Footnote 4. Referring by name is not included in this table, because it is neither a deictic nor an anaphoric reference. However, EDWARD resolves referring by name in the same way as it does the other four types of referring expressions.
Footnote 5. The programming language CommonORBIT used in EDWARD provides pointers back from the object to the CFs that have the object in their scope (which compares to Alshawi's notion of marking).</Paragraph>
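      <Paragraph><![CDATA[The selection rule described above (filter by semantic restrictions, take the most salient candidate, ask the user on a tie) might be sketched as follows. This is an illustrative Python sketch under stated assumptions, not EDWARD's code: the function names, the tiny class hierarchy, and the requirement that a candidate be in context (SV above 0) are assumptions, and the salience values in the usage lines are invented for the example.

```python
# Hypothetical fragment of a class hierarchy: child class -> parent class.
SUPERCLASS = {"spin-report": "report"}


def is_a(cls, target):
    """True if cls equals target or is one of its subordinate classes."""
    while cls is not None:
        if cls == target:
            return True
        cls = SUPERCLASS.get(cls)
    return False


def resolve(target_class, instances, salience):
    """Pick the referent of a phrase that denotes target_class.

    instances maps an instance name to its class; salience maps an
    instance name to its current SV. Returns ("referent", name),
    ("ask", names) when the highest SV is shared, or ("none", []).
    """
    candidates = [name for name, cls in instances.items()
                  if is_a(cls, target_class) and salience.get(name, 0) > 0]
    if not candidates:
        return ("none", [])
    best = max(salience[name] for name in candidates)
    top = sorted(name for name in candidates if salience[name] == best)
    if len(top) == 1:
        return ("referent", top[0])
    return ("ask", top)  # tie: the user must indicate the intended one


instances = {"gr2_report": "spin-report",
             "donald_report": "report",
             "claassen": "directory"}

# Recently mentioned report versus a report that is merely visible.
mentioned = resolve("report", instances,
                    {"gr2_report": 3, "donald_report": 1, "claassen": 1})

# The same situation after pointing at donald_report:
# visible (1) + selected (2) + indicated (30).
pointed = resolve("report", instances,
                  {"gr2_report": 3, "donald_report": 1 + 2 + 30, "claassen": 1})

# Both reports are only visible, so the highest SV is shared.
tie = resolve("report", instances,
              {"gr2_report": 1, "donald_report": 1, "claassen": 1})

print(mentioned, pointed, tie)
```

Because linguistic and perceptual CFs feed the same SV, the resolver itself does not need to know whether salience came from a mention or from a pointing gesture.]]></Paragraph>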
    </Section>
    <Section position="3" start_page="66" end_page="69" type="sub_section">
      <SectionTitle>
4.1 Unimodal Linguistic Reference
</SectionTitle>
      <Paragraph position="0"> dit (this), deze files (these files); personal pronouns, e.g., hij (he), het (it); and adverbs, e.g., daar (there).</Paragraph>
      <Paragraph position="1"> To determine the referent of an anaphoric expression, the interpretation component retrieves the most salient, semantically appropriate referent. The salience of a referent is influenced by both linguistic and perceptual context, as was described in Section 3.2. Plural reference is handled by using sets. To illustrate this, suppose EDWARD has just generated Het bevat gr2_report en qbgc. (It contains gr2_report and qbgc.). At that point, a set instance &lt;set#1189&gt; consisting of &lt;spin-report#6362&gt; and &lt;spin-report#6173&gt; is in context, as are the two individual file instances (though they have lower SVs than the set instance). If the user enters Verwijder die. (Remove them.), die (them) is considered to refer to the most salient instance satisfying the semantic restrictions, in this case &lt;set#1189&gt;.</Paragraph>
      <Paragraph position="2"> An interesting subset of anaphoric expressions are inferential anaphors. Inferential anaphors are references to individual instances that are not explicitly introduced in the dialogue, but are implicitly introduced by associated instances: e.g., The secretary in the sentence pair The NICI has 80 employees. The secretary is called Hil. To identify the correct referent, an inference must be made, in this case that institutes employ secretaries.</Paragraph>
      <Paragraph position="3"> Haviland and Clark (1974) called this type of inference a bridge. There are (at least) two ways to have the system &quot;cross the bridge&quot; and resolve inferential anaphors. The first involves the incorporation of associative CFs that create some salience for associates of individual instances just mentioned (e.g., upon mention of the NICI, creating associative CFs for the institute's secretary, its director, its hosting university, etc.). We have discarded this option because it is unattractive from a computational point of view. In many domains, the number of individual instances associated with a mentioned individual instance may be very high. Creating associative CFs for all of these associated instances is computationally expensive, especially since most of them would be created without being of any use (only seldom are there several bridges to cross simultaneously). In a worst-case scenario, associative CFs interfere with the referent resolution of normal anaphoric expressions. Individual instances that have not been mentioned, but that lie in the intersection of the sets of associates of several consecutively mentioned referents, may become more salient than instances that have been mentioned. For example, suppose Herb, the brother of the boss of the NICI, and Catherine, the boss's sister, visit the NICI. Upon interpretation of Herb and Catherine visit the NICI, the boss of the NICI would have some salience owing to the three associative CFs that have been created for it. But any subsequent male pronoun (he, him, his) can refer only to Herb and not to the not-mentioned boss of the NICI.</Paragraph>
      <Paragraph position="4"> In the second solution, associated individual instances are not in focus as long as interpretation of referring expressions can work as described above. If the interpreter can find no referent for a particular phrase (e.g., no secretary is in context in the case of The NICI has 80 employees. The secretary is called Hil), the associated individual instances of all referents in context are retrieved, starting with the referent of highest salience, and matched against the class of the phrase. We currently use the following tentative heuristic for associated individual instance retrieval: All relations are taken into account between the referent in context (in this example, &lt;department#276&gt;, having a &lt;name&gt; relation with NICI) and a referent of the requested class that can be expressed by the lemma van (of). In the example, this simulates the NP the secretary of the department. An advantage of this approach is that referent resolution for phrases other than inferential anaphors is not affected. No effort is wasted in creating associative CFs for individual instances that are not mentioned. Starting the search process at the most salient instance saves computational costs.</Paragraph>
      <Paragraph position="5">  Personal deixis. The intension of the personal pronouns ik (I) and jij (you) is represented using the following predicates:</Paragraph>
      <Paragraph position="7"> where the predicate cognizer is taken from Pylyshyn (1984), meaning any rational agent, e.g., a person or a dialogue system, and talking-to is a predicate that represents the dialogue situation at any time. For example, when the user is entering an input sentence, the clause talking-to(user, system) is true so the pronoun ik (I) refers to the user. It is the dialogue manager's task to keep track of who is talking to whom and to update the knowledge base accordingly.</Paragraph>
      <Paragraph position="8"> Temporal deixis. The interpretation of temporal deixis critically depends on the time of speech of the utterance. EDWARD uses the machine time as an anchoring point.</Paragraph>
      <Paragraph position="9"> For example, the time interval of the relation &lt;live-in#1&gt; expressed by Koen woont in Nijmegen. (Koen lives in Nijmegen.) is an open-ended time interval starting at the machine time at the time of interpretation and ending at *NOW*. If another related relation is added to the knowledge base, e.g., &lt;live-in#2&gt; expressed by a subsequent Koen woont in Amsterdam. (Koen lives in Amsterdam.), the open-ended time interval of the first &lt;live-in&gt; relation is closed, ending at the current machine time at the time of interpretation of the second relation. The first &lt;live-in&gt; relation can now be referred to in simple past tense; the second, in present tense. For example, in case of a subsequent question like Woont Koen in Amsterdam? (Does Koen live in Amsterdam?), the time interval of this question relation, viz., *NOW*, is included in the time interval of &lt;live-in#2&gt; found in the knowledge base, and thus the system would respond with Ja, hij woont er. (Yes, he lives there.). If, however, the question were Woont Koen in Nijmegen? (Does Koen live in Nijmegen?), *NOW* is not included in the time interval of &lt;live-in#1&gt;, the relation no longer holds, and the system would respond negatively.</Paragraph>
      <Paragraph position="10"> Since the system, in this case, knows what &lt;live-in&gt; relation does hold, it can respond cooperatively with Nee, hij woont in Amsterdam. (No, he lives in Amsterdam.). Currently, simple present and simple past tense are the only two tenses handled.</Paragraph>
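      <Paragraph><![CDATA[The interval bookkeeping for the Koen example can be sketched in a few lines of Python. This is a hedged illustration only: the class and method names are hypothetical, machine time is simplified to an integer, and treating two live-in relations of the same agent as "related" is an assumption made for the example.

```python
from dataclasses import dataclass

NOW = "*NOW*"  # dynamic end value marking an open-ended interval


@dataclass
class LiveIn:
    agent: str
    place: str
    start: int        # machine time at interpretation (simplified to an int)
    end: object = NOW


class KnowledgeBase:
    def __init__(self):
        self.relations = []

    def assert_live_in(self, agent, place, machine_time):
        # Close any open-ended live-in relation of the same agent.
        for rel in self.relations:
            if rel.agent == agent and rel.end == NOW:
                rel.end = machine_time
        self.relations.append(LiveIn(agent, place, machine_time))

    def lives_in_now(self, agent, place):
        """Answer a present-tense question such as 'Woont Koen in ...?'.

        Returns ("yes", place) if an open-ended relation matches, and
        otherwise ("no", where) with the place that does hold, which
        allows a cooperative answer; where is None if nothing holds.
        """
        current = [r for r in self.relations
                   if r.agent == agent and r.end == NOW]
        if any(r.place == place for r in current):
            return ("yes", place)
        if current:
            return ("no", current[0].place)
        return ("no", None)


kb = KnowledgeBase()
kb.assert_live_in("Koen", "Nijmegen", machine_time=100)
kb.assert_live_in("Koen", "Amsterdam", machine_time=200)

print(kb.lives_in_now("Koen", "Amsterdam"))  # ('yes', 'Amsterdam')
print(kb.lives_in_now("Koen", "Nijmegen"))   # ('no', 'Amsterdam')
```

The second answer carries the place that currently holds, which corresponds to the cooperative response Nee, hij woont in Amsterdam.]]></Paragraph>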
      <Paragraph position="11"> Spatial deixis. The presence of a visible model world invites the user to generate referring linguistic expressions involving the spatial environment. We call definite NPs referring to the only object of a certain type visible at that moment implicit spatial deixis.
Figure 3. Reference resolution of spatial descriptions: a schematic layout of two directory icons and two file icons.</Paragraph>
      <Paragraph position="12"> An example is the NP the closed bookcase in the case that only one icon resembles a closed bookcase. EDWARD resolves this type of referring expression simply by obtaining the most salient object of the right type. The object will be in the scope of the visibility CF, and if no other object of this type is in context, the visible object thus will be selected as the referent.</Paragraph>
      <Paragraph position="13"> Explicit references to the spatial environment are references to spatial relations.</Paragraph>
      <Paragraph position="14"> Spatial relations can be divided into topological relations and projective relations (Retz-Schmidt 1988). Examples of topological relations are IN, AT, and NEAR. Topological expressions (e.g., the file near it) refer to topological relations between the referent and the relatum (in this example, the object referred to by it). Examples of projective relations are IN FRONT OF, BETWEEN, LEFTMOST, and BESIDE. Projective relations convey information about the direction in which an object is located with respect to another object or to the world. A particular linguistic expression describing a projective relation can be used in three different ways: deictically, intrinsically, and extrinsically. The phrase the ball in front of the car, for example, can have three interpretations. It could mean that the ball's location is referred to in relation to the car from the speaker's point of view (deictic use), or with respect to the orientation of the car itself (intrinsic use), or with respect to the actual direction of motion of the car (extrinsic use).</Paragraph>
      <Paragraph position="15"> In EDWARD all linguistic expressions describing spatial relations are interpreted deictically. For the time being, this restriction does not cause problems. Extrinsic use of, for example, the projective preposition left of, i.e., left of an object that is being dragged by the user, when looking in the direction of dragging, is currently impossible since the user cannot drag and write linguistic expressions simultaneously. Intrinsic use of, for example, left of and right of is assumed to be rare in the current domain: none of the now more than 50 users that have interacted with EDWARD used it.</Paragraph>
      <Paragraph position="16"> To determine the referent of a spatial expression, the visible Model World is scanned for a referent, using the intension of the spatial relation and the relatum.</Paragraph>
      <Paragraph position="17"> The area to be scanned depends on the context. For some relations, the boundaries of the Model World are searched (e.g., the bottommost file); for others, the area in the relatum's vicinity (e.g., the file left of donald_report), or the area of the most salient objects (e.g., the file on the left if the directory containing that file is very salient) is searched.</Paragraph>
      <Paragraph position="18"> Now let us consider a more complex example. Suppose there are two directory icons and two file icons, positioned as schematically indicated in Figure 3. Suppose all objects have an SV of 1, and no other files and directories are in context (i.e., have an SV greater than 0).</Paragraph>
      <Paragraph position="19"> Both expressions, the file and the directory, are ambiguous and would force EDWARD to start a clarification dialogue with the user. However, the spatial description the file below the directory is unambiguous. Relatum and referent support each other in reference resolution. EDWARD scans the vicinity of both relata &lt;dir#1&gt; and &lt;dir#2&gt;. Since it finds a referent (&lt;file#1&gt;) only for &lt;dir#1&gt;, it can determine &lt;file#1&gt; as referent and &lt;dir#1&gt; as relatum.</Paragraph>
    </Section>
    <Section position="4" start_page="69" end_page="69" type="sub_section">
      <SectionTitle>
4.2 Unimodal Graphical Reference
</SectionTitle>
      <Paragraph position="0"> Because the notions of deixis and anaphora make sense only in the language mode, we cannot apply this distinction to the action mode. All unimodal graphical reference is considered deictic.</Paragraph>
      <Paragraph position="1"> The graphics analyzer interprets the pointing gestures produced by the user. An additional opportunity in simulated pointing that is not available in normal gesturing is the provision of feedback about the success of a pointing gesture. The indicated object, henceforward referred to as the demonstratum, (e.g., a file icon, directory tree, or screen position), is marked using reverse video and becomes selected. Usually, the user points to an object to indicate that it is the argument of the command he wants to perform, e.g., a file copy command. Objects remain selected until the user points to another object or explicitly deselects the selected object. The pointing gestures that the system produces have been designed not to interfere with user selection. The graphics analyzer always immediately updates the selected CF of the demonstratum.</Paragraph>
      <Paragraph position="2"> The user can simulate pars-pro-toto and totum-pro-parte pointing gestures. In pars-pro-toto pointing, an object is selected by pointing to a pixel that is within the object's selection area (which encloses the area covered by its icon) and subsequently pressing the select object mouse button. By simultaneously pressing the multiple selection key, multiple objects can be selected. In totum-pro-parte pointing, objects are selected either by enclosing the icons in a mouse-driven rectangle, or by pointing to an icon that is part of a compound object, typically the root of a directory tree, and pressing the select compound object mouse button.</Paragraph>
      <Paragraph position="3"> Notice that all simulated pointing gestures are in principle ambiguous: they can refer either to the positions themselves or to the objects located at these positions.</Paragraph>
      <Paragraph position="4"> When operating in the action mode, i.e., selecting and manipulating graphical representations, the gestures can be taken to refer to the objects at the positions indicated, since screen positions cannot be manipulated.</Paragraph>
    </Section>
    <Section position="5" start_page="69" end_page="71" type="sub_section">
      <SectionTitle>
4.3 Multimodal Referring Expressions
</SectionTitle>
      <Paragraph position="0"> Multimodal deictic referring expressions combine referring linguistic expressions with simulated pointing gestures. Since pointing to time is impossible, only spatial and personal deixis are possible in multimodal referring expressions. Demonstrative expressions (e.g., dit bestand/deze [this file/this one]) in combination with the realization of an appropriate pointing gesture are common examples of multimodal referring expressions. Notice, however, that demonstrative phrases are not necessarily accompanied by pointing gestures (they can be used anaphorically as well; see Section 4.1.3). Moreover, pointing gestures can also be combined with other, non-demonstrative definite NPs: Het rapport over DoNaLD zit in Claassen ↗. (The report about DoNaLD is in Claassen ↗; with a pointing gesture to the Claassen directory).</Paragraph>
      <Paragraph position="1"> To determine the referent(s) of (multimodal) referring expressions, the interpretation component retrieves the most salient referent that satisfies the semantic restrictions of the input phrase. The salience of a referent is influenced by both linguistic and perceptual CFs, so multimodal referring expressions are resolved in exactly the same way as unimodal referring expressions. Consider, for instance, the interpretation of dit (this one) in sentence (2a) versus the interpretation in sentence (2b) following the NL command (1):
(1) Zoek het rapport over Gr2. (Find the report about Gr2.)
(2a) Kopieer alle rapporten behalve dit. (Copy all reports except for this one.)
(2b) Kopieer alle rapporten behalve dit↗. (Copy all reports except for this↗ one; where the report named donald_report is the demonstratum).</Paragraph>
      <Paragraph position="2"> Let us assume that the referent of the report about Gr2 has an SV of 3 just before sentence (2a) or (2b) is interpreted. The referent of dit (this one) in sentences (2a) and (2b) would be the most salient report at that moment, which would be the report about Gr2 in sentence (2a), but the report pointed to (donald_report) in sentence (2b). Notice that multimodal expressions with a redundant pointing gesture (e.g., gr2_report↗ if there is just one object named gr2_report in the context) are resolved in the same way. Now, what happens if the user uses multiple pointing gestures within one utterance, as in the example Zet deze file hier↗, en deze↗ daar↗. (Put this file here↗, and this↗ one there↗.)? The fact that both EDWARD's graphics processes and its syntactic, semantic, and pragmatic interpretation processes operate on line, incrementally, and in parallel implies that the context effects of a pointing gesture can immediately be taken into account by the reference analysis process. So, if the user points to an icon, the salience of its referent increases immediately, making it the most likely candidate referent of the phrase at hand. By the time the user starts to point a second time, the analysis of the previous multimodal referring expression has been completed, and the context effect of the second pointing gesture is used to resolve the corresponding referring expression.</Paragraph>
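      <Paragraph> The resolution strategy described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not EDWARD's implementation (which is written in Common Lisp): the names Referent, resolve, and point_at, the salience of 0 for donald_report, and the boost of 10 for a pointing gesture are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Referent:
    name: str
    kind: str        # semantic class, e.g. "report"
    salience: float  # current salience value (SV)


def resolve(candidates, kind):
    """Return the most salient referent of the requested semantic class."""
    matching = [r for r in candidates if r.kind == kind]
    if not matching:
        return None
    return max(matching, key=lambda r: r.salience)


def point_at(referent, boost=10.0):
    """Model the context effect of a pointing gesture as a salience boost.

    The boost value is an assumption made for this sketch.
    """
    referent.salience += boost


# Context after command (1): the report about Gr2 was just mentioned.
gr2 = Referent("gr2_report", "report", 3.0)
donald = Referent("donald_report", "report", 0.0)
context = [gr2, donald]

# (2a) "dit" without a gesture: the most salient report is gr2_report.
unimodal = resolve(context, "report").name

# (2b) "dit" with a gesture at donald_report: the gesture raises its
# salience before the expression is resolved.
point_at(donald)
multimodal = resolve(context, "report").name

print(unimodal, multimodal)
```

Under these assumptions, the same resolve call handles both the unimodal and the multimodal case; only the salience values in the context differ.</Paragraph>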
      <Paragraph position="3"> Empirical evidence shows that deictic gestures are indeed exactly coordinated with their associated verbal expressions. Marslen-Wilson et al. (1982), for example, observed that their subject's pointing gestures occurred simultaneously with the demonstrative in the associated NP, or when no demonstrative was used, with the head of the corresponding NP. They report no deictic gestures after completion of the corresponding NP. This implies that the timing of their subject's pointing gestures would satisfy the restriction mentioned above.</Paragraph>
      <Paragraph position="4"> Since pointing yields both the screen location pointed to and the object positioned at that location, it is the interpreter's job to disambiguate. Further ambiguity arises if two objects have selection areas that partially overlap and the user points in this intersection area. EDWARD cannot determine which object's area the user referred to unless this pointing action is part of a multimodal expression such as dit↗ boek (this↗ book). The graphics interpreter passes all candidates (in this case, for example, &lt;screen-position#798&gt;, &lt;book#248&gt;, &lt;report#546&gt;) on to the dialogue manager, which brings them temporarily into the context. That is, an indicated CF and a selected CF are created for each of them. Guided by the language interpreter, the dialogue manager then decides which of the referents was intended. In the case of dit boek, pointing to a report or screen location was not intended, and thus the dialogue manager decides that the indicated CF and selected CF updates for the report and the screen location were invalid. It kills these CFs and subsequently deselects the unintentionally selected object.</Paragraph>
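      <Paragraph> The disambiguation step described above might be sketched as follows. The function name disambiguate and the representation of candidates as (identifier, kind) pairs are assumptions made for illustration; the identifiers are those of the example in the text.

```python
def disambiguate(candidates, required_kind):
    """Split gesture candidates into intended and rejected referents.

    candidates: (identifier, kind) pairs passed on by the graphics
    interpreter. required_kind: the semantic class demanded by the
    accompanying linguistic expression, or None if there is none.
    """
    if required_kind is None:
        return list(candidates), []
    intended = [c for c in candidates if c[1] == required_kind]
    rejected = [c for c in candidates if c[1] != required_kind]
    return intended, rejected


candidates = [
    ("screen-position#798", "screen-position"),
    ("book#248", "book"),
    ("report#546", "report"),
]

# "dit boek" (this book): only the book satisfies the noun's restriction.
intended, rejected = disambiguate(candidates, "book")

# The CFs of the rejected candidates would be removed; rejected objects
# (but not bare screen positions) would also be deselected.
to_deselect = [c[0] for c in rejected if c[1] != "screen-position"]

print(intended, to_deselect)
```

In this sketch the linguistic restriction alone decides which candidate survives; without an accompanying expression all candidates are kept, which reflects the ambiguity noted in the text.</Paragraph>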
      <Paragraph position="5"> 5. Assessing the Quality of EDWARD's Referent Resolution Model. To assess the quality of EDWARD's referent resolution model, we collected a series of referring expressions, which were processed by three different referent resolution models: that of EDWARD, as described above; a very simplistic model; and the sophisticated and often-applied model proposed by Grosz and Sidner (1986). Since there are no benchmarks available to evaluate referent resolution models, we had subjects interact with EDWARD to compile a set of referring expressions. Usually, NL test sentences are made up by evaluators/designers themselves, but we think made-up test sentences may to some extent be unconsciously biased. In the course of developing EDWARD's referent resolution model, we used hundreds of test sentences made up by ourselves to debug and test the program. Real referring expressions, generated by users not familiar with the internal processes of the interpreter, provide a more solid empirical basis for evaluation. In Section 5.1, we present an overview of these user-generated referring expressions. In Section 5.2, we briefly describe the way the two alternative referent resolution models work. The results of feeding the test sentences to the three different referent resolution models are given in Section 5.3. In assessing the quality of a referent resolution model, it is, however, also necessary to analyze the internal workings of the model and determine the inherent limitations that follow from its design. In Section 5.4, we present the inherent limitations of EDWARD's referent resolution model as well as those of the two alternative models.</Paragraph>
    </Section>
    <Section position="6" start_page="71" end_page="72" type="sub_section">
      <SectionTitle>
5.1 A Test Set of Referring Expressions
</SectionTitle>
      <Paragraph position="0"> By having five subjects (two men and three women) interact with EDWARD, we obtained a total of 125 real, user-generated referring expressions. The subjects all had some previous experience with the system, but this was limited to 1 or 2 hours and dated from 2 to 3 months before. None of them had knowledge of the internal workings of EDWARD's referent resolution model. The subjects were to perform 19 tasks; most were information retrieval tasks, but some tasks involved effectuating a change in the file system. The subjects were not informed which words and syntactic and semantic constructs could be handled by the system and which could not, but they all knew from their previous encounters that the system was not an unrestricted NL interface. We did explicitly encourage the subjects to use the shortest referring expression possible whenever they felt it was appropriate. From earlier experiments with EDWARD (Huls and Bos 1993; Huls et al. 1993), we know that some users are reluctant to use referring expressions other than by name (probably due to the impact of command language interfaces for familiar file management systems). Examples of the tasks the subjects were to perform are the following:</Paragraph>
      <Paragraph position="3"> (1) Find out who is the boss of the NICI; followed by Find out who is the secretary of the NICI.</Paragraph>
      <Paragraph position="4"> (2) Find out who live in Nijmegen; followed by Find out whether all women living in Nijmegen work at the NICI.</Paragraph>
      <Paragraph position="5"> (3) Put a copy of this [experimenter points at leftmost file on screen] file in this [experimenter points again] directory.</Paragraph>
      <Paragraph position="6"> These tasks were supposed to induce inferential anaphors (1), plural referring expressions (2), spatial deixis (3), and multimodal referring expressions (3). As we expected, different subjects performed the tasks differently. Some, for example, needed two questions to find out who is the secretary of the NICI, others just one; two subjects indeed used the induced inferential anaphor. Table 4 shows several translated sample sentences taken from the set of sentences the five subjects keyed in to perform the 19 tasks. To show the variety in the use of referring expressions, we present under (a) the sentences with the largest number of deictic and anaphoric expressions keyed in by the subjects and under (b) those with the smallest number. For example, (19a) shows the sentence subject #4 used for task 19, with two pronouns, and (19b) shows subject #3's sentences with only one pronoun. The frequency with which the different types of referring expressions occurred can be found in Table 5, which gives a clearer view of the variety among subjects in their way of referring. (The types of referring expressions of Table 5 do not exactly match the four types mentioned in Table 3. Unimodal graphical deixis was not encouraged in the experiment and therefore did not occur; reference by name occurred frequently, but this type of reference is not considered to be deictic or anaphoric, and its interpretation is therefore less interesting from a computational linguistics point of view.) Finally, we present some data on the frequencies of use of the two most common words that can feature in both deictic and anaphoric expressions, viz., dit and deze (two demonstrative pronouns, respectively neuter and non-neuter). Table 6 shows the variety in use.</Paragraph>
    </Section>
    <Section position="7" start_page="72" end_page="73" type="sub_section">
      <SectionTitle>
5.2 Two Alternative Referent Resolution Models
</SectionTitle>
      <Paragraph position="0"> The sentences with the referring expressions described in the previous section were processed by EDWARD's referent resolution model and two alternative referent resolution models. The first alternative model is a very simplistic one. It simply takes the last mentioned semantically appropriate referent. For example, in the sequence The secretary is Hil. Where does she live? the pronoun she is taken to refer to the last mentioned female, in this case Hil. We implemented this Simplistic Model and provided EDWARD with a switch that selects whether sentences are processed with the original Context Model or with this alternative Simplistic Model. Each referent mentioned in the dialogue is put on a stack, and when interpreting a referring expression, the stack is processed from top to bottom. To prevent uncontrolled growth of the stack, we had the system discard the object at the bottom of the stack as soon as the stack length exceeded a certain maximum.</Paragraph>
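      <Paragraph> The Simplistic Model as described above can be approximated by a bounded stack. The sketch below is an assumption-laden illustration rather than the code used in EDWARD: the class name SimplisticModel, the representation of semantic appropriateness as a set of features, and the maximum lengths are all hypothetical.

```python
from collections import deque


class SimplisticModel:
    """Last-mentioned-referent resolution over a bounded stack."""

    def __init__(self, max_length=20):
        # A deque with maxlen drops the oldest (bottom) entry
        # automatically once the limit is exceeded.
        self.stack = deque(maxlen=max_length)

    def mention(self, name, features):
        """Push a referent that was just mentioned onto the stack."""
        self.stack.append((name, frozenset(features)))

    def resolve(self, required):
        """Return the most recently mentioned referent with all required features."""
        wanted = frozenset(required)
        for name, features in reversed(self.stack):
            if wanted.issubset(features):
                return name
        return None


model = SimplisticModel(max_length=2)
model.mention("Alice", {"person", "female"})
model.mention("Hil", {"person", "female"})
she = model.resolve({"female"})          # most recent female on the stack

model.mention("gr2_report", {"report"})  # pushes Alice off the bottom
remaining = [name for name, _ in model.stack]
it = model.resolve({"report"})

print(she, remaining, it)
```

Because only recency and semantic fit are considered, an earlier referent can never win over a later one with the same features, which is the characteristic weakness of this model.</Paragraph>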
      <Paragraph position="1"> The second alternative referent resolution model is that of Grosz and Sidner (1986).</Paragraph>
      <Paragraph position="2"> Their model consists of two separate mechanisms, each resolving a specific type of referring expression. The first mechanism is called focusing. Focusing is used to limit the information that must be considered in identifying the referents of certain classes of definite NPs. A stack is created in which the focus spaces corresponding to the discourse segment purposes are stored. All entities mentioned in a discourse segment purpose and all related entities (e.g., parts of mentioned entities) are stored in a focus space. New focus spaces are put on top of the focus stack, and the referent for an NP is searched for from the top down. For our data analysis, we assume a new discourse segment purpose for each new sentence. A second mechanism, called centering (or immediate focusing), is used for pronoun resolution. In brief, a backward-looking center is associated with each utterance in a discourse segment. Of all focused elements, the backward-looking center is the one that is central in that utterance. A combination of syntactic, semantic, and discourse information is used to identify the backward-looking center. The fact that some entity is the backward-looking center is used to constrain the search for the referent of a pronoun in a subsequent utterance. Unfortunately, Grosz and Sidner's model presupposes several sorts of information at moments when EDWARD's interpreter does not have these available. Consequently, we could use only a pen-and-paper analysis of how their model processes the test set of referring expressions.</Paragraph>
    </Section>
    <Section position="8" start_page="73" end_page="74" type="sub_section">
      <SectionTitle>
User's Input EDWARD's Output
</SectionTitle>
      <Paragraph position="0"> Table 4. A translated compilation of the sentences the subjects used to perform the tasks. Under (a), the sentences with the largest number of deictic and anaphoric expressions are given; under (b), the sentences with the smallest number of deictic and anaphoric expressions are given.</Paragraph>
    </Section>
    <Section position="9" start_page="74" end_page="75" type="sub_section">
      <SectionTitle>
5.3 How the Referent Resolution Models Dealt with the Test Set
</SectionTitle>
      <Paragraph position="0"> The sentences with 125 referring expressions entered by the five users to perform the 19 tasks were processed by the three referent resolution models. Table 7 shows the scores.</Paragraph>
      <Paragraph position="1"> 5.3.1 Context Model. EDWARD's Context Model determined the right referent in all 125 referring expressions. However, in the session of subject #2, we discovered an error in the interpretation of a dozen sentences this subject keyed in out of curiosity after she had completed the 19 tasks. She continued as follows (user's input, followed by EDWARD's output):</Paragraph>
      <Paragraph position="2"> User: Alice wrote him an e-mail. EDWARD: OK, information added.</Paragraph>
      <Paragraph position="3"> User: Put that e-mail in Dijkstra. EDWARD: OK, information added.</Paragraph>
      <Paragraph position="4"> User: Is this↗ the e-mail to Lou? EDWARD: No, this↗ one is.</Paragraph>
      <Paragraph position="5"> User: Which one? EDWARD: This↗ one.</Paragraph>
      <Paragraph position="6"> User: What is in this e-mail? EDWARD: Sorry, please rephrase.</Paragraph>
      <Paragraph position="7"> User: What is the topic of this e-mail? EDWARD: I don't know.</Paragraph>
      <Paragraph position="8"> User: Where is her e-mail to Lou?</Paragraph>
      <Paragraph position="9"> Here, the referent of the pronoun her was mentioned too long ago for EDWARD to be able to locate the referent Alice. EDWARD therefore had to ask the user Whom do you mean with 'her'?</Paragraph>
      <Paragraph position="10"> 5.3.2 Simplistic Model. [...] surprising: we counted only 6 misses. Task (15) particularly showed the restrictions of the Simplistic Model:</Paragraph>
      <Paragraph position="11"> (14) Which are the e-mails sent by Alice? Alice mailed this↗ e-mail to Wim, this↗ e-mail to Koen and this↗ e-mail to Carla.</Paragraph>
      <Paragraph position="12"> (15) She sent an e-mail about Bos to Wietske.</Paragraph>
      <Paragraph position="13"> The she of (15) is considered to refer to Carla, the last mentioned female, but the user actually referred to Alice. Similar problems occurred with this in task (16).</Paragraph>
      <Paragraph position="14"> 5.3.3 Grosz and Sidner. Using a pen-and-paper analysis of how the Grosz and Sidner Model processes the sentences, we think their model resolves all but 1 referring expression correctly. The only problem we encountered concerned the use of two pronouns in one sentence: he lives in her town. The original model excluded these double occurrences.</Paragraph>
    </Section>
    <Section position="10" start_page="75" end_page="76" type="sub_section">
      <SectionTitle>
5.4 Inherent Limitations of the Referent Resolution Models
</SectionTitle>
      <Paragraph position="0"> In this section, we describe several problems of the three referent resolution models that follow from their design but did not become apparent in the test set evaluation.</Paragraph>
      <Paragraph position="1"> First, EDWARD's Context Model and the Simplistic Model do not make any predictions about discourse intention. Discourse intentions play a primary role in explaining discourse structure, defining discourse coherence, and providing a coherent conceptualization of the term &quot;discourse&quot; (Grosz and Sidner 1986). Discourse intentions can provide clues for the beginning and ending of dialogues and subdialogues.</Paragraph>
      <Paragraph position="2"> Referent resolution can make use of this structure to exclude referents belonging to (sub)dialogues that have ended. Furthermore, subdialogues then do not interfere with the referent resolution of the main dialogue. Grosz and Sidner's theory of discourse structure, unlike the other two models, does address these problems.</Paragraph>
      <Paragraph position="3"> The Context Model obviously still lacks several factors that can influence the salience of a referent. An example is the different context effects of reference by a pronoun versus reference by a definite full-fledged NP. Grosz and Sidner mention this distinction but do not provide a thorough analysis of all the syntactic, semantic, and pragmatic rules that they envisage playing a role in either focusing or centering.</Paragraph>
      <Paragraph position="4"> A problem for all three of the referent resolution models is the resolution of cataphors. In contrast with anaphors, cataphors refer to instances that will be introduced later in the discourse (e.g., He will win who... ). All three models will (try to) locate the referent of he in the set of individual instances mentioned before. The resolution of cataphors, however, requires a lazier form of evaluation.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="76" end_page="77" type="metho">
    <SectionTitle>
6. Conclusions
</SectionTitle>
    <Paragraph position="0"> We have collected some indications about the quality of the Context Model for referent resolution that we implemented in our multimodal user interface EDWARD. We have compared the capabilities of this model with two alternative models, both empirically, using a test set of 125 user-generated referring expressions obtained from interactions with EDWARD, and analytically, by studying the inherent limitations that follow from the models' designs.</Paragraph>
    <Paragraph position="1"> On empirical grounds we conclude that the Simplistic Model, in which anaphoric expressions are considered to refer to the last mentioned semantically appropriate object, is inadequate. Though it performed far better than we anticipated, too often the wrong object is taken to be the referent.</Paragraph>
    <Paragraph position="2"> The quality of the other alternative model for referent resolution, the Grosz and Sidner Model, seems comparable to the quality of EDWARD's Context Model. As we understand the Grosz and Sidner Model, it processed 124 referring expressions correctly (but this may be inaccurate, since we do not have an implemented version of the model at our disposal). Furthermore, it will have problems with interpreting cataphora properly. EDWARD's Context Model performed well on all 125 test expressions, but it will also misinterpret cataphora. The Grosz and Sidner Model has a much broader scope. In particular, their model addresses the notion of discourse coherence. It would be interesting to explore how the insights of Grosz and Sidner with respect to discourse coherence can be used to elaborate EDWARD's Context Model so that it can deal with subdialogues.</Paragraph>
    <Paragraph position="3"> EDWARD's Context Model differs significantly from the Grosz and Sidner Model from an engineering and computational point of view. The Context Model is relatively simple. EDWARD never needs to figure out the type of an expression that is being analyzed: for all referring expressions, the most salient referent is chosen. Moreover, entities and relations are handled in a uniform fashion, and syntactic as well as perceptual influences on salience are incorporated into one model. The general applicability of the technique adds to its beauty. The language generation component uses it as well. Both components use the role-filler class restrictions, the cardinality information, and the role-set restrictions from the knowledge base, and they use the same CFs, with the same initial significance weights and the same decay functions of the Context Model. Grosz and Sidner propose a complex system of rules. In the Context Model, on the other hand, influences originating from different levels and types of processing are modeled by individual CFs, which are created and managed locally, i.e., by these processes themselves. As a result, the influences on an object's salience are represented in a distributed and independent fashion, which is attractive from a computational point of view.</Paragraph>
    <Paragraph position="4"> Furthermore, the addition of new CFs, which would require explicit detailed changes in Grosz and Sidner's rules, will be easier because the procedures that use the salience information can stay exactly the same.</Paragraph>
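    <Paragraph> The point about locally managed CFs can be illustrated with a small sketch. Everything in it is hypothetical: the class name ContextFactor, the assumption that an object's salience is the sum of the current weights of its CFs, and the particular weights and decay functions are chosen for illustration only and are not taken from EDWARD's code.

```python
class ContextFactor:
    """One influence on salience, carrying its own weight and decay."""

    def __init__(self, referent, weight, decay):
        self.referent = referent
        self.weight = weight
        self.decay = decay  # maps the current weight to the next weight

    def step(self):
        """Apply this CF's own decay function, never dropping below zero."""
        self.weight = max(0.0, self.decay(self.weight))


def salience(referent, factors):
    """Assumed here: salience is the sum of the weights of a referent's CFs."""
    return sum(f.weight for f in factors if f.referent == referent)


def most_salient(referents, factors):
    """The resolver only reads salience; it knows nothing about CF types."""
    return max(referents, key=lambda r: salience(r, factors))


referents = ["gr2_report", "donald_report"]

# A linguistic CF whose weight decreases by one at every step.
factors = [ContextFactor("gr2_report", 3.0, lambda w: w - 1.0)]
before = most_salient(referents, factors)

# Adding a new kind of CF (say, a one-shot gesture CF) needs no change
# to salience() or most_salient().
factors.append(ContextFactor("donald_report", 5.0, lambda w: 0.0))
after = most_salient(referents, factors)

for factor in factors:
    factor.step()
later = most_salient(referents, factors)

print(before, after, later)
```

In this sketch each CF decays independently, and the procedures that consume salience remain unchanged when a new CF type is introduced.</Paragraph>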
    <Paragraph position="5"> Though our empirical and analytical studies were only small and provide no firm basis for drawing conclusions, we do find indications that the quality of EDWARD's Context Model is to a large extent comparable to the quality of the more complex Grosz and Sidner Model. Therefore, if one is in need of a referent resolution model for a particular NL interpreter in a setting where subdialogues are rare, we think that EDWARD's Context Model is a good alternative to the complex rule system of Grosz and Sidner.</Paragraph>
    <Paragraph position="6"> The model is easy to build, to maintain, and to expand, and it is computationally fairly inexpensive.</Paragraph>
  </Section>
  <Section position="6" start_page="77" end_page="77" type="metho">
    <SectionTitle>
Acknowledgments
</SectionTitle>
    <Paragraph position="0"> This research was performed within the framework of the research programme 'Human-Computer Communication using natural language' (MMC). The MMC programme is sponsored by SPIN</Paragraph>
  </Section>
</Paper>