<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-1410">
  <Title>Algorithms for Generating Referring Expressions: Do They Do What People Do?</Title>
  <Section position="4" start_page="63" end_page="64" type="metho">
    <SectionTitle>
2 The Data
</SectionTitle>
    <Paragraph position="0"> Our human-produced referring expressions are drawn from a physical experimental setting consisting of four filing cabinets, each of which is four drawers high, located in a fairly typical academic office. The cabinets are positioned directly next to each other, so that the drawers form a four-by-four grid; each drawer is labelled with a number between 1 and 16 and is coloured either blue, pink, yellow, or orange. There are four drawers of each colour, distributed randomly over the grid, as shown in Figure 1.</Paragraph>
    <Paragraph position="1"> Subjects were given a randomly generated number between 1 and 16, and asked to produce a description of the numbered drawer using any properties other than the number. There were 20 participants in the experiment, resulting in a total of 140 referring expressions. Here are some examples of the referring expressions produced:
(1) the top drawer second from the right [d3]
(2) the orange drawer on the left [d9]
(3) the orange drawer between two pink ones [d12]
(4) the bottom left drawer [d16]
Since the selection of which drawer to describe was random, we do not have an equal number of descriptions of each drawer; in fact, the data set ranges from two descriptions of Drawer 1 to 12 descriptions of Drawer 16. One of the most obvious things about the data set is that even the same person may refer to the same entity in different ways on different occasions, with the differences being semantic as well as syntactic.</Paragraph>
    <Paragraph position="2"> We are interested in comparing how algorithms for referring expression generation differ in their outputs from what people do; since these algorithms produce distinguishing descriptions, we therefore removed from the data set 22 descriptions which were ambiguous or referred to a set of drawers. This resulted in a total of 118 distinct referring expressions, with an average of 7.375 distinct referring expressions per drawer.</Paragraph>
    <Paragraph position="3"> As the algorithms under scrutiny here are not concerned with the final syntactic realisation of the referring expression produced, we also normalised the human-produced data to remove superficial variations such as the distinction between relative clauses and reduced relatives, and between different lexical items that were synonymous in context, such as column and cabinet.</Paragraph>
    <Paragraph position="4"> Four absolute properties used for describing the drawers can be identified in the natural data produced by the human participants. These are the colour of the drawer; its row and column; and, in those cases where the drawer is situated in one of the corners of the grid, its cornerhood.2 (Footnote 2: A question we will return to below is that of how we decide whether to view a particular property as a one-place predicate or as a relation.) A number of the natural descriptions also made use of the</Paragraph>
    <Paragraph position="5">  following relational properties that hold between drawers: above, below, next to, right of, left of and between. In Table 1, Count shows the number of descriptions using each property, and the percentages show the ratio of the number of descriptions using each property to the number of descriptions for drawers that possess this property (hence, only 27 of the descriptions referred to corner drawers). We have combined all uses of relations into one row in this table to save space, since, interestingly, their overall use is far below that of the other properties: 103 descriptions (87.3%) did not use relations.
Most algorithms in the literature aim at generating descriptions that are as short as possible, but will under certain circumstances produce redundancy. Some authors, for example van Deemter and Halldórsson (2001), have suggested that human-produced descriptions are often not minimal, and this is an intuition we would generally agree with. However, a strong tendency towards minimality is evident in the human-produced data here: only 29 out of 118 descriptions (24.6%) contain redundant information. Here are a few examples:
* the yellow drawer in the third column from the left second from the top [d6]
* the blue drawer in the top left corner [d1]
* the orange drawer below the two yellow drawers [d14]
In the first case, either the colour or the column property is redundant; in the second, colour and corner, or the grid information alone, would have been sufficient; and in the third, it would have been sufficient to mention only one of the two yellow drawers.</Paragraph>
  </Section>
  <Section position="5" start_page="64" end_page="65" type="metho">
    <SectionTitle>
3 Knowledge Representation
</SectionTitle>
    <Paragraph position="0"> In order to use an algorithm to generate referring expressions in this domain, we must first decide how to represent the domain. It turns out that this raises some interesting questions.</Paragraph>
    <Paragraph position="1"> We use the symbols {d1, d2, ..., d16} as our unique identifying labels for the 16 drawers.</Paragraph>
    <Paragraph position="2"> Given some di, the goal of any given algorithm is then to produce a distinguishing description of that entity with respect to a context consisting of the other 15 drawers.</Paragraph>
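The success criterion just stated — a description true of the intended referent and of no other entity in the context — can be sketched as a small check. The helper name and the toy three-drawer context below are ours, for illustration only:

```python
def is_distinguishing(description, referent, entities):
    """True iff `description` (an iterable of (attribute, value) pairs)
    is true of `referent` and of no other entity in the context."""
    matches = {name for name, props in entities.items()
               if all(props.get(a) == v for a, v in description)}
    return matches == {referent}


# Hypothetical toy context (colours for d1 and d2 follow the examples
# in the text; d5's properties are invented for illustration).
context = {"d1": {"colour": "blue", "row": 1},
           "d2": {"colour": "orange", "row": 1},
           "d5": {"colour": "blue", "row": 2}}
```

Here `[("colour", "blue"), ("row", 1)]` distinguishes d1, while `[("row", 1)]` does not, since d2 matches it as well.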
    <Paragraph position="3"> As is usual, we represent the properties of the domain in terms of attribute-value pairs. Thus we have, for example:
* d2: &lt;colour, orange&gt;, &lt;row, 1&gt;, &lt;column, 2&gt;, &lt;right-of, d1&gt;, &lt;left-of, d3&gt;, &lt;next-to, d1&gt;, &lt;next-to, d3&gt;, &lt;above, d7&gt;
This drawer is in the top row, so it does not have a property of the form &lt;below, di&gt;.</Paragraph>
    <Paragraph position="4"> The four corner drawers additionally possess the property &lt;position, corner&gt; . Cornerhood can be inferred from the row and column information; however, we added this property explicitly because several of the natural descriptions use the property of cornerhood, and it seems plausible that this is a particularly salient property in its own right.</Paragraph>
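Since cornerhood is inferable from row and column, the inference can be sketched as follows, assuming a (row, column) coordinate scheme over the four-by-four grid (the actual drawer-number-to-cell mapping is given by Figure 1 and is not assumed here):

```python
GRID_SIZE = 4  # the drawers form a four-by-four grid

def is_corner(row, col, size=GRID_SIZE):
    """Cornerhood inferred from row and column coordinates."""
    return row in (1, size) and col in (1, size)
```

Exactly four of the sixteen cells satisfy this predicate, matching the four corner drawers that carry the explicit &lt;position, corner&gt; property.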
    <Paragraph position="5"> This raises the question of what properties should be encoded explicitly, and which should be inferred. Note that in the example above, we explicitly encode relational properties that could be computed from others, such as left-of and rightof. Since none of the algorithms explored here uses inference over knowledge base properties, we opted here to 'level the playing field' to enable fairer comparison between human-produced and machine-produced descriptions.</Paragraph>
    <Paragraph position="6"> A similar question of the role of inference arises with regard to the transitivity of spatial relations. For example, if d1 is above d9 and d9 is above d16, then it can be inferred that d1 is transitively above d16. In a more complex domain, the implementation of this kind of knowledge might play an important role in generating useful referring expressions. However, the uniformity of our domain results in this inferred knowledge about transitive relations being of little use; in fact, in most cases, the implementation of transitive inference might even result in the generation of unnatural descriptions, such as the orange drawer (two) right of the blue drawer for d12.</Paragraph>
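The transitive inference mentioned above (d1 above d9, d9 above d16, hence d1 transitively above d16) amounts to taking the transitive closure of the above relation; a naive fixed-point sketch, with the relation stored as a set of (higher, lower) pairs:

```python
def transitive_closure(pairs):
    """Smallest transitive relation containing `pairs`."""
    closure = set(pairs)
    while True:
        # join: whenever (a, b) and (b, d) hold, (a, d) must also hold
        extra = {(a, d) for (a, b) in closure for (c, d) in closure
                 if b == c} - closure
        if not extra:
            return closure
        closure |= extra
```

Applied to {("d1", "d9"), ("d9", "d16")}, this adds the inferred pair ("d1", "d16").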
    <Paragraph position="7"> Another aspect of the representation of relations that requires a decision is that of generalisation:  next-to is a generalisation of the relations left-of and right-of. The only algorithm of those we examine here that provides a mechanism for exploring a generalisation hierarchy is the Incremental Algorithm (Reiter and Dale, 1992), and this cannot handle relations; so, we take the shortcut of explicitly representing the next-to relation for every left-of and right-of relation in the knowledge base. We then implement special-case handling that ensures that, if one of these facts is used, the more general or more specific case is also deleted from the set of properties still available for the description.3</Paragraph>
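The special-case handling just described can be sketched as follows, assuming relational properties are stored as (relation, target) pairs; the helper name prune_related and the GENERAL_OF table are ours, not part of the original implementation:

```python
# next-to is the generalisation of left-of and right-of
GENERAL_OF = {"left-of": "next-to", "right-of": "next-to"}

def prune_related(available, used_relation, target):
    """Once (used_relation, target) has been used in a description,
    also remove its generalisation (or, for next-to, its
    specialisations) for the same target from the still-available
    properties."""
    drop = {used_relation}
    if used_relation in GENERAL_OF:
        drop.add(GENERAL_OF[used_relation])          # drop the generalisation
    else:
        drop |= {s for s, g in GENERAL_OF.items()
                 if g == used_relation}              # drop the specialisations
    return {(r, t) for (r, t) in available
            if not (t == target and r in drop)}
```

For example, once left-of d3 has been used, next-to d3 is no longer available, and vice versa.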
  </Section>
  <Section position="6" start_page="65" end_page="65" type="metho">
    <SectionTitle>
4 The Algorithms
</SectionTitle>
    <Paragraph position="0"> As we have already noted above, there is a considerable literature on the generation of referring expressions, and many papers in the area provide detailed algorithms. We focus here on the following algorithms: * The Full Brevity algorithm (Dale, 1989) attempts to build a minimal distinguishing description by always selecting the most discriminatory property available; see Algorithm 1.</Paragraph>
    <Paragraph position="1"> Let L be the set of properties to be realised in our description; let P be the set of properties known to be true of our intended referent r (we assume that P is non-empty); and let C be the set of distractors (the contrast set). The initial conditions are thus as follows:
- C = {&lt;all distractors&gt;};
- P = {&lt;all properties true of r&gt;};
- L = {}
In order to describe the intended referent r with respect to the contrast set C, we do the following:
1. Check Success: if |C| = 0 then return L as a distinguishing description.
(Footnote 3: ... for some mechanism for handling what we might think of as equivalence classes of properties, and this is effectively a simple approach to this question.)</Paragraph>
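The selection loop just described — always take the most discriminatory property available until no distractors remain — can be sketched as follows. The function and the toy domain in the usage note are ours; ties between equally discriminatory properties fall back on dictionary order here, a point the paper takes up in Section 5.1:

```python
def full_brevity(referent, entities):
    """Greedy most-discriminatory-first selection, as described above.
    `entities` maps entity names to attribute-value dictionaries.
    Returns a list of (attribute, value) pairs, or None on failure."""
    C = {e for e in entities if e != referent}   # distractors
    P = dict(entities[referent])                 # properties true of r
    L = []                                       # description so far
    while C:                                     # Check Success: |C| = 0?
        best, ruled_out = None, set()
        for attr, val in P.items():              # Choose Property
            out = {e for e in C if entities[e].get(attr) != val}
            if len(out) > len(ruled_out):
                best, ruled_out = (attr, val), out
        if not ruled_out:
            return None                          # r cannot be distinguished
        L.append(best)                           # Extend Description
        C -= ruled_out
        del P[best[0]]
    return L
```

On a toy context where "c" is the only pink entity, `full_brevity("c", ...)` returns the single pair `("colour", "pink")`; an entity that shares each single property with some distractor needs two pairs.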
    <Paragraph position="2"> 1. Check Success: if Stack is empty then return L as a distinguishing description; elseif |Cv| = 1 then pop Stack and goto Step 1; elseif Pr = ∅ then fail; else goto Step 2.
2. Choose Property: for each property pi ∈ Pr do ...</Paragraph>
    <Paragraph position="4"> * The Incremental Algorithm (Reiter and Dale, 1992) selects from the available properties to be used in a description via a preference ordering over those properties; see Algorithm 3.</Paragraph>
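The incremental strategy — walking a fixed preference ordering and keeping any property of the referent that rules out at least one remaining distractor — can be sketched as follows. This is our illustration, with the FindBestValue mechanism omitted (as in the implementation described below); the toy domain in the test shows how the chosen ordering can yield a redundant description:

```python
def incremental(referent, entities, preference):
    """Incremental selection over a preference ordering of attributes.
    Returns a list of (attribute, value) pairs, or None on failure."""
    C = {e for e in entities if e != referent}   # distractors
    L = []
    for attr in preference:
        if not C:                                # success: all ruled out
            break
        val = entities[referent].get(attr)
        if val is None:                          # r lacks this attribute
            continue
        out = {e for e in C if entities[e].get(attr) != val}
        if out:                                  # keep any helpful property
            L.append((attr, val))
            C -= out
    return L if not C else None
```

With the ordering ["colour", "corner", "row"] on the toy domain below, the algorithm keeps all three properties even though ["corner", "row"] alone would distinguish the referent; reordering the preference removes the redundancy.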
    <Paragraph position="5"> For the purpose of this study, the algorithms were implemented in Common LISP. The mechanism described in (Dale and Reiter, 1995) to handle generalisation hierarchies for values for the different properties, referred to in the algorithm here as FindBestValue, was not implemented since, as discussed earlier, our representation of the domain does not make use of a hierarchy of properties.</Paragraph>
  </Section>
  <Section position="7" start_page="65" end_page="65" type="metho">
    <SectionTitle>
5 The Output of the Algorithms
</SectionTitle>
    <Paragraph position="0"> Using the knowledge base described in Section 3, we applied the algorithms from the previous section to see whether the referring expressions they produced were the same as, or similar to, those produced by the human subjects. This quickly gave rise to some situations not explicitly addressed by some of the algorithms; we discuss these in Section 5.1 below. Section 5.2 discusses the extent to which the behaviour of the algorithms matched that of the human data.</Paragraph>
    <Section position="1" start_page="65" end_page="65" type="sub_section">
      <SectionTitle>
5.1 Preference Orderings
</SectionTitle>
      <Paragraph position="0"> The Incremental Algorithm explicitly encodes a preference ordering over the available properties, in an attempt to model what appear to be semi-conventionalised strategies for description that people use. This also has the consequence of avoiding a problem that faces the other two algorithms: since the Full Brevity Algorithm and the Relational Algorithm choose the most discriminatory property at each step, they have to deal with the case where several properties are equally discriminatory. This turns out to be a common situation in our domain. Both algorithms implicitly assume that the choice will be made randomly in these cases; however, it seems to us more natural to control this process by imposing some selection strategy. We do this here by borrowing the idea of preference ordering from the Incremental Algorithm, and using it as a tie-breaker when multiple properties are equally discriminatory.</Paragraph>
      <Paragraph position="1"> Not including type information (i.e., the fact that some di is a drawer), which has no discriminatory power and therefore will never be chosen by any of the algorithms,4 there are only four different properties available for the Full Brevity Algorithm and the Incremental Algorithm: row, column, colour, and position. This gives us 4! = 24 different possible preference orderings. Since some of the human-produced descriptions use all four properties, we tested these two algorithms with all 24 preference orderings.</Paragraph>
      <Paragraph position="2"> For the Relational Algorithm, we added the five relations next to, left of, right of, above, and below. This results in 9! = 362,880 possible preference orderings; far too many to test. Since we are primarily interested in whether the algorithm can generate the human-produced descriptions, we restricted our testing to those preference orderings that started with a permutation of the properties used by the participants; in addition to the 24 preference orderings above, there are 12 preference orderings that incorporate the relational properties.</Paragraph>
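The ordering counts above are easy to verify directly; a quick sketch:

```python
from itertools import permutations

absolute = ["row", "column", "colour", "position"]
relations = ["next to", "left of", "right of", "above", "below"]

# 4! orderings over the absolute properties; 9! once the five
# relations are added.
n_absolute = sum(1 for _ in permutations(absolute))
n_full = sum(1 for _ in permutations(absolute + relations))
```

This yields 24 orderings for the Full Brevity and Incremental Algorithms and 362,880 for the Relational Algorithm, which is why the latter's testing had to be restricted.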
    </Section>
    <Section position="2" start_page="65" end_page="65" type="sub_section">
      <SectionTitle>
5.2 Coverage of the Human Data
</SectionTitle>
      <Paragraph position="0"> Overall, the Full Brevity Algorithm is able to generate 82 out of the 103 non-relational descriptions from the natural data, providing a recall of 79.6%.</Paragraph>
      <Paragraph position="1"> The recall score for the Incremental Algorithm is 95.1%, generating 98 of the 103 descriptions. As these algorithms do not attempt to generate relational descriptions, the relational data is not taken into account in evaluating the performance here.</Paragraph>
      <Paragraph position="2"> Both algorithms are able to generate all the non-relational minimal descriptions found in the human-produced data. The Full Brevity Algorithm unintentionally replicates the redundancy found in nine descriptions, and the Incremental Algorithm produces all but five of the 29 redundant descriptions.</Paragraph>
      <Paragraph position="3"> Perhaps surprisingly, the Relational Algorithm does not generate any of the human-produced descriptions. We will return to consider why this is the case in the next section.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="65" end_page="68" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> There are two significant differences to be considered here: first, the coverage of redundant descriptions by the Full Brevity and Incremental Algorithms; and second, the inability of the Relational Algorithm to replicate any of the human data.
(Footnote 4: Consistent with much other work in the field, we assume that the head noun will always be added irrespective of whether it has any discriminatory power.)</Paragraph>
    <Section position="1" start_page="67" end_page="67" type="sub_section">
      <SectionTitle>
6.1 Coverage of Redundancy
</SectionTitle>
      <Paragraph position="0"> Neither the Full Brevity Algorithm nor the Incremental Algorithm is designed to generate relational descriptions; however, with at least one of the preference orderings, both algorithms are able to produce each of the minimal descriptions from the set of natural data. Both also generate several of the redundant descriptions in the natural data set, but neither captures all of the human-generated redundancies.</Paragraph>
      <Paragraph position="1"> The Full Brevity Algorithm has as a primary goal the avoidance of redundant descriptions, so it is a sign of the algorithm being consistent with its specification that it covers fewer of the redundant expressions than the Incremental Algorithm.</Paragraph>
      <Paragraph position="2"> On the other hand, the fact that it produces any redundant descriptions at all signals that the algorithm does not quite meet its specification. The Full Brevity Algorithm produces redundancy when an entity shares at least two property-values with another entity and, after one of these shared properties has been chosen, the next property to be considered is the other shared one, because that property has the same or a higher discriminatory power than all the remaining properties. This is a situation that was not considered in the original algorithm; it is related to the problem, noted earlier, of what to do when two properties have the same discriminatory power. In our domain, the situation arises for corner drawers with the same colour (d4 and d16), and for drawers that are not in a corner but for which there is another drawer of the same colour in both the same row and the same column (d7 and d8).</Paragraph>
      <Paragraph position="3"> The Incremental Algorithm, on the other hand, generates redundancy when an object shares at least two property-values with another object and the two shared properties are the first to be considered in the preference ordering. This is possible for corner drawers with the same colour (d4 and d16) and for drawers for which there is another drawer of the same colour in either the same row, the same column, or both (d5, d6, d7, d8, d10, d11, d13, d15).</Paragraph>
      <Paragraph position="4"> In these terms, the Incremental Algorithm is clearly a better model of the human behaviour than the Full Brevity Algorithm. However, we may ask why the algorithm does not cover all the redundancy found in the human descriptions. The redundant descriptions which the algorithm does not generate are as follows:
(5) the blue drawer in the top left corner [d1]
(6) the yellow drawer in the top right corner [d4]
(7) the pink drawer in the top of the column second from the right [d3]
(8) the orange drawer in the bottom second from the right [d14]
(9) the orange drawer in the bottom of the second column from the right [d14]
The Incremental Algorithm stops selecting properties when a distinguishing description has been constructed. In Example (6), for example, the algorithm would select any of the following, depending on the preference ordering used:
(10) the yellow drawer in the corner
(11) the top left yellow drawer
(12) the drawer in the top left corner
The human subject, however, has added information beyond what is required. This could be explained by our modelling of cornerhood: in Examples (5) and (6), one has the intuition that the noun corner is being added simply to provide a nominal head to the prepositional phrase in an incrementally-constructed expression of the form the blue drawer in the top right ..., in much the same way as the head noun drawer is added, whereas we have treated it as a distinct property that adds discriminatory power. This again emphasises the important role the underlying representation plays in the generation of referring expressions: if we want to emulate what people do, then we not only need to design algorithms which mirror their behaviour, but these algorithms have to operate over the same kind of data.</Paragraph>
    </Section>
    <Section position="2" start_page="67" end_page="68" type="sub_section">
      <SectionTitle>
6.2 Relational Descriptions
</SectionTitle>
      <Paragraph position="0"> The fact that the Relational Algorithm generates none of the human-generated descriptions is quite disturbing. On closer examination, it transpires that this is because, in this domain, the discriminatory power of relational properties is almost always greater than that of any other property, so a relational property is chosen first. As noted earlier, relational properties appear to be dispreferred in the human data, so the Relational Algorithm is already disadvantaged. The relatively poor performance of the algorithm is then compounded by its insistence on continuing to use relational properties: an absolute property will only be chosen when either the currently described drawer has no unused relational properties left, or the number of distractors has been reduced so much that the discriminatory power of all remaining relational properties is lower than that of the absolute property, or the absolute property has the same discriminatory power as the best relational one and the absolute property appears before all relations in the preference ordering.</Paragraph>
      <Paragraph position="1"> Consequently, whereas a typical human description of drawer d2 would be the orange drawer above the blue drawer, the Relational Algorithm will produce the description the drawer above the drawer above the drawer above the pink drawer.</Paragraph>
      <Paragraph position="2"> Not only are there no descriptions of this form in the human-produced data set, but they also sound more like riddles someone might create to intentionally make it hard for the hearer to figure out what is meant.</Paragraph>
      <Paragraph position="3"> There are a variety of ways in which the behaviour of this algorithm might be repaired. We are currently exploring whether Krahmer et al.'s (2003) graph-based approach to GRE is able to provide a better coverage of the data: this algorithm provides the ability to make use of different search strategies and weighting mechanisms when adding properties to a description, and such a mechanism might be used, for example, to counterbalance the Relational Algorithm's heavy bias towards the relations in this domain.</Paragraph>
    </Section>
  </Section>
</Paper>