<?xml version="1.0" standalone="yes"?> <Paper uid="W03-0603"> <Title>Understanding Complex Visually Referring Utterances</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 A Spatial Description Task </SectionTitle> <Paragraph position="0"> We designed a task that requires people to describe objects in computer-generated scenes containing up to 30 objects with random positions on a virtual surface. The objects all had identical shapes and sizes, and were either green or purple in colour. Each of the objects had a 50% chance of being green, otherwise it was purple. This design naturally led speakers to make reference to spatial aspects of the scene, rather than the individual object properties which subjects tended to use in our previous work (Roy, 2002). We refer to this task as the Bishop task, and to the resulting language understanding model and implemented system simply as Bishop.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Data Collection </SectionTitle> <Paragraph position="0"> Participants in the study ranged in age from 22 to 30 years, and included both native and non-native English speakers. Pairs of participants were seated with their backs to each other, each person looking at a computer screen which displayed a scene such as that in Figure 1.</Paragraph> <Paragraph position="1"> Each screen showed the same scene. In each pair, one participant served as describer, and the other as listener.</Paragraph> <Paragraph position="2"> The describer wore a microphone that was used to record his or her speech. The describer used a mouse to select an object from the scene, and then verbally described the selected object to the listener. The listener's task was to select the same object on their own computer display based on the verbal description. If the selected objects matched, they disappeared from the scene and the describer would select and describe another object. If they did not match, the describer would re-attempt the description until understood by the listener. The scene contained 30 objects at the beginning of each session, and a session ended when no objects remained, at which point the describer and listener switched roles and completed a second session (some participants fulfilled a role multiple times).</Paragraph> <Paragraph position="3"> Initially, we collected 268 spoken object descriptions from 6 participants. The raw audio was segmented using our speech segmentation algorithm based on pause structure (Yoshida, 2002). Along with the utterances, the corresponding scene layout and target object identity were recorded together with the times at which objects were selected. This 268-utterance corpus is referred to as the development data set. We manually transcribed each spoken utterance verbatim, retaining all speech errors (false starts and various other ungrammaticalities). Off-topic speech events (laughter, questions about the task, other remarks, and filled pauses) were marked as such (they do not appear in any results we report). We wrote a simple heuristic algorithm that pairs utterances with selections based on their time stamps.
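The paper does not give the details of this pairing heuristic; the following is a minimal sketch of one plausible time-stamp-based pairing, assuming each utterance and each selection carries a time offset from the start of the session. All names here are hypothetical, not Bishop's actual code.

```python
from dataclasses import dataclass

@dataclass
class Utterance:
    start: float   # seconds from session start
    text: str

@dataclass
class Selection:
    time: float    # when the describer clicked the object
    object_id: int

def pair_utterances_with_selections(utterances, selections, max_gap=10.0):
    """Greedily pair each selection with the utterance whose start time is
    closest to the selection time, ignoring candidates further than max_gap
    seconds away.  A sketch only: the actual heuristic used for Bishop is not
    specified beyond being time-stamp based."""
    pairs = []
    for sel in selections:
        candidates = [u for u in utterances if abs(u.start - sel.time) <= max_gap]
        if candidates:
            best = min(candidates, key=lambda u: abs(u.start - sel.time))
            pairs.append((best, sel))
    return pairs
```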
When we report numbers of utterances in data sets in this paper, they correspond to how many utterance-selection pairs our pairing algorithm produces.</Paragraph> <Paragraph position="4"> Once our implementation based on the development corpus yielded acceptable results, we collected another 179 spoken descriptions from three additional participants to evaluate generalization and coverage of our approach. The discussion and analysis in the following sections focuses on the development set. In Section 6 we discuss performance on the test set.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Descriptive Strategies for Achieving Joint Reference </SectionTitle> <Paragraph position="0"> We distinguish three subsets of our development data: 1) a set containing those utterance/selection pairs that contain errors, where an error can be due to a repair or mistake on the human speaker's part, a segmentation mistake by our speech segmenter, or an error by our utterance/selection pairing algorithm, 2) a set that contains those utterance/selection pairs that employ descriptive strategies other than those we cover in our computational understanding system (we cover those in Sections 2.2.1 to 2.2.5), and 3) the set of utterance/selection pairs in the development data that are not a member of either subset described above. We refer to this last subset as the 'clean' set. Note that the first two subsets are not mutually exclusive. As we catalogue descriptive strategies from the development data in the following sections, we report two percentages for each descriptive strategy. The first is the percentage of utterance/selection pairs that employ a specific descriptive strategy relative to all the utterance/selection pairs in the development data set. The second is the percentage of utterance/selection pairs relative to the clean set of utterance/selection pairs, as described above.</Paragraph> <Paragraph position="1"> Almost every utterance employs colour to pick out objects. While designing the task, we intentionally trivialized the problem of colour reference. Objects come in only two distinct colours, green and purple. Unsurprisingly, all participants used the terms &quot;green&quot; and &quot;purple&quot; to refer to these colours. Participants used colour to identify one or more objects in 96% of the data, and 95% of the clean data.</Paragraph> <Paragraph position="2"> The second most common descriptive strategy is to refer to spatial extremes within groups of objects and to spatial regions in the scene. The example in Figure 2 uses two spatial terms to pick out its referent: &quot;front&quot; and &quot;left&quot;, both of which leverage spatial extrema to direct the listener's attention. Multiple spatial specifications tend to be interpreted in left-to-right order, that is, selecting a group of objects matching the first term, then amongst those choosing objects that match the second term.</Paragraph> <Paragraph position="3"> [Figure 2: &quot;the purple one in the front left corner&quot; - an utterance referring to spatial extrema] Being rather ubiquitous in the data, spatial extrema and spatial regions are often used in combination with other descriptive strategies like grouping, but are most frequently combined with other extrema and region specifications. Participants used single spatial extrema to identify one or more objects in 72% of the data, and in 78% of the clean data. They used spatial region specifications in 20% of the data (also 20% of the clean data), and combined multiple extrema or regions in 28% (30% of the clean data).</Paragraph>
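As an illustration of the left-to-right interpretation described above, a phrase like &quot;front left&quot; can be read as successive filtering: first keep the objects that best match &quot;front&quot;, then choose among those the ones that best match &quot;left&quot;. The sketch below is only illustrative; the coordinate conventions, scoring functions and the keep_ratio cut-off are assumptions, not taken from the paper.

```python
def front_score(obj):
    # Assumed convention: smaller y means closer to the front of the surface.
    return -obj["y"]

def left_score(obj):
    # Assumed convention: smaller x means further to the left.
    return -obj["x"]

def keep_best(objects, score, keep_ratio=0.3):
    """Keep roughly the top-scoring fraction of the candidate objects."""
    ranked = sorted(objects, key=score, reverse=True)
    k = max(1, int(len(ranked) * keep_ratio))
    return ranked[:k]

def interpret_left_to_right(objects, terms):
    """Apply each spatial term in order, narrowing the candidate set each time."""
    scorers = {"front": front_score, "left": left_score}
    candidates = list(objects)
    for term in terms:
        candidates = keep_best(candidates, scorers[term])
    return candidates

# e.g. interpret_left_to_right(scene_objects, ["front", "left"])
```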
<Paragraph position="4"> To provide landmarks for spatial relations and to specify sets of objects to select from, participants used language to describe groups of objects. Figure 3 shows an example of such grouping constructs, which uses a count to specify the group (&quot;three&quot;). In this example, the participant first specifies a group containing the target object, then utters another description to select within that group. Note that grouping alone never yields an individual reference, so participants compose grouping constructs with further referential tactics (predominantly extrema and spatial relations) in all cases. Participants used grouping to identify objects in 12% of the data and 10% of the clean data.</Paragraph> <Paragraph position="5"> As already mentioned in Section 2.2.3, participants sometimes used spatial relations between objects or groups of objects. Examples of such relations are expressed through prepositions like &quot;below&quot; or &quot;behind&quot; as well as phrases like &quot;to the left of&quot; or &quot;in front of&quot;. Figure 4 shows an example that involves a spatial relation between individual objects. The spatial relation is combined with another strategy, here an extremum (as well as two speech errors by the describer). Participants used spatial relations in 6% of the data (7% of the clean data). [Figure 4: &quot;there's a purple cone that's it's all the way on the left hand side but it's it's below another purple&quot; - an utterance combining a spatial relation with an extremum] In a number of cases participants used anaphoric references to the previous object removed during the description task. Figure 5 shows a sequence of two scenes and corresponding utterances in which the second utterance refers back to the object selected in the first. Participants employed such anaphoric references in 4% of the data (3% of the clean data).</Paragraph> <Paragraph position="6"> [Figure 5: &quot;the closest purple one on the far left side&quot; followed by &quot;the green one right behind that one&quot; - a sequence of utterances in which the second refers back anaphorically to the first] In addition to the phenomena listed in the preceding sections, participants used a small number of other description strategies. Some that occurred more than once but that we have not yet addressed in our computational model are selection by distance (lexicalised as &quot;close to&quot; or &quot;next to&quot;), selection by neighbourhood (&quot;the green one surrounded by purple ones&quot;), selection by symmetry (&quot;the one opposite that one&quot;), and selection by something akin to local connectivity (&quot;the lone one&quot;). We annotated 13% of our data as containing descriptive strategies other than the ones covered in the preceding sections. We marked 15% of our data as containing errors.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Understanding Framework </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 Synthetic Vision </SectionTitle> <Paragraph position="0"> Instead of relying on the information we use to render the scenes in Bishop, which includes 3D object locations and the viewing angle, we implemented a simple synthetic vision algorithm to ease a future transfer back to a robot's vision system. This algorithm produces a map attributing each pixel of the rendered image to one of the objects or the background.
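The paper does not detail how this pixel map is computed; one common way to obtain it from a rendered scene is to re-render each object in a flat, unique ID colour into an offscreen buffer and read the pixels back. The sketch below assumes such an ID buffer is available; the function names are hypothetical.

```python
import numpy as np

BACKGROUND = 0  # assumed id for background pixels

def build_object_masks(id_buffer):
    """id_buffer: H x W integer array in which each pixel holds the id of the
    object drawn there (0 for background).  Returns a boolean mask per object."""
    masks = {}
    for obj_id in np.unique(id_buffer):
        if obj_id == BACKGROUND:
            continue
        masks[int(obj_id)] = id_buffer == obj_id
    return masks

def average_colour(rgb_image, mask):
    """Average RGB colour over the pixels attributed to one object; values like
    this feed the probabilistic colour composer described later in Section 4.1."""
    return rgb_image[mask].mean(axis=0)
```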
In addition, we use the full colour information for each pixel drawn in the rendered scene.</Paragraph> <Paragraph position="1"> We chose to work in a virtual world for this project so that we could freely change colour, number, size, shape and arrangement of objects to elicit interesting verbal behaviours in our participants.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Lexical Entries and Concepts </SectionTitle> <Paragraph position="0"> Conceptually, we treat lexical entries like classes in an object-oriented programming language. When instantiated, they maintain an internal state that can be as simple as a tag identifying the dimension along which to perform an ordering, or as complex as multidimensional probability distributions. Each entry can contain a semantic composer that encapsulates the function to combine this entry with other constituents during a parse. These composers are described in depth in Section 4. The lexicon used for Bishop contains many lexical entries attaching different semantic composers to the same word. For example, &quot;left&quot; can be either a spatial relation or an extremum, which may be disambiguated by grammatical structure during parsing.</Paragraph> <Paragraph position="1"> During composition, structures representing the objects a constituent references are passed between lexical entries. We refer to these structures as concepts. Each entry accepts zero or more concepts, and produces zero or more concepts as the result of the composition operation.</Paragraph> <Paragraph position="2"> A concept lists the entities in the world that are possible referents of the constituent it is associated with, together with real numbers representing their ranking due to the last composition operation.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.3 Parsing </SectionTitle> <Paragraph position="0"> We use a bottom-up chart parser to guide the interpretation of phrases (Allen, 1995). Such a parser has the advantage that it employs a dynamic programming strategy to efficiently reuse already computed subtrees of the parse. Furthermore, it produces all subcomponents of a parse and thus yields a usable result without the need to parse to a specific symbol.</Paragraph> <Paragraph position="1"> Bishop performs only a partial parse: the parse is not required to cover a whole utterance; instead, the longest referring parsed segments are taken to be the best guess.</Paragraph> <Paragraph position="2"> Unknown words do not stop the parse process. Rather, all constituents that would otherwise end before the unknown word are taken to include the unknown word, in essence making unknown words invisible to the parser and the understanding process. In this way we recover essentially all grammatical chunks and relations that are important to understanding in our restricted task.</Paragraph> <Paragraph position="3"> We use a simple grammar containing 19 rules. With each rule, we associate an argument structure for semantic composition. When a rule is syntactically complete during a parse, the parser checks whether the composers of the constituents in the tail of the rule can accept the number of arguments specified in the rule.
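To make the data flow concrete, the following is a rough sketch of how lexical entries, concepts, and the arity check on a completed rule could be organised. The class and field names are illustrative assumptions, not Bishop's actual implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

@dataclass
class Concept:
    """Possible referents of a constituent: object id -> reference strength."""
    strengths: Dict[int, float] = field(default_factory=dict)

@dataclass
class LexicalEntry:
    word: str
    category: str                                    # e.g. "N", "CADJ", "P"
    num_args: int                                     # arguments its composer expects
    composer: Optional[Callable[[List[Concept]], Concept]] = None

def apply_completed_rule(rule_num_args: int, entry: LexicalEntry,
                         arg_concepts: List[Concept]) -> Optional[Concept]:
    """Check that the composer of a tail constituent accepts the number of
    arguments the rule specifies; if it does, call it on the argument concepts
    to produce the concept for the head of the rule."""
    if entry.composer is None or entry.num_args != rule_num_args:
        return None
    return entry.composer(arg_concepts)
```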
If so, it calls the semantic composer associated with the constituent, passing the concepts yielded by its arguments, to produce a concept for the head of the rule.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Semantic Composition </SectionTitle> <Paragraph position="0"> Most of the composers presented follow the same composition schema: they take one or more concepts as arguments and yield another concept that references a possibly different set of objects. Composers may introduce new objects, even ones that do not exist in the scene as such, and they may introduce new types of objects (e.g., groups of objects referenced as if they were one object).</Paragraph> <Paragraph position="2"> Most composers first convert an incoming concept to the objects it references, and subsequently perform computations on these objects. If ambiguities persist at the end of understanding an utterance (multiple possible referents exist), we let Bishop choose the one with maximum reference strength.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Colour - Probabilistic Attribute Composers </SectionTitle> <Paragraph position="0"> As mentioned in Section 3.1, we chose not to exploit the information used to render the scene, and therefore must recover colour information from the final rendered image. The colour average for the 2D projection of each object varies due to occlusion by other objects, as well as distance from and angle with the virtual camera. We separately collected a set of labelled instances of &quot;green&quot; and &quot;purple&quot; cones, and estimated a three-dimensional Gaussian distribution from the average red, green and blue values of each pixel belonging to the example cones.</Paragraph> <Paragraph position="1"> When asked to compose with a given concept, this type of probabilistic attribute composer assigns each object referenced by the source concept the probability density function evaluated at the average colour of the object.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Spatial Extrema and Spatial Regions - Ordering Composers </SectionTitle> <Paragraph position="0"> To determine spatial regions and extrema, an ordering composer orders objects along a specified feature dimension (e.g. x coordinate relative to a group) and picks referents at an extreme end of the ordering. To do so, it assigns an exponential weight function to objects according to g^(i(1+v)) for picking minimal objects, where i is the object's position in the sequence and v is its value along the specified feature dimension, normalized to range between 0 and 1 for the objects under consideration. The maximal case is weighted similarly, but uses the reverse ordering and subtracts the fraction in the exponent from 2, giving g^(i(2-v)). For our reported results g = 0.38. This formula lets referent weights fall off exponentially both with their position in the ordering and their distance from the extreme object.
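A small sketch of these weight functions, using g = 0.38 as reported; the 0-based ranking and the handling of degenerate cases are assumptions.

```python
G = 0.38

def extremum_weights(values, maximal=False):
    """values: one feature value per object (e.g. its x coordinate).
    Returns a reference weight per object: g^(i(1 + v)) for the minimal case,
    where i is the (0-based) position in the ordering and v is the value
    normalized to [0, 1]; the maximal case uses the reverse ordering and
    2 - v in place of 1 + v."""
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    norm = [(val - lo) / span for val in values]
    order = sorted(range(len(values)), key=lambda k: norm[k], reverse=maximal)
    weights = [0.0] * len(values)
    for i, k in enumerate(order):
        exponent = i * ((2 - norm[k]) if maximal else (1 + norm[k]))
        weights[k] = G ** exponent
    return weights

def region_weights(distances):
    """Weights for region words like "middle": g^(1 + d/d_max), where d is an
    object's distance from the reference point (e.g. the centre of the board)."""
    d_max = max(distances) or 1.0
    return [G ** (1 + d / d_max) for d in distances]
```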
In this way, extreme objects are isolated except in cases where many referents cluster around an extremum, making it difficult to pick out a single referent.</Paragraph> <Paragraph position="1"> We attach this type of composer to words like &quot;leftmost&quot; and &quot;top&quot;.</Paragraph> <Paragraph position="2"> The ordering composer can also order objects according to their absolute position, corresponding more closely to spatial regions than to spatial extrema relative to a group. The reference strength formula for this version is g^(1 + d/dmax), where d is the Euclidean distance from a reference point, and dmax the maximum such distance amongst the objects under consideration. This version of the composer is attached to words like &quot;middle&quot;. It has the effect that reference weights are relative to absolute position on the screen. An object close to the centre of the board achieves a greater reference weight for the word &quot;middle&quot;, independently of the position of other objects of its kind. Ordering composers work across any number of dimensions by simply ordering objects by their Euclidean distance, using the same exponential falloff function as in the other cases.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Grouping Composers </SectionTitle> <Paragraph position="0"> For non-numbered grouping (e.g., when the describer says &quot;group&quot; or &quot;cones&quot;), the grouping composer searches the scene for groups of objects that are all within a maximum distance threshold from another group member. It only considers objects that are referenced by the concept it is passed as an argument. For numbered groups (&quot;two&quot;, &quot;three&quot;), the composer applies the additional constraint that the groups have to contain the correct number of objects. Reference strengths for the concept are determined by the average distance of objects within the group.</Paragraph> <Paragraph position="1"> The output of a grouping composer may be thought of as a group of groups. To understand the motivation for this, consider the utterance, &quot;the one to the left of the group of purple ones&quot;. In this expression, the phrase &quot;group of purple ones&quot; will activate a grouping composer that will find clusters of purple cones. For each cluster, the composer computes the convex hull (the minimal &quot;elastic band&quot; that encompasses all the objects) and creates a new composite object that has the convex hull as its shape. When further composition takes place to understand the entire utterance, each composite group serves as a potential landmark relative to &quot;left&quot;.</Paragraph> <Paragraph position="2"> However, concepts can be marked so that their behaviour changes to split apart concepts referring to groups.</Paragraph> <Paragraph position="3"> For example, the composer attached to &quot;of&quot; sets this flag on concepts passing through it. Note that &quot;of&quot; is only involved in composition for grammar rules of the type NP → NP P NP, but not for those performing spatial compositions for phrases like &quot;to the left of&quot;. Therefore, the phrase &quot;the frontmost one of the three green ones&quot; will pick the front object within the best group of three green objects.</Paragraph> </Section>
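A rough sketch of this grouping step, assuming 2D object positions and an illustrative distance threshold; the composite convex-hull object is only indicated in a comment.

```python
import math

def find_groups(positions, max_dist=2.0, count=None):
    """positions: dict of object id -> (x, y), already restricted to the objects
    referenced by the incoming concept.  Returns candidate groups in which every
    member is within max_dist of some other member (connected components under
    that relation), optionally constrained to a required member count, ordered
    by average intra-group distance (smaller -> stronger reference)."""
    ids = list(positions)
    parent = {i: i for i in ids}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in ids:
        for b in ids:
            if a != b and math.dist(positions[a], positions[b]) <= max_dist:
                parent[find(a)] = find(b)

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    candidates = [g for g in groups.values() if count is None or len(g) == count]

    def avg_pairwise_dist(group):
        pairs = [(a, b) for a in group for b in group if a < b]
        if not pairs:
            return 0.0
        return sum(math.dist(positions[a], positions[b]) for a, b in pairs) / len(pairs)

    # A composite "group object" (e.g. the convex hull of the members) would be
    # built here for use as a landmark in later compositions.
    return sorted(candidates, key=avg_pairwise_dist)
```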
<Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Spatial Relations - Spatial Composers </SectionTitle> <Paragraph position="0"> The spatial semantic composer employs a version of the Attentional Vector Sum (AVS) suggested by Regier and Carlson (2001). The AVS is a measure of spatial relation meant to approximate human judgements corresponding to words like &quot;above&quot; and &quot;to the left of&quot; in 2D scenes of objects. Given two concepts as arguments, the spatial semantic composer converts both into sets of objects, treating one set as providing possible landmarks, the other as providing possible targets. The composer then calculates the AVS for each possible combination of landmarks and targets. Finally, the spatial composer divides the result by the Euclidean distance between the objects' centres of mass, to account for the fact that participants exclusively used nearby objects to select through spatial relations.</Paragraph> </Section>
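A schematic version of this composer is sketched below. The alignment function is a crude stand-in, not the real AVS of Regier and Carlson (2001), and the argument conventions are assumptions.

```python
import math

def directional_alignment(landmark, target, direction=(-1.0, 0.0)):
    """Crude stand-in for the AVS: cosine between the landmark-to-target vector
    and a unit reference direction (here leftward), clipped to [0, 1]."""
    vx, vy = target[0] - landmark[0], target[1] - landmark[1]
    norm = math.hypot(vx, vy)
    if norm == 0.0:
        return 0.0
    cos = (vx * direction[0] + vy * direction[1]) / norm
    return max(0.0, cos)

def spatial_compose(landmark_ids, target_ids, positions, direction=(-1.0, 0.0)):
    """Score every landmark/target combination, then divide by the Euclidean
    distance between the two objects, reflecting that participants only used
    nearby objects as landmarks for spatial relations."""
    scores = {}
    for lm in landmark_ids:
        for tg in target_ids:
            if lm == tg:
                continue
            d = math.dist(positions[lm], positions[tg])
            scores[(lm, tg)] = directional_alignment(positions[lm], positions[tg],
                                                     direction) / max(d, 1e-6)
    return scores
```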
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Anaphoric Composers </SectionTitle> <Paragraph position="0"> Triggered by words like &quot;that&quot; (as in &quot;to the left of that one&quot;) or &quot;previous&quot;, an anaphoric composer produces a concept that refers to a single object, namely the last object removed from the scene during the session. The composer specially marks this concept as referring not to the current but to the previous visual scene, and any further calculations with this concept are performed in that visual context.</Paragraph> </Section> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Example: Understanding a Description </SectionTitle> <Paragraph position="0"> Consider the scene in Figure 6 and the output of the chart parser for the utterance &quot;the purple one on the left&quot; in Figure 7. Starting at the top left of the parse output, the parser finds &quot;the&quot; in the lexicon as an ART (article) with a selecting composer that takes one argument. It finds two lexical entries for &quot;purple&quot;, one marked as a CADJ (colour adjective), and one as an N (noun). Each of them has the same composer, a probabilistic attribute composer marked as P(), but the adjective expects one argument whereas the noun expects none. Given that the noun expects no arguments and that the grammar contains a rule of the form NP → N, an NP (noun phrase) is instantiated and the probabilistic composer is applied to the default set of objects yielded by N, which consists of all objects visible. This composer call is marked P(N) in the chart. After composition, the NP contains a subset of only the purple objects (Figure 6, top right). At this point the parser applies NP → ART NP, which produces the NP spanning the first two words and again contains only the purple objects, but is marked as unambiguously referring to an object. S(NP) marks the application of this selecting composer, called S.</Paragraph> <Paragraph position="1"> The parser goes on to produce a similar NP covering the first three words by combining the &quot;purple&quot; CADJ with &quot;one&quot; and the result with &quot;the&quot;. The &quot;on&quot; P (preposition) is left dangling for the moment as it needs a constituent that follows it. It contains a modifying semantic composer that simply bridges the P, applying the first argument to the second. After another &quot;the&quot;, &quot;left&quot; has several lexical entries: in its ADJ and one of its N forms it contains an ordering semantic composer that takes a single argument, whereas its second N form contains a spatial semantic composer that takes two arguments to determine a target and a landmark object. At this point the parser can combine &quot;the&quot; and &quot;left&quot; into two possible NPs, one containing the ordering and the other the spatial composer. The first of these NPs in turn fulfills the need of the &quot;on&quot; P for a second argument according to NP → NP P NP, performing its ordering composition first on &quot;one&quot; (for &quot;one on the left&quot;), selecting all the objects on the left (Figure 6, bottom left). The application of the ordering composer is denoted as O.x.min(NP) in the chart, indicating that this is an ordering composer ordering along the x axis and selecting the minimum along this axis. On combining with &quot;purple one&quot;, the same composer selects all the purple objects on the left (Figure 6, bottom right).</Paragraph> <Paragraph position="2"> Finally, on &quot;the purple one&quot;, it produces the same set of objects as &quot;purple one&quot;, but marks the concept as unambiguously picking out a single object. Note that the parser attempts to use the second interpretation of &quot;left&quot; (the one containing a spatial composer) but fails because this composer expects two arguments that are not provided by the grammatical structure of the sentence.</Paragraph> </Section> </Paper>