<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-2040">
<Title>Sentence Planning for Realtime Navigational Instructions</Title>
<Section position="3" start_page="0" end_page="157" type="metho">
<SectionTitle> 2 Dialog Collection Procedure </SectionTitle>
<Paragraph position="0"> Our task setup employs a virtual-reality (VR) world in which one partner, the direction-follower (DF), moves about in the world to perform a series of tasks, such as pushing buttons to re-arrange objects in the room, picking up items, etc. The partners communicated through headset microphones.</Paragraph>
<Paragraph position="1"> The simulated world was presented from a first-person perspective on a desktop computer monitor. The DF has no knowledge of the world map or tasks.</Paragraph>
<Paragraph position="2"> His partner, the direction-giver (DG), has a paper 2D map of the world and a list of tasks to complete.</Paragraph>
<Paragraph position="3"> During the task, the DG has instant feedback about the DF's location in the VR world, via mirroring of his partner's screen on his own computer monitor. The DF can change his position or orientation within the virtual world independently of the DG's directions, but since only the DG knows the tasks, their collaboration is necessary. In this study, we are most interested in the behavior of the DG, since the algorithm we develop emulates this role. Our paid participants were recruited in pairs, and were self-identified native speakers of North American English.</Paragraph>
[Figure 1: An example sequence with repositioning. Excerpt: 00:17:19 "and go through that door" [D6]]
[Figure 2: A dialog fragment in which the DG steers the DF to a cabinet; the repositioning commands were shown in bold.
DG: ok, yeah, go through that door [D9, locate]
    turn to your right
    'mkay, and there's a door [D11, vague] in there
    um, go through the one straight in front of you [D11, locate]
    ok, stop... and then turn around and look at the buttons [B18, B20, B21]
    ok, you wanna push the button that's there on the left by the door [B18]
    ok, and then go through the door [D10]
    look to your left there, in that cabinet there [C6, locate]]
<Paragraph position="4"> The video output of the DF's computer was captured to a camera, along with the audio stream from both microphones. A log file created by the VR engine recorded the DF's coordinates, gaze angle, and the positions of objects in the world. All three data sources were synchronized using calibration markers. A technical report is available (Byron, 2005) that describes the recording equipment and software used. Figure 2 is a dialog fragment in which the DG steers his partner to a cabinet, using both a sequence of target objects and three additional repositioning commands (in bold) to adjust his partner's spatial relationship with the target.</Paragraph>
<Section position="1" start_page="157" end_page="157" type="sub_section">
<SectionTitle> 2.1 Developing the Training Corpus </SectionTitle>
<Paragraph position="0"> We recorded fifteen dialogs containing a total of 221 minutes of speech. The corpus was transcribed and word-aligned. The dialogs were further annotated using the Anvil tool (Kipp, 2004) to create a set of target referring expressions. Because we are interested in the spatial properties of the referents of these target referring expressions, the items included in this experiment were restricted to objects with a defined spatial position (buttons, doors and cabinets). We excluded plural referring expressions, since their spatial properties are more complex, and also expressions annotated as vague or abandoned.</Paragraph>
<Paragraph position="1"> Overall, the corpus contains 1736 markable items, of which 87 were annotated as vague, 84 as abandoned and 228 as sets.</Paragraph>
<Paragraph position="2"> We annotated each referring expression with a boolean feature called Locate that indicates whether the expression is the first one that allowed the follower to identify the object in the world, in other words, the point at which joint spatial reference was achieved. The kappa (Carletta, 1996) obtained on this feature was 0.93. There were 466 referring expressions in the 15-dialog corpus that were annotated TRUE for this feature.</Paragraph>
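For illustration, an agreement score of the kind reported above can be computed as Cohen's kappa over the two annotators' boolean Locate judgments. The sketch below is a minimal, generic implementation, not the project's annotation tooling, and the example labels are invented placeholders.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' labels over the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items the annotators label identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    dist_a, dist_b = Counter(labels_a), Counter(labels_b)
    expected = sum((dist_a[lab] / n) * (dist_b[lab] / n)
                   for lab in set(labels_a) | set(labels_b))
    return (observed - expected) / (1 - expected)

# Hypothetical Locate annotations for a handful of referring expressions.
annotator_1 = [True, True, False, False, True, False]
annotator_2 = [True, True, False, True, True, False]
print(round(cohens_kappa(annotator_1, annotator_2), 2))  # 0.67 on this toy data
```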
<Paragraph position="3"> The dataset used in the experiments is a consensus version on which both annotators agreed on the set of markables. Due to the constraints introduced by the task, referent annotation achieved almost perfect agreement. Annotators were allowed to look ahead in the dialog to assign the referent. The data used in the current study is only the DG's language.</Paragraph>
</Section>
</Section>
<Section position="4" start_page="157" end_page="159" type="metho">
<SectionTitle> 3 Algorithm Development </SectionTitle>
<Paragraph position="0"> The generation module receives as input a route plan produced by a planning module, composed of a list of graph nodes that represent the route. As each subsequent target on the list is selected, content planning considers the tuple of variables ⟨ID, LOC⟩, where ID is an identifier for the target and LOC is the DF's location (his Cartesian coordinates and orientation angle). Target IDs are always the IDs of objects to be visited in performing the task, such as a door that the DF must pass through. The VR world updates the value of LOC at a rate of 10 frames/sec.</Paragraph>
<Paragraph position="1"> Using these variables, the content planner must decide whether the DF's current location is appropriate for producing a referring expression to describe the object.</Paragraph>
<Paragraph position="2"> The following features are calculated from this information: the absolute Angle between the target and the follower's view direction, which implicitly gives the "in front" relation; the Distance from the target; the number of visible distractors (VisDistracts); the number of visible distractors of the same semantic category (VisSemDistracts); whether the target is visible (boolean Visible); and the target's semantic category (Cat: button/door/cabinet). Figure 3 is an example spatial configuration with these features identified.</Paragraph>
[Figure 3: An example spatial configuration with the features labeled. The target object is B4, and [B1, B2, B3, B4, C1, D1] are perceptually accessible.]
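To make the feature set concrete, the sketch below shows one way these spatial features could be derived from the ⟨ID, LOC⟩ tuple and the VR engine's object log. It is an illustrative sketch under assumed data structures: the Obj/Loc classes, field names, and the simple field-of-view visibility test are ours, not the system's actual API (a real visibility check would also account for occlusion by walls and objects).

```python
import math
from dataclasses import dataclass

@dataclass
class Obj:
    oid: str          # e.g. "B4"
    cat: str          # "button" | "door" | "cabinet"
    x: float
    y: float

@dataclass
class Loc:
    x: float
    y: float
    heading_deg: float    # DF's view direction (orientation angle)

def angle_to(loc: Loc, obj: Obj) -> float:
    """Absolute angle between the DF's view direction and the target, in degrees."""
    bearing = math.degrees(math.atan2(obj.y - loc.y, obj.x - loc.x))
    diff = (bearing - loc.heading_deg + 180.0) % 360.0 - 180.0
    return abs(diff)

def distance_to(loc: Loc, obj: Obj) -> float:
    return math.hypot(obj.x - loc.x, obj.y - loc.y)

def is_visible(loc: Loc, obj: Obj, fov_deg: float = 90.0) -> bool:
    # Placeholder visibility test: within the assumed field of view.
    return angle_to(loc, obj) <= fov_deg / 2.0

def spatial_features(target_id: str, loc: Loc, world: dict[str, Obj]) -> dict:
    """Compute the Angle/Distance/distractor features for one candidate target."""
    target = world[target_id]
    distractors = [o for o in world.values()
                   if o.oid != target_id and is_visible(loc, o)]
    return {
        "Angle": angle_to(loc, target),
        "Distance": distance_to(loc, target),
        "VisDistracts": len(distractors),
        "VisSemDistracts": sum(o.cat == target.cat for o in distractors),
        "Visible": is_visible(loc, target),
        "Cat": target.cat,
    }
```

Since LOC is updated at 10 frames/sec, such a feature vector would be recomputed on each new frame for the current target.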
<Section position="1" start_page="158" end_page="158" type="sub_section">
<SectionTitle> 3.1 Decision Tree Training </SectionTitle>
<Paragraph position="0"> Training examples from the annotation data are tuples containing the ID of the annotated description, the LOC of the DF at that moment (from the VR engine log), and a class label: either Positive or Negative. Because we expect some latency between when the DG judges that a felicity condition is met and when he begins to speak, rather than using the spatial context features that co-occur with the onset of each description, we averaged the feature values over a 0.3 second window centered at the onset of the expression.</Paragraph>
<Paragraph position="1"> Negative contexts are difficult to identify since they often do not manifest linguistically: the DG may say nothing and allow the user to continue moving along his current vector, or he may issue a movement command. A minimal criterion for producing an expression that can achieve joint spatial reference is that the addressee must have perceptual access to the item. Therefore, negative training examples for this experiment were selected from the time periods after the follower achieved perceptual access to the object (coming into the same room with it, though not necessarily looking at it) but before the Locating description was spoken. In these negative examples, we consider the basic felicity conditions for producing a descriptive reference to the object to be met, yet the DG did not produce a description. The dataset of 932 training examples was balanced to contain 50% positive and 50% negative examples.</Paragraph>
</Section>
<Section position="2" start_page="158" end_page="159" type="sub_section">
<SectionTitle> 3.2 Decision Tree Performance </SectionTitle>
<Paragraph position="0"> This evaluation is based on our algorithm's ability to reproduce the linguistic behavior of our human subjects, which may not be ideal behavior.</Paragraph>
<Paragraph position="1"> The Weka toolkit was used to build a decision tree classifier (Witten and Frank, 2005). Figure 4 shows the resulting tree. 20% of the examples were held out as test items, and 80% were used for training with 10-fold cross-validation. Based on training results, the tree was pruned to a minimum of 30 instances per leaf. The final tree correctly classified 86% of the test data.</Paragraph>
<Paragraph position="2"> Because the number of positive and negative examples was balanced, the first baseline is 50%. For a more informed baseline, we assume that a description is produced whenever the referent is visible to the DF. Marking all cases where the referent was visible as describe-id and all other examples as delay gives a higher baseline of 70%, still 16% lower than the result of our tree.</Paragraph>
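As an illustration of this training recipe, the sketch below uses scikit-learn as a stand-in for Weka: feature values are averaged over a 0.3 s window (three frames at 10 frames/sec) centered on each onset, negatives are downsampled to give a 50/50 balance, and a decision tree is fit with a minimum of 30 instances per leaf on an 80/20 split (the 10-fold cross-validation used for tuning is omitted here). The frame and onset data structures are hypothetical.

```python
import random
from statistics import mean
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

NUMERIC = ["Angle", "Distance", "VisDistracts", "VisSemDistracts", "Visible"]

def windowed_example(frames, onset_idx, half_window=1):
    """Average numeric features over a window centered on the onset frame.
    At 10 frames/sec, half_window=1 spans roughly 0.3 seconds."""
    window = frames[max(0, onset_idx - half_window): onset_idx + half_window + 1]
    return [mean(float(f[name]) for f in window) for name in NUMERIC]

def build_balanced_dataset(positive_onsets, negative_onsets, frames, seed=0):
    """Onset lists hold frame indices of describe vs. delay contexts."""
    rng = random.Random(seed)
    negatives = rng.sample(negative_onsets, len(positive_onsets))  # 50/50 balance
    X = [windowed_example(frames, i) for i in positive_onsets + negatives]
    y = [1] * len(positive_onsets) + [0] * len(negatives)
    return X, y

def train_tree(X, y, seed=0):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    tree = DecisionTreeClassifier(min_samples_leaf=30, random_state=seed)
    tree.fit(X_tr, y_tr)
    return tree, tree.score(X_te, y_te)   # held-out accuracy
```

At runtime, the same feature vector computed from the live LOC updates would be passed to tree.predict to choose between describing the target and delaying, which is the integration path sketched in the concluding paragraph.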
<Paragraph position="3"> Previous findings in spatial cognition identify angle, distance and shape as the key factors in establishing spatial relationships (Gapp, 1995), with angular deviation being the most important feature for projective spatial relations. Our algorithm also selects Angle and Distance as informative features. VisDistracts is selected as the most important feature by the tree, suggesting that having a large number of objects to contrast against makes the description harder, which is consistent with human intuition. We note that Visible is not selected, but that may be because it largely reduces to a threshold on Angle. In terms of the referring expression generation algorithm described by Reiter and Dale (1992), in which the description that eliminates the most distractors is selected, our results suggest that the human subjects chose to reduce the size of the distractor set before producing a description, presumably in order to reduce the computational load required to calculate the optimal description.</Paragraph>
<Paragraph position="4"> The exact values of the features shown in our decision tree are specific to our environment. However, the features themselves are domain-independent and are relevant for any spatial direction-giving task, and their relative influence over the final decision may transfer to a new domain. To incorporate our findings in a system, we will monitor the user's context and plan a description only when our tree predicts it.</Paragraph>
</Section>
</Section>
</Paper>