<?xml version="1.0" standalone="yes"?> <Paper uid="W05-0614"> <Title>Intentional Context in Situated Natural Language Learning</Title> <Section position="5" start_page="105" end_page="105" type="metho"> <SectionTitle> 3 Linguistic Mapping </SectionTitle> <Paragraph position="0"> Given a model of intention recognition, the problem for a language learner becomes one of mapping spoken utterances onto appropriate constituents of their inferred intentional representations. Given the intention representation above, this is equivalent to mapping all of the words in an utterance to the role fillers of the appropriate semantic frame in the induced intention tree. To model this mapping procedure, we employ a noisy channel model in which the probability of inferring the correct meaning given an utterance is approximated by the (channel) probability of generating that utterance given that meaning, times the (source) prior probability of the meaning itself (see Equation 1).</Paragraph> <Paragraph position="1"> P(meaning | utterance) ≈ P(utterance | meaning)^a × P(meaning)^(1-a)   (1) where a refers to a weighting coefficient.</Paragraph> <Paragraph position="2"> Here utterance refers to some linguistic unit (usually a sentence) and meaning refers to some node in the tree (represented as a semantic frame) inferred during intention recognition. We can use the probability associated with the inferred tree (as given by the PCFG parser) as the source probability. Further, we can learn the channel probabilities in an unsupervised manner using a variant of the EM algorithm, similar to work in machine translation (Brown et al., 1993) and statistical language understanding (Epstein, 1996).</Paragraph> </Section>
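The decoding implied by Equation 1 can be sketched in a few lines. The following is a minimal illustration under stated assumptions, not the authors' implementation: channel_prob and prior_prob stand in for the EM-estimated word-given-role probabilities and the PCFG parse probability, and all names are hypothetical.

```python
import math

def understand(utterance, candidate_frames, channel_prob, prior_prob, a=0.5):
    """Rank candidate meanings by P(utterance|meaning)^a * P(meaning)^(1-a).

    utterance: list of words; candidate_frames: iterable of semantic frames.
    channel_prob(word, frame) and prior_prob(frame) are assumed callables,
    estimated elsewhere (the EM tables and the PCFG parser, respectively).
    """
    best_frame, best_score = None, float("-inf")
    for frame in candidate_frames:
        # Work in log space to avoid underflow on long utterances.
        log_channel = sum(math.log(channel_prob(w, frame)) for w in utterance)
        log_source = math.log(prior_prob(frame))
        score = a * log_channel + (1 - a) * log_source
        if score > best_score:
            best_frame, best_score = frame, score
    return best_frame
```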
<Section position="6" start_page="105" end_page="108" type="metho"> <SectionTitle> 4 Pilot Experiments </SectionTitle> <Section position="1" start_page="105" end_page="106" type="sub_section"> <SectionTitle> 4.1 Data Collection </SectionTitle> <Paragraph position="0"> In order to avoid the many physical and perceptual problems that complicate work with robots and sensor-grounded data, this work focuses on language learning in virtual environments. We focus on multiplayer videogames, which support rich types of social interaction. The complexities of these environments highlight the problems of ambiguous speech described above, and distinguish this work from projects characterized by more simplified worlds and linguistic interactions, such as SHRDLU (Winograd, 1972).</Paragraph> <Paragraph position="1"> Further, the proliferation of both commercial and military applications (e.g., Rickel et al., 2002) involving such virtual worlds suggests that they will continue to become an increasingly important area for natural language research in the future.</Paragraph> <Paragraph position="2"> In order to test our model, we developed a virtual environment based on the multi-user videogame Neverwinter Nights. The game, shown in Figure 2, provides useful tools for generating modules in which players can interact. The game was instrumented such that all players' speech/text language and actions are recorded during game play. For data collection, a game was designed in which a single player must navigate her way through a cavernous world, collecting specific objects, in order to escape. Subjects were paired such that one, the novice, would control the virtual character, while the other, the expert, guided her through the world. While the expert could say anything in order to tell the novice where to go and what to do, the novice was instructed not to speak, but only to follow the commands of the expert.</Paragraph> [Figure 3: a) data collection produces parallel streams of the expert's speech and the novice's actions; b) intention trees are inferred over the sequence of observed actions using a PCFG parser; c) the linguistic mapping algorithm examines the mappings between the utterance and all possible nodes to learn the best mapping of words given semantic roles.] <Paragraph position="3"> The purpose behind these restrictions was to elicit free and spontaneous speech that is constrained only by the nature of the task. This environment seeks to emulate the type of speech that a real situated language system might encounter: i.e., natural in its characteristics, but limited in its domain of discourse.</Paragraph> <Paragraph position="4"> The subjects in the data collection were university graduate and undergraduate students. Subjects (8 male, 4 female) were staggered such that the novice in one trial became the expert in the next. Each pair played the game at least five times, and for each of those trials, all speech from the expert and all actions from the novice were recorded. Table 1 shows examples of utterances recorded from game play, the observed actions associated with them, and the actions' inferred semantic frames.</Paragraph> [Table 1: Example utterances collected from subjects with associated game actions and frames. Columns: Utterance, Action, Frame; sample utterance: "ok this time you are gonna get the axe first".] <Paragraph position="5"> Data collection produces two parallel streams of information: the sequence of actions taken by the novice and the audio stream produced by the expert (figure 3a). The audio streams are automatically segmented into utterances using a speech endpoint detector, and these utterances are then transcribed by a human annotator. Each action in the sequence is then automatically parsed, and each node in the tree is replaced with a semantic frame (figure 3b). The data streams are then fed into the linguistic mapping algorithms as a parallel corpus of the expert's transcribed utterances and the inferred semantic roles associated with the novice's actions (figure 3c).</Paragraph> </Section> <Section position="2" start_page="106" end_page="107" type="sub_section"> <SectionTitle> 4.2 Algorithms </SectionTitle> <Paragraph position="0"> Intention Recognition. As described in section 2, we represent the task model associated with the game as a set of production rules in which the left hand side consists of an intended action (e.g., &quot;find key&quot;) and the right hand side consists of a sequence of sub-actions that are sufficient to complete that action (e.g., &quot;go through door, open chest, pick_up key&quot;). By applying probabilities to the rules, intention recognition can be treated as a probabilistic context free parsing problem, following Pynadath (1999).</Paragraph> <Paragraph position="1"> For these initial experiments we have hand-annotated the training data in order to generate the grammar used for intention recognition, estimating the rules' maximum likelihood probabilities over the training set.</Paragraph>
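As a concrete (toy) illustration of this representation, the sketch below encodes a hypothetical task grammar with NLTK's PCFG tools. The rules and probabilities are invented for illustration, and NLTK's Viterbi parser stands in for the modified probabilistic Earley parser described below, so no incremental partial trees are produced.

```python
import nltk

# Hypothetical task grammar: each intention (LHS) expands into the
# sub-actions sufficient to complete it (RHS), with probabilities that
# would be ML-estimated from the annotated training data.
grammar = nltk.PCFG.fromstring("""
    ESCAPE -> FIND_KEY OPEN_EXIT [1.0]
    FIND_KEY -> GO_DOOR OPEN_CHEST PICKUP_KEY [0.7]
    FIND_KEY -> OPEN_CHEST PICKUP_KEY [0.3]
    GO_DOOR -> 'click_on_door' [1.0]
    OPEN_CHEST -> 'click_on_chest' [1.0]
    PICKUP_KEY -> 'pickup_key' [1.0]
    OPEN_EXIT -> 'click_on_exit' [1.0]
""")

# Viterbi parsing recovers the most probable intention tree over a
# complete observed action sequence.
parser = nltk.ViterbiParser(grammar)
actions = ["click_on_door", "click_on_chest", "pickup_key", "click_on_exit"]
for tree in parser.parse(actions):
    print(tree.prob())
    tree.pretty_print()
```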
<Paragraph position="2"> In future work, we intend to examine how such grammars can be learned in conjunction with the language itself, extending research on learning task models (Nicolescu and Mataric, 2003) and work on learning PCFGs (Klein and Manning, 2004) with our own work on unsupervised language learning.</Paragraph> <Paragraph position="3"> Given the PCFG, we use a probabilistic Earley parser (Stolcke, 1994), modified slightly to output partial trees (with probabilities) as each action is observed. (We use 65 different frames, composed of 35 unique role fillers.) Figure 4 shows a time slice of an inferred intention tree after a player mouse-clicked on a lever in the game. Note that both the vertical and horizontal ambiguities that exist for this action in the game parallel the ambiguities shown in Figure 1. As described above, each node in the tree is represented as a semantic frame (see figure 4 insets), whose roles are aligned to the words in the utterances during the linguistic mapping phase.</Paragraph> </Section> <Section position="4" start_page="107" end_page="107" type="sub_section"> <SectionTitle> Linguistic Mapping </SectionTitle> <Paragraph position="0"> The problem of learning a mapping between linguistic labels and nodes in an inferred intentional tree is recast as one of learning the channel probabilities in Equation 1. Each node in a tree is treated as a simple semantic frame, and the role fillers in these frames, along with the words in the utterances, are treated as a parallel corpus. This corpus is used as input to a standard Expectation Maximization algorithm that estimates the probabilities of generating a word given the occurrence of a role filler. We follow IBM Model 1 (Brown et al., 1993) and assume that each word in an utterance is generated by exactly one role in the parallel frame.</Paragraph> <Paragraph position="1"> Using standard EM to learn the role-to-word mapping is sufficient only if one knows to which level in the tree the utterance should be mapped. However, because of the vertical ambiguity inherent in intentional actions, we do not know in advance which is the correct utterance-to-level mapping. To account for this, we extend the standard EM algorithm as follows (see figure 3c): 1) set uniform likelihoods for all utterance-to-level mappings; 2) for each mapping, run standard EM; 3) merge the output distributions of EM, weighting each by its mapping likelihood; 4) use the merged distribution to recalculate the likelihoods of all utterance-to-level mappings; 5) go to step 2.</Paragraph> </Section>
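A schematic implementation of this extended EM procedure is sketched below, under stated assumptions. For compactness it folds steps 2 and 3 into a single fractionally weighted Model 1 E-step (weighting each utterance-to-level pair by its current mapping likelihood), which approximates running EM per mapping and merging; all data structures and names are illustrative.

```python
from collections import defaultdict

def weighted_model1(corpus, n_iter=10):
    """IBM Model 1 EM over a weighted parallel corpus.

    corpus: list of (roles, words, weight) triples.
    Returns t[role][word], the probability of generating word from role.
    """
    t = defaultdict(lambda: defaultdict(lambda: 1e-3))  # near-uniform init
    for _ in range(n_iter):
        count = defaultdict(lambda: defaultdict(float))
        total = defaultdict(float)
        for roles, words, weight in corpus:
            for word in words:
                z = sum(t[r][word] for r in roles)
                for r in roles:
                    c = weight * t[r][word] / z  # fractional count
                    count[r][word] += c
                    total[r] += c
        for r in count:  # M-step: renormalize counts
            for word in count[r]:
                t[r][word] = count[r][word] / total[r]
    return t

def extended_em(utterances, levels, n_rounds=5):
    """utterances[i]: word list for the i-th utterance; levels[i]: one
    role set per candidate node on the path of the inferred tree."""
    # 1) uniform likelihoods for all utterance-to-level mappings
    lik = [[1.0 / len(ls)] * len(ls) for ls in levels]
    for _ in range(n_rounds):
        # 2-3) weighted EM over all mappings at once
        corpus = [(roles, utterances[i], lik[i][j])
                  for i, ls in enumerate(levels)
                  for j, roles in enumerate(ls)]
        t = weighted_model1(corpus)
        # 4) recalculate mapping likelihoods from the merged distribution
        for i, ls in enumerate(levels):
            scores = []
            for roles in ls:
                p = 1.0
                for word in utterances[i]:
                    p *= sum(t[r][word] for r in roles) / len(roles)
                scores.append(p)
            z = sum(scores) or 1.0
            lik[i] = [s / z for s in scores]
    return t, lik  # 5) the loop back to step 2 is realized via n_rounds
```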
<Section position="5" start_page="107" end_page="108" type="sub_section"> <SectionTitle> 4.3 Experiments </SectionTitle> <Paragraph position="0"> Methodologies for evaluating language acquisition tasks are not standardized. Given our model, there exists the possibility of employing intrinsic measures of success, such as word alignment accuracy. However, we choose to measure the success of learning by examining the related (and more natural) task of language understanding.</Paragraph> <Paragraph position="1"> For each subject pair, the linguistic mapping algorithms are trained on the first four trials of game play and tested on the final trial. (This gives on average 130 utterances of training data and 30 utterances of testing data per pair.) For each utterance in the test data, we calculate the likelihood that it was generated by each frame seen in testing. We select the maximum likelihood frame as the system's hypothesized meaning for the test utterance, and examine both how often the maximum likelihood estimate exactly matches the true frame (frame accuracy), and how many of the role fillers within the estimated frame match the role fillers of the true frame (role accuracy).</Paragraph>
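Both measures can be stated precisely in a few lines. A minimal sketch, assuming (purely for illustration) that each frame is represented as a set of (role, filler) pairs:

```python
def frame_accuracy(predicted, gold):
    # Fraction of test utterances whose maximum likelihood frame
    # exactly matches the true frame.
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

def role_accuracy(predicted, gold):
    # Fraction of role fillers in the true frames that also appear
    # in the corresponding hypothesized frames.
    hits = sum(len(set(p) & set(g)) for p, g in zip(predicted, gold))
    return hits / sum(len(g) for g in gold)
```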
<Paragraph position="2"> For each subject, the algorithm's parameters are optimized using data from all other subjects. We assume correct knowledge of the temporal alignment between utterances and actions. In future work, we will relax this assumption to explore the effects of not knowing which actions correspond to which utterances in time.</Paragraph> <Paragraph position="3"> To examine the performance of the model, three experiments are presented. Experiment 1 examines the basic performance of the algorithms on the language understanding task described above given uniform priors. The system is tested under two conditions: 1) using the extended EM algorithm given an unknown utterance-to-level alignment, and 2) using the standard EM algorithm given the correct utterance-to-level alignment.</Paragraph> <Paragraph position="4"> Experiment 2 tests the benefit of incorporating intentional context directly into language understanding. This is done by using the parse probability of each hypothesized intention as the source probability in Equation 1. Thus, given an utterance to understand, we cycle through all possible actions in the grammar, parse each one as if it were observed, and use the probability generated by the parser as its prior probability. By changing the weighting coefficient (a) between the source and channel probabilities, we show the range of performances of the system, from using no intentional context (a=1) to using only intentional context (a=0).</Paragraph> [Figure 5: Frame and role accuracy with the utterance-to-level alignment both known and unknown. Performance is on a language understanding task (baseline equivalent to choosing the most frequent frame).]
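Experiment 2's sweep over the weighting coefficient can be expressed by varying a in the understand sketch from section 3 and scoring with the metrics above. The harness below is hypothetical, with the scoring and metric functions passed in rather than assumed global:

```python
def sweep_alpha(test_set, candidate_frames, channel_prob, prior_prob,
                understand, frame_accuracy):
    """test_set: list of (utterance, gold_frame) pairs; 'understand'
    and 'frame_accuracy' are the sketches given earlier."""
    gold = [g for _, g in test_set]
    for a in [i / 10 for i in range(11)]:  # a=0: context only; a=1: channel only
        preds = [understand(u, candidate_frames, channel_prob, prior_prob, a=a)
                 for u, _ in test_set]
        print(f"a={a:.1f}  frame accuracy={frame_accuracy(preds, gold):.3f}")
```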
<Paragraph position="5"> Experiment 3 studies to what extent inferred tree structures are necessary when modeling language acquisition. Although, in section 1, we have presented intuitive reasons why such structures are required, one might argue that inferring trees over sequences of observed actions might not actually improve understanding performance when compared to a model trained only on the observed actions themselves. This hypothesis is tested by comparing a model trained given the correct utterance-to-level alignment (described in experiment 1) with a model in which each utterance is aligned to the leaf node (i.e., observed action) below the correct level of alignment. For example, in figure 4, this would correspond to mapping the utterance &quot;go through the door&quot; not to &quot;GO THROUGH DOOR&quot;, but rather to &quot;CLICK_ON LEVER.&quot;</Paragraph> </Section> </Section> <Section position="7" start_page="108" end_page="109" type="metho"> <SectionTitle> 4.4 Results </SectionTitle> <Paragraph position="0"> Experiment 1: We examine the performance of the system with the utterance-to-level alignment both known and unknown, and compare it to a baseline of choosing the most frequent frame from the training data. Figure 5 shows the percentage of maximum likelihood frames chosen by the system that exactly match the intended frame (frame accuracy), as well as the percentage of roles from the maximum likelihood frame that overlap with roles in the intended frame (role accuracy).</Paragraph> <Paragraph position="1"> As expected, the understanding performance goes down for both frames and roles when the correct utterance-to-level alignment is unknown. Interestingly, while the frame performance declines by 14.3%, the performance on roles declines by only 6.4%. This difference is due primarily to the fact that, while the mapping from words to action role fillers is hindered by the need to examine all alignments, the mapping from words to object role fillers remains relatively robust. This is because, while each level of intention carries a different action term, the objects described at different levels often remain the same. For example, in figure 4, the action fillers &quot;TAKE&quot;, &quot;MOVE&quot;, &quot;OPEN&quot;, and &quot;PULL&quot; each occur only once along the path, whereas the object filler &quot;DOOR&quot; occurs multiple times. Thus, the chance that the role filler &quot;DOOR&quot; correctly maps to the word &quot;door&quot; is relatively high compared to the chance that the role filler &quot;OPEN&quot; maps to the word &quot;open.&quot; (This asymmetry between learning words about actions and words about objects is well known in psychology (Gleitman, 1990) and is addressed directly in Fleischman and Roy, 2005.)</Paragraph> <Paragraph position="2"> Experiment 2: Figure 6 shows the accuracy of the system trained without knowing the correct utterance-to-level alignment, as a function of varying the a values from Equation 1. The graph shows that including intentional context does improve system performance when it is not given too much weight (i.e., at relatively high alpha values). This suggests that the benefit of intentional context is somewhat outweighed by the power of the learned role-to-word mappings. Looking closer, we find a strong negative correlation (r=-0.81) between the understanding performance using only channel probabilities (a=1) and the improvement obtained by including the intentional context. In other words, the better one does without context, the less context improves performance. Thus, we expect that in noisier environments (such as when speech recognition is employed), where channel probabilities are less reliable, employing intentional context will be even more advantageous.</Paragraph> <Paragraph position="3"> Experiment 3: Figure 7 shows the average performance on both frame and role accuracy for systems trained without using the inferred tree structure (on leaf nodes only) and on the full tree structure (given the correct utterance-to-level alignment). Baselines are calculated by choosing the most frequent frame from training. (Note that the baselines differ between the two conditions, because a different number of frames is used in the leaf-node-only condition.)</Paragraph> [Figure 7: Performance of systems trained on the inferred intentional tree vs. directly on observed actions.] <Paragraph position="4"> It is clear from the figure that understanding performance is higher when the intentional tree is used in training. This is a direct result of the fact that speakers often speak about high-level intentions with words that do not directly refer to the observed actions. For example, after opening a door, experts often say: &quot;go through the door,&quot; for which the observed action is a simple movement (e.g., &quot;MOVE ROOMx&quot;). Also, by referring to high-level intentions, experts can describe sequences of actions that are not immediately referred to. For example, an expert might say: &quot;get the key&quot; to describe a sequence of actions that begins with &quot;CLICK_ON CHEST.&quot; Thus, the result of not learning over a parsed hierarchical representation of intentions is increased noise, and subsequently, poorer understanding performance.</Paragraph> </Section> <Section position="8" start_page="109" end_page="110" type="metho"> <SectionTitle> 5 Discussion </SectionTitle> <Paragraph position="0"> The results from these experiments, although preliminary, indicate that this model of language acquisition performs well above baseline on a language understanding task. This is particularly encouraging given the unconstrained nature of the speech on which it was trained. Thus, even free and spontaneous speech can be handled when modeling a constrained domain of discourse.</Paragraph> <Paragraph position="1"> In addition to performing well given difficult data, the experiments demonstrate the advantages of using an inferred intentional representation, both as a contextual aid to understanding and as a representational scaffolding for language learning. More important than these preliminary results, however, is the general lesson that this work suggests about the importance of knowledge representations for situated language acquisition. As discussed in section 2, learning language about intentional action requires dealing with two distinct types of ambiguity. These difficulties cannot be handled by merely increasing the amount of data used, or by switching to a more sophisticated learning algorithm. Rather, dealing with language use for situated applications requires building appropriate knowledge representations that are powerful enough for unconstrained language, yet scalable enough for practical applications. The work presented here is an initial demonstration of how the semantics of unconstrained speech can be modeled by focusing on constrained domains.</Paragraph> <Paragraph position="2"> As for scalability, it is our contention that for situated NLP, it is not a question of being able to scale up a single model to handle open-domain speech. The complexity of situated communication requires the use of domain-specific knowledge for modeling language use in different contexts. Thus, with situated NLP systems, it is less productive to focus on how to scale up single models to operate beyond their original domains. Rather, as more individual applications are tackled (e.g., cars, phones, videogames, etc.),
the interesting question becomes one of how agents can learn to switch between different models of language as they interact in different domains of discourse. (Notably, situated applications for which natural language interfaces are required typically have limited domains; e.g., talking to one's car doesn't require open-domain language processing.)</Paragraph> </Section> </Paper>