<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3013"> <Title>Context Sensing using Speech and Common Sense</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 4 OVERHEAR </SectionTitle> <Paragraph position="0"> The OVERHEAR system is a newer system, built on top of GISTER, and distinguishes between aspects of the conversation that refer to past, present, and future events. The system relies on LifeNet, a probabilistic graphical model of human behavior, to infer the events occurring in each of those three time periods.</Paragraph> <Paragraph position="1"> We have two reasons for trying to distinguish between past, present, and future events. First, using additional sensory context (such as addition information about the speakers' location) to bias the results of gist sensing only works when the conversation is referring to the present context. Often, people's conversations referred to things that happened in the past, or things they were planning to do in the future, and in those cases sensory context only hurt GISTER's performance.</Paragraph> <Paragraph position="2"> However, one could imagine making use of recorded, time-stamped sensory data to bias the gisting of conversations that were talking about events that happened earlier.</Paragraph> <Paragraph position="3"> The structure of LifeNet is represented by a Markov network whose structure resembles a Dynamic Bayesian Network (DBN). Although lacking the 'explaining away' power of true Bayesian inference, the model is not constrained to directed acyclic graphs. LifeNet is designed to support the same kinds of temporal inferences as a DBN, including predicting future states and guessing prior states from the current state.</Paragraph> <Paragraph position="4"> LifeNet was built as a probabilistic graphical model because stochastic methods can be more tolerant than traditional logical reasoning methods to the uncertainty in our knowledge of the situation, as well as to the uncertainty in the reliability of the rules themselves. Additionally these methods have efficient and well-known inference procedures for generating approximate solutions. null Second, our long term goal is to use context sensing from speech to build new types of context-aware applications for wearable computers and other mobile devices. An application that knew that the speaker was referring to past events could perform tasks like retrieve documents and e-mails that referred to those past events. However, if the speaker was referring to the current situation, the application could know to make use of sensory information to improve its understanding of the current context. And if the speaker was referring to potential future events, like 'going to a movie this weekend', the application could assist the user by making plans to help make those events happen (or not happen, as the case may be), for instance by offering to purchase movie tickets on-line.</Paragraph> <Paragraph position="5"> Our early experiments reasoning with LifeNet treat it as Markov network, an undirected graphical model where the nodes represent random variables and the edges joint probability constraints relating those variables. We convert LifeNet into a series of joint probabilities (the details of this process are described later this paper), and we reason with the resulting network using local belief updating techniques. We engage in 'loopy' belief propagation as described by Pearl (Pearl, 1988). Belief propagation in a Markov network is straightforward. 
<Paragraph position="6"> These simple updating rules run fairly quickly even on a knowledge base as large as LifeNet. In our optimized Common Lisp implementation, on a 1.8 GHz Pentium 4 with 512 MB of RAM, a single iteration of belief updating runs in 15 seconds. Inference is further sped up by restricting the belief updating to run only on nodes within a fixed distance of the evidence nodes. Given a single evidence node and using only those nodes within a depth of three edges, a single iteration of belief updating runs in as little as 0.5 seconds for some nodes; on average it takes about 3 seconds.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Model Integration and Implementation </SectionTitle> <Paragraph position="0"> GISTER leverages the commonsense facts within OMCSNet to generate discrete conceptual topics from a given transcript segmented into twenty-word-long observations, with each twenty-word observation independent of the others. We extended GISTER to infer the overall tense of the text within each observation by detecting verb tenses, auxiliary verbs like did and will, and specific temporal expressions like yesterday and tomorrow. LifeNet then allows us to calculate the transition probabilities to a specific propositional representation based on previous states.</Paragraph>
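<Paragraph position="1"> As a rough illustration of this tense-assignment step, the toy Python sketch below tallies a few lexical cues over twenty-word observation windows and takes a majority vote; the cue lists, the '-ed' suffix heuristic, and the tie-breaking default are illustrative guesses, not GISTER's actual feature set:

# Toy tense assignment over twenty-word observation windows. The cue
# lists below are illustrative stand-ins for GISTER's actual features.
import re

PAST_CUES = {'did', 'was', 'were', 'had', 'ate', 'went', 'yesterday', 'ago', 'last'}
FUTURE_CUES = {'will', 'shall', 'gonna', 'going', 'tomorrow', 'later', 'next'}

def assign_tense(observation):
    tokens = re.findall(r"[a-z']+", observation.lower())
    past = sum(t in PAST_CUES or t.endswith('ed') for t in tokens)
    future = sum(t in FUTURE_CUES for t in tokens)
    if past > future:
        return 'past'
    if future > past:
        return 'future'
    return 'present'  # default when cues tie or are absent

def observations(transcript, width=20):
    # Segment a transcript into independent twenty-word observations.
    words = transcript.split()
    return [' '.join(words[i:i + width]) for i in range(0, len(words), width)]

print(assign_tense("ate a ton of sushi last night it tasted good"))       # past
print(assign_tense("think i'm going to have to see a doctor tomorrow"))   # future
</Paragraph>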
<Paragraph position="2"> By using the independent output of GISTER as input into LifeNet, we are able to improve inferences of a user's context, which can subsequently be used as training data for improved models of human behavior.</Paragraph> <Paragraph position="3"> As shown in Figure 3, using the output of GISTER for inference in LifeNet yields additional insight into the user's situation. If the output from GISTER is 'eating sushi' assigned a past tense, while 'going to the doctor' is assigned a future tense, LifeNet can make educated inferences about what happened to the user. This inference can be fed back into the lower level of the model by weighting words like 'sick', 'full', and 'tired' and rerunning the semantic filtering technique. With this feedback incorporated into the system, the filtering technique would be much less likely to exclude words related to being sick, despite their having initially been filtered from the transcript. If GISTER's output changes, the process continues until the two systems converge on a solution.</Paragraph> <Paragraph position="4"> [Figure 3: Example of GISTER output used for inference in LifeNet. Gists such as 'uggg.. ate a ton of sushi last night...' and 'think i'm going to have to see a doctor tomorrow...' map to the propositions 'I eat sushi' (past) and 'I go to the doctor's office' (future), from which LifeNet infers present-state propositions such as 'I feel sick'.]</Paragraph> <Paragraph position="5"> We propose a variation to the Markov network implementation of LifeNet described in section 4.1. Noisy transcript and signal data are still used as the initial input to the system; GISTER then processes this data, semantically filters the speech, and calculates the likely subjects of conversation and their tenses. Highly ranked output from GISTER is then used as temporal observations for inference on the LifeNet model, as shown in Figure 3. These observations are linked to the specific nodes within LifeNet that correspond to the given tense (past, present, future). We used multiple root nodes with weights proportional to the ranks generated by GISTER. This belief weighting scheme accounts for the uncertainty of GISTER's output, while starting with multiple roots enables much richer inference.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Preliminary Results </SectionTitle> <Paragraph position="0"> The system was initially tested on an office worker's conversation about how she had eaten too much the day before and would have to go to the doctor's office the next day. The following noisy transcripts were input into GISTER:</Paragraph> <Paragraph position="1"> PAST: had sushi for lunch could then have thought so he and then so yesterday's the sushi I its while I was at the Senate Committee lunch it tasted good sign yet was expenses over a cost me $7 to buy six roles and they lead to much of its in the rules were not a very good and I ate too many roles half so after words about six hours later I wasn't feeling very well of this so more Matsushita I never bought some sugar before usually advised chicken sandwich usual and normal food there I thought that this issue would be a good deal I also bought some seltzer water was so worked well and silence</Paragraph> <Paragraph position="2"> FUTURE: of debt reduction appointment tomorrow they can see mental tomorrow to clock will meet Dr. Smith and he's going to put my stomach because of what I a yesterday bomb I'm hoping that when I'll feel better so looking forward to going</Paragraph> <Paragraph position="3"> In this experiment an overall tense was assigned to each passage. GISTER correctly inferred that the first passage referred to past events and the second to future events, and output potential topics of the conversation for each of those time periods.</Paragraph> <Paragraph position="4"> [Table 3: Topics output by GISTER for the past and future passages.]</Paragraph> <Paragraph position="5"> The topics generated by GISTER in Table 3 were subsequently used as observational inputs to the next stage of the model. These topics were mapped to the past and future nodes within LifeNet and marked as 'true', and then we ran the loopy belief propagation algorithm described earlier. The solution converged on nodes representing the present state, in between the first tier (past) and the third tier (future). The nodes deemed most likely by the system are listed in Table 4 below.</Paragraph> <Paragraph position="6"> [Table 4: Present-state nodes deemed most likely by the system.]</Paragraph>
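<Paragraph position="7"> To make this procedure concrete, the sketch below wires the pieces together, reusing the MarkovNet class from the section 4.1 sketch (assumed to be in scope): ranked, tense-tagged topics from GISTER become weighted root observations on LifeNet nodes, and loopy propagation then yields beliefs over present-state propositions. The node names, potentials, rank scores, and the soft-evidence encoding are all illustrative assumptions rather than values from the actual system:

# Sketch: weighted GISTER topics as soft evidence for LifeNet inference.
# Assumes the MarkovNet class from the section 4.1 sketch is in scope.

# (topic proposition, tense tag, GISTER rank score in [0, 1]) - invented values
gister_output = [
    ("I eat sushi", "past", 0.9),
    ("I go to the doctor's office", "future", 0.8),
]

def clamp_soft_evidence(net, node, weight):
    # Bias the node toward 'true' in proportion to GISTER's confidence,
    # rather than hard-clamping it as in the earlier example.
    net.phi[node] = [1.0 - weight, weight]

net = MarkovNet()
net.add_node("I eat sushi", [0.9, 0.1])                    # past-tier node
net.add_node("I go to the doctor's office", [0.95, 0.05])  # future-tier node
net.add_node("I feel sick", [0.9, 0.1])                    # present-tier node
net.add_edge("I eat sushi", "I feel sick", [[0.9, 0.1], [0.4, 0.6]])
net.add_edge("I feel sick", "I go to the doctor's office", [[0.95, 0.05], [0.5, 0.5]])

for topic, tense, rank in gister_output:
    clamp_soft_evidence(net, topic, rank)   # multiple weighted roots

beliefs = net.beliefs(net.propagate())
print("P(I feel sick) =", round(beliefs["I feel sick"][1], 3))
</Paragraph>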
</Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Training Future Models of Human Behavior </SectionTitle> <Paragraph position="0"> When this system is deployed on many users over an extended period of time, information about people's behavior can begin to influence the initial priors from LifeNet. Although it has not yet been determined how these additional links could be made, this represents an alternative method for increasing the common sense knowledge stored within LifeNet. Additionally, extensive observations of the same people could augment the original commonsense model by better reflecting an individual's behavior.</Paragraph> </Section> </Section> <Section position="4" start_page="0" end_page="0" type="conclu"> <SectionTitle> 5 Conclusions </SectionTitle> <Paragraph position="0"> Combining common sense with speech and other types of sensory context presents abundant opportunities within a wide range of fields, from artificial intelligence and ubiquitous computing to traditional social science. By integrating two common sense knowledge bases, we have developed a method for inferring human behavior from noisy transcripts and sensor data. As mobile phones and PDAs become ever more embedded in society, the additional contextual information they provide will become invaluable for a variety of applications. This paper has shown the potential for these devices to leverage this information to begin understanding informal face-to-face conversations and inferring a user's context.</Paragraph> </Section> </Paper>