File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/04/w04-3013_abstr.xml
Size: 5,557 bytes
Last Modified: 2025-10-06 13:44:06
<?xml version="1.0" standalone="yes"?> <Paper uid="W04-3013"> <Title>Context Sensing using Speech and Common Sense</Title> <Section position="1" start_page="0" end_page="0" type="abstr"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> We present a method of inferring aspects of a person's context by capturing conversation topics and using prior knowledge of human behavior. This paper claims that topic-spotting performance can be improved by using a large database of common sense knowledge. We describe two systems we built to infer context from noisy transcriptions of spoken conversations using common sense, and detail some preliminary results. The GISTER system uses OMCSNet, a commonsense semantic network, to infer the most likely topics under discussion in a conversation stream. The OVERHEAR system is built on top of GISTER, and distinguishes between aspects of the conversation that refer to past, present, and future events by using LifeNet, a probabilistic graphical model of human behavior, to help infer the events that occurred in each of those three time periods. We conclude by discussing some of the future directions we may take this work.</Paragraph> <Paragraph position="1"> Introduction Can we build computers that infer a speaker's context by summarizing the conversation's gist? Once computers are able to capture the gist of a conversation, an enormous number of potential applications become possible. However, current topic-spotting methods have met with little success in characterizing spontaneous conversations involving hundreds of potential topics (Jebara et al., 2000). 
This paper claims that performance can be greatly improved by using not only the text of a speech transcription but also perceptual and commonsensical information from the dialogue.</Paragraph> <Paragraph position="2"> We have equipped a suite of wearable computers to provide the perceptual information a human needs to infer a conversation's gist and predict subsequent events. To take the human fully out of the loop, we have given the system two commonsense knowledge bases that enable the computer to make educated inferences about the user's context. Implementation Our system incorporated a Zaurus Linux handheld computer with an 802.11b CF card and a wireless Bluetooth headset microphone. Applications were written to enable the Zaurus to stream high-quality audio (22 kHz, 16-bit) to an available 802.11b network, or to store the audio locally when no network is detected. Besides carrying streamed audio, packets on this wireless network could be 'sniffed' by PDAs to determine who else is in local proximity. Access point signal strength was correlated with location using a static table look-up procedure. The system is typically kept in a participant's pocket or, for those with Bluetooth headsets, stored in a briefcase, purse, or backpack.</Paragraph> <Paragraph position="3"> Audio Processing and Transcription ViaVoice, a commercial speech recognition engine, is used to transcribe the audio streams; however, word recognition rates typically fall below 35% for spontaneous speech (Eagle &amp; Pentland, 2002). This inaccuracy poses a serious problem for determining the gist of an interaction. 
However, a human can read through a noisy transcript and, with adequate perceptual cues, still form an impression of the underlying conversation topic.</Paragraph> <Paragraph position="4">
Speaker 1: you do as good each key in and tell on that this this printers' rarely broken key fixed on and off-fixes and the new nine-month London deal on and then now take paper out and keep looking cartridges and then see if we confine something of saw someone to fix it but see Saddam out of the system think even do about it had tools on is there a persona for the minister what will come paper response to use the paper is not really going to stay in the printer for very much longer high is Chinese college and shredded where inks that inks is really know where the sounds like a Swiss have to have played by ear than
Speaker 2: a can what can do that now I think this this seems to work on which side is working are in
Speaker 1: an hour riderless I E fix the current trend the Stratton practice page of the test casings to of printed nicely I think jacking years ago that is paid toes like a printed Neisse
Additional context, such as knowing that the conversation is occurring in an office or, more precisely, next to a printer, may help many people understand that the conversation is about fixing a printer jam. Prior knowledge about the conversation participants and the time of day may also significantly improve a person's ability to infer the gist of the interaction; for example, one of the speakers could be a printer repairman. Our work suggests that the additional contextual and commonsensical information a human can employ for inference on the transcript above is equally helpful to a probabilistic model.</Paragraph> <Paragraph position="5"> As will be shown, this additional contextual and commonsense information can be used to form probabilistic models relating observed keywords to conversation topic. 
Thus, by combining audio and contextual information from a mobile device with a commonsense knowledge network, we can determine the gist of noisy, face-to-face conversations. In the above example, our system correctly labeled the conversation as 'printing on printer'.</Paragraph> </Section></Paper>