<?xml version="1.0" standalone="yes"?>
<Paper uid="W06-3402">
<Title>Off-Topic Detection in Conversational Telephone Speech</Title>
<Section position="5" start_page="8" end_page="9" type="metho">
<SectionTitle> 4 Data </SectionTitle>
<Paragraph position="0"> In our work we use human-transcribed conversations from the Fisher data (LDC, 2004). In each conversation, participants have been given a topic to discuss for ten minutes. Despite this, participants often talk about subjects that are not at all related to the assigned topic. Therefore, a convenient way to define irrelevance in this domain is as segments that do not contribute to understanding the assigned topic. This very natural definition makes the domain a good one for initial study; however, the idea can be readily extended to other domains.</Paragraph>
<Paragraph position="1"> For example, broadcast debates, class lectures, and meetings usually have specific topics of discussion.</Paragraph>
<Paragraph position="2"> The primary transactional goal of participants in the telephone conversations is to discuss the assigned topic. Since this goal directly involves the act of discussion itself, it is not surprising that participants often talk about the current conversation or the choice of topic. There are enough such segments that we assign them a special region type: Metaconversation. The purely irrelevant segments we call Small Talk, and the remaining segments are defined as On-Topic. We define utterances as segments of speech that are delineated by periods and/or speaker changes. An annotated excerpt is shown in Table 1.</Paragraph>
<Paragraph position="3"> For the experiments described in this paper, we selected 20 conversations: 4 from each of the topics &quot;computers in education&quot;, &quot;bioterrorism&quot;, &quot;terrorism&quot;, &quot;pets&quot;, and &quot;censorship&quot;. These topics were chosen randomly from the 40 topics in the Fisher corpus, with the constraint that we wanted to include topics that could be a part of normal small talk (such as &quot;pets&quot;) as well as topics that seem farther removed from small talk (such as &quot;censorship&quot;).</Paragraph>
<Paragraph position="4"> Our selected data set consists of slightly more than 5,000 utterances. We had 2-3 human annotators label the utterances in each conversation, choosing from the 3 labels Metaconversation, Small Talk, and On-Topic. On average, pairs of annotators agreed with each other on 86% of utterances. The main source of annotator disagreement was between Small Talk and On-Topic regions; in most cases this resulted from differences in opinion about when exactly the conversation had drifted too far from the topic to be relevant.</Paragraph>
<Paragraph position="5"> For the 14% of utterances with mismatched labels, we chose the label that would be &quot;safest&quot; in the information retrieval context where small talk might get discarded. If any of the annotators thought a given utterance was On-Topic, we kept it On-Topic.</Paragraph>
<Paragraph position="6"> If there was a disagreement between Metaconversation and Small Talk, we used Metaconversation. Thus, a Small Talk label was assigned only if all annotators agreed on it.</Paragraph>
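<Paragraph> The adjudication rule above amounts to a fixed label priority (On-Topic, then Metaconversation, then Small Talk), so that Small Talk requires unanimity. The following Python sketch is only an illustration of that rule, not part of any released annotation tool, and the label strings and function name are ours.

# Minimal sketch of the label-adjudication rule described above.
# Priority: On-Topic wins over Metaconversation, which wins over Small Talk,
# so a Small Talk label survives only if every annotator chose it.
PRIORITY = ["On-Topic", "Metaconversation", "Small Talk"]

def adjudicate(labels):
    """Merge the labels assigned by 2-3 annotators to a single utterance."""
    for label in PRIORITY:
        if label in labels:
            return label
    raise ValueError("unexpected labels: %r" % (labels,))

# Example: one annotator says Small Talk, another On-Topic; we keep On-Topic.
assert adjudicate(["Small Talk", "On-Topic"]) == "On-Topic"
assert adjudicate(["Metaconversation", "Small Talk"]) == "Metaconversation"
assert adjudicate(["Small Talk", "Small Talk", "Small Talk"]) == "Small Talk"
</Paragraph>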
</Section>
<Section position="6" start_page="9" end_page="10" type="metho">
<SectionTitle> 5 Experimental Setup 5.1 Features </SectionTitle>
<Paragraph position="0"> As indicated in Section 1, we apply machine learning algorithms to utterances extracted from telephone conversations in order to learn classifiers for Small Talk, Metaconversation, and On-Topic. We represent utterances as feature vectors, basing our selection of features on both linguistic insights and earlier text classification work.</Paragraph>
<Paragraph position="1"> As described in Section 2, work on the linguistics of conversational speech (Cheepen, 1988; Laver, 1981) implies that the following features might be indicative of small talk: (1) position in the conversation, (2) the use of present-tense verbs, and (3) a lack of common helper words such as &quot;it&quot;, &quot;there&quot;, and forms of &quot;to be&quot;. To model the effect of proximity to the beginning of the conversation, we attach to each utterance a feature that describes its approximate position in the conversation. We do not include a feature for proximity to the end of the conversation because our transcriptions include only the first ten minutes of each recorded conversation.</Paragraph>
<Paragraph position="2"> In order to include features describing verb tense, we use Brill's part-of-speech tagger (Brill, 1992).</Paragraph>
<Paragraph position="3"> Each part of speech (POS) is taken to be a feature, whose value is a count of the number of occurrences in the given utterance.</Paragraph>
<Paragraph position="4"> To account for the words themselves, we use a bag-of-words model with counts for each word. We normalize words from the human transcripts by converting everything to lower case and tokenizing contractions and punctuation.</Paragraph>
<Section position="1" start_page="10" end_page="10" type="sub_section">
<SectionTitle> Features Values </SectionTitle>
<Paragraph position="0"> Table 3 (the complete list of features and their possible values):
n word tokens | for each word, # occurrences
standard POS tags as in Penn Treebank | for each tag, # occurrences
line number in conversation | 0-4, 5-9, 10-19, 20-49, >49
utterance type | statement, question, fragment
utterance length (number of words) | 1, 2, ..., 20, >20
number of laughs | laugh count
n word tokens in previous 5 utterances | for each word, total # occurrences in 5 previous
tags from POS tagger, previous 5 | for each tag, total # occurrences in 5 previous
number of words, previous 5 | total from 5 previous
number of laughs, previous 5 | total from 5 previous
n word tokens, subsequent 5 utterances | for each word, total # occurrences in 5 subsequent
tags from POS tagger, subsequent 5 | for each tag, total # occurrences in 5 subsequent
number of words, subsequent 5 | total from 5 subsequent
number of laughs, subsequent 5 | total from 5 subsequent
</Paragraph>
<Paragraph position="1"> We rank the utility of words according to the feature quality measure presented in (Lewis and Gale, 1994) because it was devised for the task of classifying similarly short fragments of text (news headlines), rather than long documents. We then consider the top n tokens as features, varying the number in different experiments. Table 2 shows the most useful tokens for distinguishing between the three categories according to this metric.</Paragraph>
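<Paragraph> As a concrete illustration of the representation just described, the following Python sketch is ours rather than the authors' implementation: the helper names are hypothetical, and it assumes each utterance arrives already tokenized and POS-tagged (the paper used Brill's tagger; any Penn Treebank tagger could stand in here). It builds the per-utterance counts and the summed five-utterance context features.

from collections import Counter

def position_bucket(line_number):
    # Approximate position in the conversation, bucketed as in Table 3.
    for upper, name in [(4, "0-4"), (9, "5-9"), (19, "10-19"), (49, "20-49")]:
        if upper >= line_number:
            return name
    return ">49"

def utterance_features(tokens, tags, line_number, utt_type, laughs):
    # tokens: word tokens of the utterance; tags: Penn Treebank POS tags.
    feats = Counter()
    feats.update("word=" + t.lower() for t in tokens)   # bag-of-words counts
    feats.update("pos=" + t for t in tags)              # POS tag counts
    feats["position=" + position_bucket(line_number)] = 1
    feats["type=" + utt_type] = 1        # statement, question, or fragment
    feats["length"] = min(len(tokens), 21)               # 1..20, then ">20"
    feats["laughs"] = laughs
    return feats

def add_context(per_utt_feats, index, window=5):
    # Summed features over the previous and subsequent `window` utterances.
    ctx = Counter()
    for j in range(max(0, index - window), index):
        for key, value in per_utt_feats[j].items():
            ctx["prev5:" + key] += value
    for j in range(index + 1, min(len(per_utt_feats), index + 1 + window)):
        for key, value in per_utt_feats[j].items():
            ctx["next5:" + key] += value
    return per_utt_feats[index] + ctx   # sparse sum; zero counts are dropped

Summing raw counts over the neighboring utterances, as in add_context, corresponds to the &quot;total # occurrences in 5 previous/subsequent&quot; rows of Table 3.
</Paragraph>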
<Paragraph position="2"> Additionally, we include as features the utterance type (statement, question, or fragment), the number of words in the utterance, and the number of laughs in the utterance.</Paragraph>
<Paragraph position="3"> Because utterances are long enough to classify individually but too short to classify reliably, we consider not only features of the current utterance but also those of previous and subsequent utterances.</Paragraph>
<Paragraph position="4"> More specifically, summed features are calculated for the five preceding utterances and for the five subsequent utterances. The number five was chosen empirically. It is important to note that there is some overlap in features. For instance, the token &quot;?&quot; can be extracted as one of the n word tokens by Lewis and Gale's feature quality measure; it is also tagged by the POS tagger; and it is indicative of the utterance type, which is encoded as a separate feature as well.</Paragraph>
<Paragraph position="5"> However, redundant features do not make up a significant percentage of the overall feature set.</Paragraph>
<Paragraph position="6"> Finally, we note that the conversation topic is not taken to be a feature, as we cannot assume that conversations in general will have such labels. The complete list of features, along with their possible values, is summarized in Table 3.</Paragraph>
</Section>
<Section position="2" start_page="10" end_page="10" type="sub_section">
<SectionTitle> 5.2 Experiments </SectionTitle>
<Paragraph position="0"> We applied several classifier learning algorithms to our data: Naive Bayes, Support Vector Machines (SVMs), 1-nearest neighbor, and the C4.5 decision tree learning algorithm. We used the implementations in the Weka package of machine learning algorithms (Witten and Frank, 2005), running the algorithms with default settings. In each case, we performed 4-fold cross-validation, training on sets consisting of three of the conversations from each topic (15 conversations total) and testing on the remaining conversation from each topic (5 total). The average training set size was approximately 3,800 utterances, of which about 700 were Small Talk and 350 Metaconversation. The average test set size was 1,270 utterances.</Paragraph>
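<Paragraph> The data split (though not the Weka configuration) can be sketched as follows; this is our illustration, and the conversation identifiers below are hypothetical. With four conversations per topic, fold i tests on the i-th conversation of every topic and trains on the remaining three.

def topic_grouped_folds(conversations_by_topic, n_folds=4):
    # conversations_by_topic maps a topic to its list of four conversation ids.
    folds = []
    for i in range(n_folds):
        test = [convs[i] for convs in conversations_by_topic.values()]
        train = [c for convs in conversations_by_topic.values()
                 for j, c in enumerate(convs) if j != i]
        folds.append((train, test))
    return folds

# Hypothetical conversation ids, four per topic, as in Section 4.
topics = {
    "computers in education": ["ce1", "ce2", "ce3", "ce4"],
    "bioterrorism": ["bt1", "bt2", "bt3", "bt4"],
    "terrorism": ["tr1", "tr2", "tr3", "tr4"],
    "pets": ["pe1", "pe2", "pe3", "pe4"],
    "censorship": ["cs1", "cs2", "cs3", "cs4"],
}
for train, test in topic_grouped_folds(topics):
    assert len(train) == 15 and len(test) == 5   # 3 per topic train, 1 per topic test
</Paragraph>
</Section> </Section> </Paper>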