<?xml version="1.0" standalone="yes"?> <Paper uid="N04-1026"> <Title>Predicting Emotion in Spoken Dialogue from Multiple Knowledge Sources</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Annotating Student Emotion </SectionTitle>
<Paragraph position="0"> In our spoken dialogue tutoring corpus, student emotional states can only be identified indirectly, via what is said and/or how it is said. We have developed an annotation scheme for hand labeling the student turns in our corpus with respect to three types of perceived emotions (Litman and Forbes-Riley, 2004): Negative: a strong expression of emotion such as confused, bored, frustrated, uncertain. Because a syntactic question by definition expresses uncertainty, a turn containing only a question is by default labeled negative. An example negative student turn appears in Figure 1. Evidence of a negative emotion comes from the lexical item "um", as well as acoustic and prosodic features, e.g., prior and post-utterance pausing and low pitch, energy and tempo.</Paragraph>
<Paragraph position="1"> Positive: a strong expression of emotion such as confident, interested, encouraged. An example is the positive student turn in Figure 1, with its lexical expression of certainty ("The engine"), and acoustic and prosodic features of louder speech and faster tempo.</Paragraph>
<Paragraph position="2"> Neutral: no strong expression of emotion, including weak (negative or positive) or contrasting (negative and positive) expressions, as well as no expression. Because groundings serve mainly to encourage another speaker to continue speaking, a student turn containing only a grounding is by default labeled neutral. An example is the grounding student turn in Figure 1; in this case, acoustic and prosodic features such as moderate loudness and tempo give evidence for the neutral label (rather than overriding it). The features mentioned in the examples above were elicited during post-annotation discussion, for expository use in this paper. To avoid influencing the annotator's intuitive understanding of emotion expression, and because such features are not used consistently or unambiguously across speakers, our manual contains examples of labeled dialogue excerpts (as in Figure 1) with links to corresponding audio files, rather than a description of particular features associated with particular labels.</Paragraph>
<Paragraph position="3"> Our work differs from prior emotion annotations of spontaneous spoken dialogues in several ways. Although much past work predicts only two classes (e.g., negative/non-negative) (Batliner et al., 2003; Ang et al., 2002; Lee et al., 2001), our experiments produced the best predictions using our three-way distinction. In contrast to (Lee et al., 2001), our classifications are context-relative (relative to other turns in the dialogue) and task-relative (relative to tutoring), because, like (Ang et al., 2002), we are interested in detecting emotional changes across our dialogues. Although (Batliner et al., 2003) also employ a relative classification, they explicitly associate specific features with emotional utterances.</Paragraph>
<Paragraph position="4"> To analyze the reliability of our annotation scheme, we randomly selected 10 transcribed dialogues from our human-human tutoring corpus (which is comparable to the computer corpus that will result from ITSPOKE's evaluation, in that both corpora are collected using the same experimental method, student pool, pre- and post-test, and physics problems), yielding a dataset of 453 student turns.
(Turn boundaries were manually annotated prior to emotion annotation by a paid transcriber.) The 453 turns were separately annotated by two different annotators as negative, neutral or positive, following the emotion annotation instructions described above. The two annotators agreed on the annotations of 385/453 turns, achieving 84.99% agreement, with Kappa = 0.68. (Kappa is computed as (P(A) - P(E)) / (1 - P(E)) (Carletta, 1996), where P(A) is the proportion of times the annotators agree and P(E) is the proportion of agreement expected by chance.) This inter-annotator agreement exceeds that of prior studies of emotion annotation in naturally occurring speech (e.g., agreement of 71% and Kappa of 0.47 in (Ang et al., 2002), and Kappa ranging between 0.32 and 0.42 in (Shafran et al., 2003)). As in (Lee et al., 2001), the machine learning experiments described below use only those 385 student turns where the two annotators agreed on an emotion label. Of these turns, 90 were negative, 280 were neutral, and 15 were positive.</Paragraph> </Section>
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Feature Extraction </SectionTitle>
<Paragraph position="0"> For each of the 385 agreed student turns described above, we next extracted the set of features itemized in Figure 2.</Paragraph>
<Paragraph position="1"> These features are used in our machine learning experiments (Section 5), and were motivated by previous studies of emotion prediction as well as by our own intuitions.</Paragraph> </Section>
<Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> Acoustic-Prosodic Features </SectionTitle>
<Paragraph position="0">
- 4 normalized fundamental frequency (f0) features: maximum, minimum, mean, standard deviation
- 4 normalized energy (RMS) features: maximum, minimum, mean, standard deviation
- 4 normalized temporal features: total turn duration, duration of pause prior to turn, speaking rate, amount of silence in turn
Non-Acoustic-Prosodic Features
- lexical items in turn
- 6 automatic features: turn begin time, turn end time, isTemporalBarge-in, isTemporalOverlap, #words in turn, #syllables in turn
- 6 manual features: #false starts in turn, isPriorTutorQuestion, isQuestion, isSemanticBarge-in, #canonical expressions in turn, isGrounding
Identifier Features: subject, subject gender, problem

Following other studies of spontaneous dialogues (Ang et al., 2002; Lee et al., 2001; Batliner et al., 2003; Shafran et al., 2003), our acoustic-prosodic features represent knowledge of pitch, energy, duration, tempo and pausing. F0 and RMS values, representing measures of pitch and loudness, respectively, are computed using Entropic Research Laboratory's pitch tracker, get_f0, with no post-correction. Turn Duration and Prior Pause Duration are calculated via the turn boundaries added during the transcription process. Speaking Rate is calculated as syllables (from an online dictionary) per second in the turn, and Amount of Silence is approximated as the proportion of zero-f0 frames for the turn, i.e., the proportion of time the student was silent. In a pilot study of our corpus, we extracted raw values of these acoustic-prosodic features, then normalized (divided) each feature by the same feature's value for the first student turn in the dialogue, and by the value for the immediately prior student turn. We found that features normalized by first turn were the best predictors of emotion (Litman and Forbes, 2003).</Paragraph>
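As an informal illustration only (the authors computed F0 and RMS with Entropic's get_f0 tool, not with the code below), the following sketch shows how the 12 normalized acoustic-prosodic features described above could be derived from per-turn pitch and energy tracks. The field names and helper structure are assumptions made for exposition, not part of the original system.

# Minimal sketch (not the authors' code): 12 normalized acoustic-prosodic
# features per student turn. Assumes each turn is a dict with f0/RMS frame
# arrays from a pitch tracker, begin/end times, a prior-pause duration,
# and a syllable count from a pronunciation dictionary.
import numpy as np

def raw_acoustic_prosodic_features(turn):
    f0 = np.asarray(turn["f0"], dtype=float)    # one value per frame; 0 = unvoiced/silent
    rms = np.asarray(turn["rms"], dtype=float)
    voiced = f0[f0 > 0]                          # assumes at least one voiced frame
    duration = turn["end_time"] - turn["begin_time"]
    feats = {}
    for name, track in (("f0", voiced), ("rms", rms)):
        feats[name + "_max"] = float(track.max())
        feats[name + "_min"] = float(track.min())
        feats[name + "_mean"] = float(track.mean())
        feats[name + "_std"] = float(track.std())
    feats["duration"] = duration
    feats["prior_pause"] = turn["prior_pause"]
    feats["speaking_rate"] = turn["syllables"] / duration   # syllables per second
    feats["pct_silence"] = float((f0 == 0).mean())          # proportion of zero-f0 frames
    return feats

def normalize_by_first_turn(feats, first_turn_feats):
    # The pilot study found normalization by the dialogue's first student
    # turn to work best; zero-valued reference features are not handled here.
    return {k: v / first_turn_feats[k] for k, v in feats.items()}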
<Paragraph position="1"> While acoustic-prosodic features address how something is said, features representing what is said are also important. Lexical information has been shown to improve speech-based emotion prediction in other domains (Litman et al., 2001; Lee et al., 2002; Ang et al., 2002; Batliner et al., 2003; Devillers et al., 2003; Shafran et al., 2003), so our first non-acoustic-prosodic feature represents the transcription of each student turn as a word occurrence vector (indicating the lexical items that are present in the turn).</Paragraph>
<Paragraph position="2"> The next set of non-acoustic-prosodic features are also automatically derivable from the transcribed dialogue.</Paragraph>
<Paragraph position="3"> Turn begin and end times (measured from the start of the dialogue; e.g., the begin time of the tutor turn in Figure 1 is 9.1 minutes) are retrieved from turn boundaries, as are the decisions as to whether a turn is a temporal barge-in (i.e., the turn began before the prior tutor turn ended) or a temporal overlap (i.e., the turn began and ended within a tutor turn). These features were motivated by the use of turn position as a feature for emotion prediction in (Ang et al., 2002), and the fact that measures of dialogue interactivity have been shown to correlate with learning gains in tutoring (Core et al., 2003). The number of words and syllables in a turn provide alternative ways to quantify turn duration (Litman et al., 2001).</Paragraph>
<Paragraph position="4"> The last set of 6 non-acoustic-prosodic features represent additional syntactic, semantic, and dialogue information that had already been manually annotated in our transcriptions, and thus was available for use as predictors; as future research progresses, this information might one day be computed automatically. Our transcriber labels false starts (e.g., "I do- don't"), syntactic questions, and semantic barge-ins. Semantic barge-ins occur when a student turn interrupts a tutor turn at a word or pause boundary; unlike temporal barge-ins, semantic barge-ins do not overlap temporally with tutor turns. Our transcriber also labels certain canonical expressions that occur frequently in our tutoring dialogues and function as hedges or groundings; examples include "uh", "mm-hm", "ok", etc. (Evens, 2002) have argued that hedges can indicate emotional speech (e.g., uncertainty). However, many of the same expressions also function as groundings, which generally correspond to neutral turns in our dialogues. We distinguish groundings as turns that consist only of a labeled canonical expression and are not preceded by (i.e., not answering) a tutor question. (This definition is consistent but incomplete; e.g., repeats can also function as groundings, but are not currently included.) Finally, we recorded 3 identifier features for each turn. Prior studies (Oudeyer, 2002; Lee et al., 2002) have shown that subject and gender can play an important role in emotion recognition, because different genders and/or speakers can convey emotions differently. Subject and problem are uniquely important in our tutoring domain because, in contrast to, e.g., call centers, where every caller is distinct, students will use our system repeatedly, and problems are repeated across students.</Paragraph>
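To make the feature organization above concrete, here is a minimal sketch (ours, not the authors' implementation) of how a turn's word-occurrence vector and the automatic, manual, and identifier features might be assembled into a single per-turn feature vector. All field and function names are hypothetical.

# Sketch: assemble one turn's non-acoustic-prosodic features, pairing a
# binary word-occurrence vector with the automatic, manual, and identifier
# features described above. Field names are illustrative, not the authors'.
def word_occurrence_vector(transcription, vocabulary):
    words = set(transcription.lower().split())
    return [1 if w in words else 0 for w in vocabulary]

def turn_feature_vector(turn, vocabulary):
    lexical = word_occurrence_vector(turn["transcription"], vocabulary)
    automatic = [
        turn["begin_time"], turn["end_time"],
        int(turn["is_temporal_barge_in"]), int(turn["is_temporal_overlap"]),
        turn["num_words"], turn["num_syllables"],
    ]
    manual = [
        turn["num_false_starts"], int(turn["is_prior_tutor_question"]),
        int(turn["is_question"]), int(turn["is_semantic_barge_in"]),
        turn["num_canonical_expressions"], int(turn["is_grounding"]),
    ]
    identifier = [turn["subject_id"], turn["gender"], turn["problem_id"]]
    return lexical + automatic + manual + identifier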
</Section>
<Section position="7" start_page="0" end_page="0" type="metho"> <SectionTitle> 5 Emotion Prediction using Learning </SectionTitle>
<Paragraph position="0"> We next performed machine learning experiments using the feature sets in Figure 3, to study the effects that various feature combinations had on predicting emotion.</Paragraph>
<Paragraph position="1"> We compare our normalized acoustic-prosodic feature set (speech) with 3 non-acoustic-prosodic feature sets, which we will refer to as text-based sets: one containing only the lexical items in the turn (lexical), another containing the lexical items and the automatic features (autotext), and a third containing all 13 features (alltext). We further compare each of these 4 feature sets with an identical set supplemented with our 3 identifier features (+ident sets).</Paragraph>
<Paragraph position="2">
- speech: 12 normalized acoustic-prosodic features
- lexical: lexical items in turn
- autotext: lexical + 6 automatic features
- alltext: lexical + 6 automatic + 6 manual features

We use the Weka machine learning software (Witten and Frank, 1999) to automatically learn our emotion prediction models. In earlier work (Litman and Forbes, 2003), we used Weka to compare a nearest-neighbor classifier, a decision tree learner, and a boosting algorithm. We found that the boosting algorithm, called AdaBoost (Freund and Schapire, 1996), consistently yielded the most robust performance across feature sets and evaluation metrics; in this paper we thus focus on AdaBoost's performance. Boosting algorithms generally enable the accuracy of a weak learning algorithm to be improved by repeatedly applying it to different distributions of training examples (Freund and Schapire, 1996). Following (Oudeyer, 2002), we select the decision tree learner as AdaBoost's weak learning algorithm.</Paragraph>
<Paragraph position="3"> To investigate how well our emotion data can be learned with only speech-based or text-based features, Table 1 shows the mean accuracy (percent correct) and standard error (SE) of AdaBoost on the 8 feature sets from Figure 3, computed across 10 runs of 10-fold cross-validation. (The standard error yields a 95% confidence interval: if the confidence intervals for two feature sets are non-overlapping, then their mean accuracies are significantly different with 95% confidence.) Although not shown in this and later tables, all of the feature sets examined in this paper predict emotion significantly better than a standard majority class baseline algorithm (always predict neutral, which yields an accuracy of 72.74%). For Table 1, AdaBoost's improvement for each feature set, relative to this baseline error of 27.26%, averages 24.40%. As shown in Table 1, the best accuracy of 84.70% is achieved on the alltext+ident feature set. This accuracy is significantly better than the accuracy of the seven other feature sets, although the difference between the +/-ident versions was not significant for any other pair besides alltext. In addition, the results of five of the six text-based feature sets are significantly better than the results of both acoustic-prosodic feature sets (speech +/- ident). Only the text-only feature set (lexical-ident) did not perform statistically better than speech+ident (although it did perform statistically better than speech-ident). These results show that while acoustic-prosodic features can be used to predict emotion significantly better than a majority class baseline, using only non-acoustic-prosodic features consistently produces significantly better results still. Furthermore, the more text-based features the better, i.e., supplementing lexical items with additional features consistently yields further accuracy increases. While adding in the subject- and problem-specific +ident features improves the accuracy of all the -ident feature sets, the improvement is only significant for the highest-performing set (alltext).</Paragraph>
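The experiments above use Weka's AdaBoost with a decision tree weak learner. As a rough, non-authoritative sketch of the evaluation protocol (10 runs of 10-fold cross-validation, reporting mean accuracy and standard error), an analogous setup can be written with scikit-learn standing in for Weka; the learner settings and the exact way the standard error is computed here are assumptions, not the authors' configuration.

# Rough sketch of the evaluation protocol, with scikit-learn standing in
# for Weka's AdaBoost + decision tree learner: 10 runs of 10-fold
# cross-validation, reporting mean accuracy and its standard error.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

def mean_accuracy_and_se(X, y, runs=10, folds=10):
    scores = []
    for run in range(runs):
        clf = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=50)
        cv = StratifiedKFold(n_splits=folds, shuffle=True, random_state=run)
        scores.extend(cross_val_score(clf, X, y, cv=cv))   # accuracy per fold
    scores = np.asarray(scores)
    return scores.mean(), scores.std(ddof=1) / np.sqrt(len(scores))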
<Paragraph position="4"> The next question we addressed concerns whether combinations of acoustic-prosodic and other types of features can further improve AdaBoost's predictive accuracy. We investigated AdaBoost's performance on the set of 6 feature sets formed by combining the speech acoustic-prosodic set with each text-based set, both with and without identifier features, as shown in Table 2.</Paragraph>
<Paragraph position="5"> AdaBoost's best accuracy of 84.26% is achieved on the alltext+speech+ident combined feature set. This result is significantly better than the % correct achieved on the four autotext and lexical combined feature sets, but is not significantly better than the alltext+speech-ident feature set. Furthermore, there was no significant difference between the results of the autotext and lexical combined feature sets, nor between the -ident and +ident versions for the 6 combined feature sets.</Paragraph>
<Paragraph position="6"> Comparing the results of these combined (speech+text) feature sets with the speech versus text results in Table 1, we find that for autotext+speech-ident and all +ident feature sets, the combined feature set slightly decreases predictive accuracy when compared to the corresponding text-only feature set. However, there is no significant difference between the best results in each table (alltext+ident vs. alltext+speech+ident).</Paragraph>
<Paragraph position="7"> In addition to accuracy, other important evaluation metrics include recall, precision, and F-Measure. Table 3 shows AdaBoost's performance with respect to these metrics across emotion classes for the alltext+speech+ident feature set, using leave-one-out cross-validation (LOO). AdaBoost accuracy here is 82.08%. As shown, AdaBoost yields the best performance for the neutral (majority) class, and has better performance for negatives than for positives. We also found positives to be the most difficult emotion to annotate. Overall, however, AdaBoost performs significantly better than the baseline, whose precision, recall and F-measure for negatives and positives are 0, and for neutrals are 0.727, 1, and 0.842, respectively.</Paragraph>
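As a sketch of the per-class evaluation just described (leave-one-out cross-validation with per-class precision, recall, and F-measure), the following again uses scikit-learn as a stand-in for Weka; it also spells out the majority-class baseline arithmetic implied by the 280 neutral turns out of 385.

# Sketch of the per-class evaluation: leave-one-out cross-validation with
# per-class precision, recall, and F-measure (scikit-learn as a stand-in).
# Majority-class baseline for comparison: always predicting "neutral" gives
# neutral precision 280/385 = 0.727, recall 1, F-measure 2*0.727/1.727 = 0.842,
# and 0 for the negative and positive classes.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_predict
from sklearn.metrics import classification_report

def loo_class_report(X, y):
    clf = AdaBoostClassifier(DecisionTreeClassifier(), n_estimators=50)
    predicted = cross_val_predict(clf, X, y, cv=LeaveOneOut())
    return classification_report(y, predicted, digits=3)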
</Section>
<Section position="8" start_page="0" end_page="0" type="metho"> <SectionTitle> 6 Adding Context-Level Features </SectionTitle>
<Paragraph position="0"> Research in other domains (Litman et al., 2001; Batliner et al., 2003) has shown that features representing the dialogue context can sometimes improve the accuracy of predicting negative user states, compared to the use of features computed from only the turn to be predicted.</Paragraph>
<Paragraph position="1"> Thus, we investigated the impact of supplementing our turn-level features in Figure 2 with the features in Figure 4, representing local and global aspects of the prior dialogue, respectively.</Paragraph>
<Paragraph position="2">
- Local Features: feature values for the two student turns preceding the student turn to be predicted
- Global Features: running averages and totals for each feature, over all student turns preceding the student turn to be predicted

(Running totals are only computed for numeric features if the result is interpretable, e.g., for turn duration, but not for tempo. Running averages for text-based features additionally include a "# turns so far" feature and a "# essays so far" feature.)

We next performed machine learning experiments using our two original speech-based feature sets (speech +/- ident) and four of our text-based feature sets (autotext and alltext +/- ident), each separately supplemented with local, global, and local+global features. Table 4 presents the results of these experiments.</Paragraph>
<Paragraph position="3"> AdaBoost's best accuracy of 83.85% is achieved on the alltext+glob-ident combined feature set. This result is not significantly better than the % correct achieved on its +ident counterpart, but both of these results are significantly better than the % correct achieved on all other 16 feature sets. Moreover, all of the results for both the alltext and autotext feature sets were significantly better than the results for all of the speech feature sets. Although the alltext+loc feature sets were not significantly better than the best autotext feature sets (autotext+glob), they were better than the remaining autotext feature sets, and the alltext+loc+glob feature sets were better than all of the autotext feature sets. For all feature sets, the difference between the -ident and +ident versions was not significant. In sum, we see again that the more text-based features the better: adding text-based features again consistently improves results significantly. We also see that global features perform better than local features, and while global+local features perform better than local features, global features alone consistently yield the best performance.</Paragraph>
<Paragraph position="4"> Comparing these results with the results in Tables 1 and 2, we find that while overall the performance of contextual non-combined feature sets shows a small performance increase over most non-contextual combined or non-combined feature sets, there is again a slight decrease in performance across the best results in each table. However, there is no significant difference between these best results (alltext+glob-ident vs. alltext+speech+ident vs. alltext+ident).</Paragraph>
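Before turning to the combined contextual results, here is a minimal sketch of how the local and global context features of Figure 4 might be computed. Which features receive running totals versus averages is simplified here, and all names are illustrative rather than the authors'.

# Sketch: local and global context features for each student turn, as in
# Figure 4. "Local" copies feature values from the two preceding student
# turns; "global" keeps running averages over all preceding student turns
# (running totals, kept only where interpretable, are omitted here).
def add_context_features(turn_features):
    """turn_features: list of per-turn dicts of numeric features, in dialogue order."""
    contextualized = []
    running_sums = {}
    for i, feats in enumerate(turn_features):
        row = dict(feats)
        # Local context: values from the two preceding student turns (None if absent).
        for back in (1, 2):
            prev = turn_features[i - back] if i - back >= 0 else {}
            for name in feats:
                row["prev%d_%s" % (back, name)] = prev.get(name)
        # Global context: running averages over all preceding student turns.
        for name in feats:
            row["avg_" + name] = running_sums.get(name, 0.0) / i if i else 0.0
        row["num_turns_so_far"] = i
        for name, value in feats.items():
            running_sums[name] = running_sums.get(name, 0.0) + value
        contextualized.append(row)
    return contextualized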
<Paragraph position="5"> Table 5 shows the results of combining speech-based and text-based contextual feature sets. We investigated AdaBoost's performance on the 12 feature sets formed by combining the speech acoustic-prosodic set with our autotext and alltext text-based feature sets, both with and without identifier features, and each separately supplemented with local, global, and local+global features. AdaBoost's best accuracy of 84.75% is achieved on the alltext+speech+glob-ident combined feature set.</Paragraph>
<Paragraph position="6"> This result is not significantly better than the % correct achieved on its +ident counterpart, but both results are significantly better than the % correct achieved on all 10 other feature sets. In fact, all the alltext results are significantly better than all the autotext results. Again, for all feature sets, the difference between the -ident and +ident versions was not significant. In sum, adding text-based features again consistently improves results significantly, and global features alone consistently yield the best performance. Although the best result across all experiments is that of alltext+speech+glob-ident, there is no significant difference between the best results here and those in our three other experimental conditions. A summary figure of our best results for text (alltext) and speech alone, then combined with each other and with our best result for context (global), is shown in Figure 5, for the +/- ident conditions; baseline performance is also shown. As shown, the accuracy of the -ident condition monotonically increases as features are added or replaced in the right-to-left order shown.</Paragraph>
<Paragraph position="7"> The +ident condition initially increases, then decreases with the addition of global or speech features to the alltext feature set, but then slightly increases again when these feature sets are combined. With fewer features, +ident typically outperforms -ident, although this switches when alltext and global features are combined.</Paragraph> </Section>
<Section position="9" start_page="0" end_page="0" type="metho"> <SectionTitle> 7 Feature Usage in Machine Learning </SectionTitle>
<Paragraph position="0"> As discussed above, we use AdaBoost to boost a decision tree algorithm. Although the Weka output of AdaBoost does not include a decision tree, to get an intuition about how our features are used to predict emotion classes in our domain, we ran the basic decision tree algorithm on our highest-performing feature set, alltext+speech+glob-ident. Table 6 shows the feature types used in this feature set, and the feature usage of each based on the structure of the tree. Following (Ang et al., 2002), feature usage is reported as the percentage of decisions for which the feature type is queried. As shown, the turn-based (non-context) text-based features are the most highly queried, with lexical items and manual features queried most, followed by the temporal (speech-based) features. Manual text-based global features are queried far more than other global features.</Paragraph> </Section>
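As an illustration of the feature-usage statistic just described (the percentage of decisions for which each feature type is queried), the sketch below computes an analogous measure from a fitted scikit-learn decision tree. This is a stand-in for the Weka tree actually inspected in the paper, and weighting each split node by the training samples routed through it is our assumption about what "percentage of decisions" means, not the authors' exact procedure.

# Sketch: "feature usage" as the percentage of decisions for which each
# feature type is queried, computed from a fitted scikit-learn decision
# tree (a stand-in for the Weka tree inspected in the paper). Each internal
# node is weighted by the number of training samples routed through it.
from collections import defaultdict

def feature_usage(fitted_tree_clf, feature_types):
    """feature_types[i] is the type label (e.g., 'lexical', 'manual') of feature i."""
    tree = fitted_tree_clf.tree_
    counts, total = defaultdict(float), 0.0
    for node in range(tree.node_count):
        if tree.children_left[node] != tree.children_right[node]:   # internal (split) node
            weight = float(tree.n_node_samples[node])               # decisions made here
            counts[feature_types[tree.feature[node]]] += weight
            total += weight
    return {ftype: 100.0 * c / total for ftype, c in counts.items()}

</Paper>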