<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1045">
  <Title>Predicting Student Emotions in Computer-Human Tutoring Dialogues</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Computer-Human Dialogue Data
</SectionTitle>
    <Paragraph position="0"> Our data consists of student dialogues with ITSPOKE (Intelligent Tutoring SPOKEn dialogue system) (Litman and Silliman, 2004), a spoken dialogue tutor built on top of the Why2-Atlas conceptual physics text-based tutoring system (VanLehn et al., 2002). In ITSPOKE, a student rst types an essay answering a qualitative physics problem. ITSPOKE then analyzes the essay and engages the student in spoken dialogue to correct misconceptions and to elicit complete explanations.</Paragraph>
    <Paragraph position="1"> First, the Why2-Atlas back-end parses the student essay into propositional representations, in order to nd useful dialogue topics. It uses 3 different approaches (symbolic, statistical and hybrid) competitively to create a representation for each sentence, then resolves temporal and nominal anaphora and constructs proofs using abductive reasoning (Jordan et al., 2004). During the dialogue, student speech is digitized from microphone input and sent to the Sphinx2 recognizer, whose stochastic language models have a vocabulary of 1240 words and are trained with 7720 student utterances from evaluations of Why2-Atlas and from pilot studies of IT-SPOKE. Sphinx2's best transcription (recognition output) is then sent to the Why2-Atlas back-end for syntactic, semantic and dialogue analysis. Finally, the text response produced by Why2-Atlas is sent to the Cepstral text-to-speech system and played to the student. After the dialogue, the student revises the essay, thereby ending the tutoring or causing another round of tutoring/essay revision.</Paragraph>
    <Paragraph position="2"> Our corpus of dialogues with ITSPOKE was collected from November 2003 - April 2004, as part of an evaluation comparing ITSPOKE, Why2-Atlas, and human tutoring (Litman et al., 2004). Subjects are University of Pittsburgh students who have never taken college physics, and who are native English speakers. Subjects rst read a small document of background physics material, then work through 5 problems (dialogues) with ITSPOKE. The corpus contains 100 dialogues (physics problems) from 20 subjects, with a total of 2445 student turns and 398 unique words. 15 dialogues have been annotated for emotion as described in Section 3. On average, our dialogues last 19.4 minutes and contain 25 student turns. While ITSPOKE's word error rate on this corpus is 31.2%, semantic accuracy is more useful for dialogue evaluation as it does not penalize for unimportant word errors. Semantic analysis based on speech recognition is the same as based on perfect transcription 92.4% of the time. An emotionannotated corpus example is shown in Figure 1.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Annotating Student Turns
</SectionTitle>
    <Paragraph position="0"> In our data, student emotions 1 can only be identi ed indirectly: via what is said and/or how it is 1We use the term emotion loosely to cover both affects and attitudes that can impact student learning.</Paragraph>
    <Paragraph position="1"> . . . dialogue excerpt at 18.3 min. into session. . .</Paragraph>
    <Paragraph position="2"> ITSPOKEa0a2a1 : What is the magnitude of the acceleration of the packet in the horizontal direction?</Paragraph>
    <Paragraph position="4"> vertical direction affect the motion of a body in a horizontal direction in a different scenario. Say an apple falls from a tree. What force(s) are acting on the apple as it falls?</Paragraph>
    <Paragraph position="6"> ing on the apple as it falls?</Paragraph>
    <Paragraph position="8"> apple as it falls to the earth is the earth's gravitational force. In which direction does gravitational force act?</Paragraph>
    <Paragraph position="10"> cuss a scheme for manually annotating student turns in a human-human tutoring dialogue corpus for intuitively perceived emotions.2 These emotions are viewed along a linear scale, shown and de ned as follows: negative a12a14a13 neutral a13a16a15 positive.</Paragraph>
    <Paragraph position="11"> Negative: a student turn that expresses emotions such as confused, bored, irritated. Evidence of a negative emotion can come from many knowledge sources such as lexical items (e.g., I don't know in studenta1a4a3 in Figure 1), and/or acoustic-prosodic features (e.g., prior-turn pausing in studenta1a17a3a19a18a16a0a8a9 ). Positive: a student turn expressing emotions such as con dent, enthusiastic. An example is studenta0a2a1 , which displays louder speech and faster tempo.</Paragraph>
    <Paragraph position="12"> Neutral: a student turn not expressing a negative or positive emotion. An example is studenta0a5a0 , where evidence comes from moderate loudness, pitch and tempo.</Paragraph>
    <Paragraph position="13"> We also distinguish Mixed: a student turn expressing both positive and negative emotions.</Paragraph>
    <Paragraph position="14"> To avoid in uencing the annotator's intuitive understanding of emotion expression, and because particular emotional cues are not used consistently 2Weak and strong expressions of emotions are annotated. or unambiguously across speakers, our annotation manual does not associate particular cues with particular emotion labels. Instead, it contains examples of labeled dialogue excerpts (as in Figure 1, except on human-human data) with links to corresponding audio les. The cues mentioned in the discussion of Figure 1 above were elicited during post-annotation discussion of the emotions, and are presented here for expository use only. (Litman and Forbes-Riley, 2004) further details our annotation scheme and discusses how it builds on related work.</Paragraph>
    <Paragraph position="15"> To analyze the reliability of the scheme on our new computer-human data, we selected 15 transcribed dialogues from the corpus described in Section 2, yielding a dataset of 333 student turns, where approximately 30 turns came from each of 10 subjects. The 333 turns were separately annotated by two annotators following the emotion annotation scheme described above.</Paragraph>
    <Paragraph position="16"> We focus here on three analyses of this data, itemized below. While the rst analysis provides the most ne-grained distinctions for triggering system adaptation, the second and third (simpli ed) analyses correspond to those used in (Lee et al., 2001) and (Batliner et al., 2000), respectively. These represent alternative potentially useful triggering mechanisms, and are worth exploring as they might be easier to annotate and/or predict.</Paragraph>
    <Paragraph position="17"> a20 Negative, Neutral, Positive (NPN): mixeds are con ated with neutrals.</Paragraph>
    <Paragraph position="18"> a20 Negative, Non-Negative (NnN): positives, mixeds, neutrals are con ated as nonnegatives. null a20 Emotional, Non-Emotional (EnE): negatives, positives, mixeds are con ated as Emotional; neutrals are Non-Emotional.</Paragraph>
    <Paragraph position="19"> Tables 1-3 provide a confusion matrix for each analysis summarizing inter-annotator agreement.</Paragraph>
    <Paragraph position="20"> The rows correspond to the labels assigned by annotator 1, and the columns correspond to the labels assigned by annotator 2. For example, the annotators agreed on 89 negatives in Table 1.</Paragraph>
    <Paragraph position="21"> In the NnN analysis, the two annotators agreed on the annotations of 259/333 turns achieving 77.8% agreement, with Kappa = 0.5. In the EnE analysis, the two annotators agreed on the annotations of 220/333 turns achieving 66.1% agreement, with Kappa = 0.3. In the NPN analysis, the two annotators agreed on the annotations of 202/333 turns achieving 60.7% agreement, with Kappa = 0.4. This inter-annotator agreement is on par with that of prior studies of emotion annotation in naturally occurring computer-human dialogues (e.g., agreement of 71% and Kappa of 0.47 in (Ang et al., 2002), Kappa of 0.45 and 0.48 in (Narayanan, 2002), and Kappa ranging between 0.32 and 0.42 in (Shafran et al., 2003)). A number of researchers have accommodated for this low agreement by exploring ways of achieving consensus between disagreed annotations, to yield 100% agreement (e.g (Ang et al., 2002; Devillers et al., 2003)). As in (Ang et al., 2002), we will experiment below with predicting emotions using both our agreed data and consensus-labeled data.</Paragraph>
    <Paragraph position="22"> negative non-negative  negative 89 36 non-negative 38 170 Table 1: NnN Analysis Confusion Matrix emotional non-emotional emotional 129 43 non-emotional 70 91 Table 2: EnE Analysis Confusion Matrix negative neutral positive negative 89 30 6 neutral 32 94 38 positive 6 19 19</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Extracting Features from Turns
</SectionTitle>
    <Paragraph position="0"> For each of the 333 student turns described above, we next extracted the set of features itemized in Figure 2, for use in the machine learning experiments described in Section 5.</Paragraph>
    <Paragraph position="1"> Motivated by previous studies of emotion prediction in spontaneous dialogues (Ang et al., 2002; Lee et al., 2001; Batliner et al., 2003), our acoustic-prosodic features represent knowledge of pitch, energy, duration, tempo and pausing. We further restrict our features to those that can be computed automatically and in real-time, since our goal is to use such features to trigger online adaptation in ITSPOKE based on predicted student emotions. F0 and RMS values, representing measures of pitch and loudness, respectively, are computed using Entropic Research Laboratory's pitch tracker, get f0, with no post-correction. Amount of Silence is approximated as the proportion of zero f0 frames for the turn. Turn Duration and Prior Pause Duration are computed  automatically via the start and end turn boundaries in ITSPOKE logs. Speaking Rate is automatically calculated as #syllables per second in the turn.</Paragraph>
    <Paragraph position="2"> While acoustic-prosodic features address how something is said, lexical features representing what is said have also been shown to be useful for predicting emotion in spontaneous dialogues (Lee et al., 2002; Ang et al., 2002; Batliner et al., 2003; Devillers et al., 2003; Shafran et al., 2003). Our rst set of lexical features represents the human transcription of each student turn as a word occurrence vector (indicating the lexical items that are present in the turn). This feature represents the ideal performance of ITSPOKE with respect to speech recognition. The second set represents ITSPOKE's actual best speech recognition hypothesis of what is said in each student turn, again as a word occurrence vector. null Finally, we recorded for each turn the 3 identi er features shown last in Figure 2. Prior studies (Oudeyer, 2002; Lee et al., 2002) have shown that subject and gender can play an important role in emotion recognition. Subject and problem are particularly important in our tutoring domain because students will use our system repeatedly, and problems are repeated across students.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Predicting Student Emotions
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.1 Feature Sets and Method
</SectionTitle>
      <Paragraph position="0"> We next created the 10 feature sets in Figure 3, to study the effects that various feature combinations had on predicting emotion. We compare an acoustic-prosodic feature set ( sp ), a human-transcribed lexical items feature set ( lex ) and an ITSPOKE-recognized lexical items feature set ( asr ). We further compare feature sets combining acoustic-prosodic and either transcribed or recognized lexical items ( sp+lex , sp+asr ). Finally, we compare each of these 5 feature sets with an identical set supplemented with our 3 identi er features ( +id ).</Paragraph>
      <Paragraph position="1"> sp: 12 acoustic-prosodic features lex: human-transcribed lexical items asr: ITSPOKE recognized lexical items sp+lex: combined sp and lex features sp+asr: combined sp and asr features +id: each above set + 3 identi er features  We use the Weka machine learning software (Witten and Frank, 1999) to automatically learn our emotion prediction models. In our human-human dialogue studies (Litman and Forbes, 2003), the use of boosted decision trees yielded the most robust performance across feature sets so we will continue their use here.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.2 Predicting Agreed Turns
</SectionTitle>
      <Paragraph position="0"> As in (Shafran et al., 2003; Lee et al., 2001), our rst study looks at the clearer cases of emotional turns, i.e. only those student turns where the two annotators agreed on an emotion label.</Paragraph>
      <Paragraph position="1"> Tables 4-6 show, for each emotion classi cation, the mean accuracy (%correct) and standard error (SE) for our 10 feature sets (Figure 3), computed across 10 runs of 10-fold cross-validation.3 For comparison, the accuracy of a standard baseline algorithm (MAJ), which always predicts the majority class, is shown in each caption. For example, Table 4's caption shows that for NnN, always predicting the majority class of non-negative yields an accuracy of 65.65%. In each table, the accuracies are labeled for how they compare statistically to the relevant baseline accuracy (a21 = worse, a22 = same, a23 = better), as automatically computed in Weka using a two-tailed t-test (p a24 .05).</Paragraph>
      <Paragraph position="2"> First note that almost every feature set significantly outperforms the majority class baseline, across all emotion classi cations; the only exceptions are the speech-only feature sets without identi er features ( sp-id ) in the NnN and EnE tables, which perform the same as the baseline. These results suggest that without any subject or task speci c information, acoustic-prosodic features alone 3For each cross-validation, the training and test data are drawn from utterances produced by the same set of speakers. A separate experiment showed that testing on one speaker and training on the others, averaged across all speakers, does not signi cantly change the results.</Paragraph>
      <Paragraph position="3"> are not useful predictors for our two binary classication tasks, at least in our computer-human dialogue corpus. As will be discussed in Section 6, however, sp-id feature sets are useful predictors in human-human tutoring dialogues.</Paragraph>
      <Paragraph position="4">  Further note that adding identi er features to the -id feature sets almost always improves performance, although this difference is not always signi cant4; across tables the +id feature sets out-perform their -id counterparts across all feature sets and emotion classi cations except one (NnN asr ). Surprisingly, while (Lee et al., 2002) found it useful to develop separate gender-based emotion prediction models, in our experiment, gender is the only identi er that does not appear in any learned model. Also note that with the addition of identi er features, the speech-only feature sets (sp+id) now do outperform the majority class baselines for all three emotion classi cations.</Paragraph>
      <Paragraph position="5"> 4For any feature set, the mean +/- 2*SE = the 95% condence interval. If the con dence intervals for two feature sets are non-overlapping, then their mean accuracies are signi cantly different with 95% con dence.</Paragraph>
      <Paragraph position="6"> With respect to the relative utility of lexical versus acoustic-prosodic features, without identi er features, using only lexical features ( lex or asr ) almost always produces statistically better performance than using only speech features ( sp ); the only exception is NPN lex , which performs statistically the same as NPN sp . This is consistent with others' ndings, e.g., (Lee et al., 2002; Shafran et al., 2003). When identi er features are added to both, the lexical sets don't always signi cantly outperform the speech set; only in NPN and EnE lex+id is this the case. For NnN, just as using sp+id rather than sp-id improved performance when compared to the majority baseline, the addition of the identi er features also improves the utility of the speech features when compared to the lexical features.</Paragraph>
      <Paragraph position="7"> Interestingly, although we hypothesized that the lex feature sets would present an upper bound on the performance of the asr sets, because the human transcription is more accurate than the speech recognizer, we see that this is not consistently the case. In fact, in the -id sets, asr always signi cantly outperforms lex . A comparison of the decision trees produced in either case, however, does not reveal why this is the case; words chosen as predictors are not very intuitive in either case (e.g., for NnN, an example path through the learned lex decision tree says predict negative if the utterance contains the word will but does not contain the word decrease). Understanding this result is an area for future research. Within the +id sets, we see that lex and asr perform the same in the NnN and NPN classi cations; in EnE lex+id signi cantly outperforms asr+id . The utility of the lex features compared to asr also increases when combined with the sp features (with and without identi ers), for both NnN and NPN.</Paragraph>
      <Paragraph position="8"> Moreover, based on results in (Lee et al., 2002; Ang et al., 2002; Forbes-Riley and Litman, 2004), we hypothesized that combining speech and lexical features would result in better performance than either feature set alone. We instead found that the relative performance of these sets depends both on the emotion classi cation being predicted and the presence or absence of id features. Although consistently with prior research we nd that the combined feature sets usually outperform the speech-only feature sets, the combined feature sets frequently perform worse than the lexical-only feature sets. However, we will see in Section 6 that combining knowledge sources does improve prediction performance in human-human dialogues.</Paragraph>
      <Paragraph position="9"> Finally, the bolded accuracies in each table summarize the best-performing feature sets with and without identi ers, with respect to both the %Corr gures shown in the tables, as well as to relative improvement in error reduction over the baseline (MAJ) error5, after excluding all the feature sets containing lex features. In this way we give a better estimate of the best performance our system could accomplish, given the features it can currently access from among those discussed. These best-performing feature sets yield relative improvements over their majority baseline errors ranging from 1936%. Moreover, although the NPN classi cation yields the lowest raw accuracies, it yields the highest relative improvement over its baseline.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
5.3 Predicting Consensus Turns
</SectionTitle>
      <Paragraph position="0"> Following (Ang et al., 2002; Devillers et al., 2003), we also explored consensus labeling, both with the goal of increasing our usable data set for prediction, and to include the more dif cult annotation cases. For our consensus labeling, the original annotators revisited each originally disagreed case, and through discussion, sought a consensus label.</Paragraph>
      <Paragraph position="1"> Due to consensus labeling, agreement rose across all three emotion classi cations to 100%. Tables 79 show, for each emotion classi cation, the mean accuracy (%correct) and standard error (SE) for our  A comparison with Tables 4-6 shows that overall, using consensus-labeled data decreased the performance across all feature sets and emotion classi cations. This was also found in (Ang et al., 2002).</Paragraph>
      <Paragraph position="2"> Moreover, it is no longer the case that every feature  set performs as well as or better than their baselines6; within the -id sets, NnN sp and EnE lex perform signi cantly worse than their baselines. However, again we see that the +id sets do consistently better than the -id sets and moreover always outperform the baselines.</Paragraph>
      <Paragraph position="3"> We also see again that using only lexical features almost always yields better performance than using only speech features. In addition, we again see that the lex feature sets perform comparably to the asr feature sets, rather than outperforming them as we rst hypothesized. And nally, we see again that while in most cases combining speech and lexical features yields better performance than using only speech features, the combined feature sets in most cases perform the same or worse than the lexical feature sets. As above, the bolded accuracies summarize the best-performing feature sets from each emotion classi cation, after excluding all the feature sets containing lex to give a better estimate of actual system performance. The best-performing feature sets in the consensus data yield an 11%-19% relative improvement in error reduction compared to the majority class prediction, which is a lower error reduction than seen for agreed data. Moreover, the NPN classi cation yields the lowest accuracies and the lowest improvements over its baseline.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Comparison with Human Tutoring
</SectionTitle>
    <Paragraph position="0"> While building ITSPOKE, we collected a corresponding corpus of spoken human tutoring dialogues, using the same experimental methodology as for our computer tutoring corpus (e.g. same sub-ject pool, physics problems, web and audio interface, etc); the only difference between the two corpora is whether the tutor is human or computer.</Paragraph>
    <Paragraph position="1"> As discussed in (Forbes-Riley and Litman, 2004), two annotators had previously labeled 453 turns in this corpus with the emotion annotation scheme discussed in Section 3, and performed a preliminary set of machine learning experiments (different from those reported above). Here, we perform the exper- null iments from Section 5.2 on this annotated human tutoring data, as a step towards understand the differences between annotating and predicting emotion in human versus computer tutoring dialogues.</Paragraph>
    <Paragraph position="2"> With respect to inter-annotator agreement, in the NnN analysis, the two annotators had 88.96% agreement (Kappa = 0.74). In the EnE analysis, the annotators had 77.26% agreement (Kappa = 0.55).</Paragraph>
    <Paragraph position="3"> In the NPN analysis, the annotators had 75.06% agreement (Kappa = 0.60). A comparison with the results in Section 3 shows that all of these gures are higher than their computer tutoring counterparts.</Paragraph>
    <Paragraph position="4"> With respect to predictive accuracy, Table 10 shows our results for the agreed data. A comparison with Tables 4-6 shows that overall, the human-human data yields increased performance across all feature sets and emotion classi cations, although it should be noted that the human-human corpus is over 100 turns larger than the computer-human corpus. Every feature set performs signi cantly better than their baselines. However, unlike the computer-human data, we don't see the +id sets performing better than the -id sets; rather, both sets perform about the same. We do see again the lex sets yielding better performance than the sp sets.</Paragraph>
    <Paragraph position="5"> However, we now see that in 5 out of 6 cases, combining speech and lexical features yields better performance than using either sp or lex alone. Finally, these feature sets yield a relative error reduction of 42.45%-77.33% compared to the majority class predictions, which is far better than in our computer tutoring experiments. Moreover, the EnE classi cation yields the highest raw accuracies and relative improvements over baseline error.</Paragraph>
    <Paragraph position="6"> We hypothesize that such differences arise in part due to differences between the two corpora: 1) student turns with the computer tutor are much shorter than with the human tutor (and thus contain less emotional content - making both annotation and prediction more dif cult), 2) students respond to the computer tutor differently and perhaps more idiosyncratically than to the human tutor, 3) the computer tutor is less exible than the human tutor (allowing little student initiative, questions, groundings, contextual references, etc.), which also effects student emotional response and its expression.</Paragraph>
  </Section>
class="xml-element"></Paper>