<?xml version="1.0" standalone="yes"?> <Paper uid="W04-2326"> <Title>Annotating Student Emotional States in Spoken Tutoring Dialogues</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Prior Research on Emotion </SectionTitle> <Paragraph position="0"> Developing a descriptive theory of emotion is a complex research topic, viewed from either a theoretical or an empirical standpoint (Cowie et al., 2001). Some researchers have proposed a variety of fundamental human emotions, while others have argued that emotions are best represented componentially, in terms of multiple dimensions. Despite this lack of a well-defined descriptive framework, there has been great recent interest in predicting emotional states, using information extracted from a person's text, speech, physiology, facial expressions, eye gaze, etc. (Pantic and Rothkrantz, 2003). In the area of emotional speech, most research has used databases of speech read by actors or native speakers as training data for developing emotion predictors (Holzapfel et al., 2002; Liscombe et al., 2003). In this work the set of emotions to be read is predefined before the utterance is spoken, rather than annotated after the fact. One problem with this approach is that such prototypical emotional speech does not necessarily reflect natural speech (Batliner et al., 2003): the way one acts an emotion is not necessarily the same as the way one naturally expresses it. Moreover, actors repeatedly reading the same sentence are restricted to conveying different emotions using only acoustic and prosodic features, while in natural interactions a much wider variety of features is available (e.g., lexical, dialogue). As a result of these problems, researchers motivated by spoken dialogue applications have instead started to train emotion predictors using naturally-occurring speech that has been hand-annotated for various emotions (Ang et al., 2002; Batliner et al., 2003; Lee et al., 2001; Litman and Forbes, 2003). However, this requires researchers to first develop a scheme for annotating emotions in naturally-occurring spoken dialogue corpora. Although emotion annotation of natural corpora (typically at the turn or utterance level) has been addressed in various domains, little has yet been done in the educational setting, despite growing interest in spoken dialogue tutors (e.g., (Litman and Forbes, 2003)). Although not yet tested, (Evens, 2002) has hypothesized adaptive strategies; for example, upon detecting frustration, a system should respond to hedges and self-deprecation by supplying praise and restructuring the problem. A comparison of our annotation scheme and prior non-tutoring schemes is presented in Section 4.4.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The ITSPOKE System and Corpora </SectionTitle> <Paragraph position="0"> In ITSPOKE, a student types an essay answering a qualitative physics problem. The ITSPOKE computer tutor then engages the student in spoken dialogue to correct misconceptions and elicit more complete explanations, after which the student revises the essay, thereby ending the tutoring or causing another round of tutoring/essay revision. Student speech is digitized from microphone input and sent to the Sphinx2 recognizer, whose most probable transcription output is then sent to the Why2-Atlas back-end for syntactic, semantic and dialogue analysis. The text response produced by Why2-Atlas is sent to the Cepstral text-to-speech system.</Paragraph>
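As a rough illustration of this processing loop, the sketch below chains the three stages of one tutoring exchange. All function names and stub bodies are hypothetical glue: Sphinx2, Why2-Atlas, and Cepstral each expose their own real interfaces, which are not reproduced here.

```python
# Minimal sketch of one ITSPOKE turn, assuming hypothetical wrappers around
# the real components (Sphinx2 ASR, Why2-Atlas analysis, Cepstral TTS).

def sphinx2_recognize(audio: bytes) -> str:
    """Return the most probable transcription of the student's speech."""
    raise NotImplementedError  # stands in for the Sphinx2 decoder

def why2_atlas_respond(transcription: str) -> str:
    """Syntactic, semantic, and dialogue analysis; returns the tutor's text."""
    raise NotImplementedError  # stands in for the Why2-Atlas back-end

def cepstral_synthesize(text: str) -> bytes:
    """Convert the tutor's text response to speech."""
    raise NotImplementedError  # stands in for the Cepstral TTS engine

def handle_student_turn(audio: bytes) -> bytes:
    transcription = sphinx2_recognize(audio)
    tutor_text = why2_atlas_respond(transcription)
    return cepstral_synthesize(tutor_text)
```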
<Paragraph position="1"> A formal evaluation of ITSPOKE began in November 2003; to date we have collected 50 dialogues from 10 students. A corpus example is shown in Figure 4, Appendix A. Corpus collection uses the same experimental procedure as our human-human tutoring corpus, described next.</Paragraph> <Paragraph position="2"> Our Human-Human Spoken Dialogue Tutoring Corpus contains spoken dialogues collected via a web interface supplemented with a high-quality audio link, where the human tutor performs the same task as ITSPOKE. The experimental procedure for collecting both corpora is as follows: 1) students are given a pre-test measuring their physics knowledge, 2) students read through a small document of background material, 3) students use the web and voice interface to work through a set of training problems (dialogues) with the tutor, and 4) students are given a post-test that is similar to the pre-test. Subjects are University of Pittsburgh students who have never taken college physics and who are native English speakers. One tutor currently participates. To date we have collected 149 dialogues from 17 students. Annotated (see Section 4) corpus examples are shown in Figure 1 and Figure 2 (Appendix A) (punctuation added for clarity).</Paragraph> <Paragraph position="3"> . . . dialogue excerpt at 5.2 minutes into session . . .
TUTOR: Suppose you apply equal force by pushing them. Then uh what will happen to their motion?
STUDENT: Um the one that's heavier... uh, the acc- acceleration won't be as great. (NEGATIVE, UNCERTAIN)
TUTOR: . . . if the mass is more and force is the same then which one will accelerate more?
. . .
TUTOR: uh you are applying Newton's law of uh second law of motion: F is equal to M times A. And uh you apply equal force on both the containers, then the one which is less massive will accelerate more.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Annotation Scheme </SectionTitle> <Paragraph position="0"> In our spoken dialogue tutoring corpora, student emotional states can only be identified indirectly, via what a student says and/or how s/he says it. Furthermore, such evidence is not always obvious, unambiguous, or consistent. For example, a student may express anger through the use of swear words, or through a particular tone of voice, or via a combination of signals, or not at all.</Paragraph>
<Paragraph position="1"> Moreover, another student may present some of these same signals even when s/he does not feel anger. Our objective is nevertheless to develop an annotation scheme that is reliable across annotators, for manually labeling the student turns in our spoken tutoring dialogues for perceived expressions of emotion.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Emotion Classes </SectionTitle> <Paragraph position="0"> In our current annotation scheme, perceived expressions of emotion are viewed along a linear scale, as shown and defined below: negative ↔ neutral ↔ positive</Paragraph> <Paragraph position="1"> Negative: a student turn that strongly expresses emotions such as confused, bored, irritated, uncertain, sad. Examples in Figure 1 include the student turn annotated (NEGATIVE, UNCERTAIN). Evidence for the negative emotions in these turns (determined in post-annotation discussion; see Section 4.4) includes syntax (constructions such as questions), disfluencies, and acoustic-prosodic features.</Paragraph> <Paragraph position="2"> Positive: a student turn that strongly expresses emotions such as confident, enthusiastic. Figure 1 contains an example in which evidence of a positive emotion comes primarily from acoustic-prosodic features.</Paragraph> <Paragraph position="3"> Neutral: a student turn not strongly expressing a negative or positive emotion.</Paragraph> <Paragraph position="4"> In addition to these three main emotion classes, we also distinguish three minor emotion classes:</Paragraph> <Paragraph position="5"> Weak Negative: a student turn that weakly expresses negative emotions.</Paragraph> <Paragraph position="6"> Weak Positive: a student turn that weakly expresses positive emotions. Figure 1 contains an example in which the evidence is primarily lexical ("right").</Paragraph> <Paragraph position="7"> Mixed: a student turn that strongly expresses both positive and negative emotions: Case 1) multi-utterance turns where one utterance is judged positive and another, negative; Case 2) turns where the simultaneous strong expression of negative and positive emotions is perceived. Case 2 is often due to conflicting domains (Section 4.2), e.g. boredom with tutoring but confidence about physics.</Paragraph> </Section>
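For concreteness, the full label set can be written down as a small vocabulary, as in the illustrative sketch below. The class names and glosses come from the scheme just described; the code structure itself is only an assumption, not part of the published annotation manual.

```python
from enum import Enum

class EmotionClass(Enum):
    """The six emotion classes of the annotation scheme (Section 4.1)."""
    NEGATIVE = "negative"            # strong: confused, bored, irritated, uncertain, sad
    WEAK_NEGATIVE = "weak negative"  # weakly expressed negative emotions
    NEUTRAL = "neutral"              # no strong negative or positive expression
    WEAK_POSITIVE = "weak positive"  # weakly expressed positive emotions
    POSITIVE = "positive"            # strong: confident, enthusiastic
    MIXED = "mixed"                  # strong negative and positive in one turn
```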
<Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Relativity and Domains of Emotion Classes </SectionTitle> <Paragraph position="0"> Our emotion annotation is relative to both context and task. By context-relative we mean that a student turn in our tutoring dialogues is identified as expressing emotion relative to the other student turns in that dialogue. By task-relative we mean that a student turn perceived during tutoring as expressing an emotion might not be perceived as expressing the same emotion with the same strength in another situation. For example, consider the context of a tutoring session, where a student has been answering tutor questions with apparent ease. If the tutor then asks another question, and the student responds slowly, saying "Um, now I'm confused", this turn would likely be labeled negative. However, in the context of a heated argument between two people, this same turn might be labeled as a weak negative, or even weak positive.</Paragraph> <Paragraph position="1"> We also annotate emotion with respect to multiple domains. One focus of our annotation scheme is expressions of emotion that pertain to the physics material being learned (PHYS domain). For example, a student may express confusion or confidence about the physics material. Another focus of our scheme is expressions of emotion that pertain to the tutoring process, including attitudes towards the tutor, the dialogue, and/or being tutored (TUT domain). For example, a student may express boredom or amusement with the tutoring.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Specific Annotation Instructions </SectionTitle> <Paragraph position="0"> Our annotation scheme is detailed in an online, audio-enhanced emotion labeling manual. As shown in Figure 3 (Appendix A), the emotion annotation is performed using (our customization of) Wavesurfer, an open source sound visualization and manipulation tool. The Tutor Speech and Student Speech panes show a portion of the tutor and student speech files, while the Tutor Text and Student Text panes show the associated transcriptions, where vertical lines correspond to turn segmentations. (Transcription and turn-segmentation of the human-human dialogues were also done within Wavesurfer, by a paid transcriber, prior to emotion annotation.) There are three additional panes for emotion annotation, described below.</Paragraph> <Paragraph position="1"> The EMOa pane records the annotator's judgment of the expressed emotion class for each turn, i.e. one of the six emotion classes described in Section 4.1: negative, weak negative, neutral, weak positive, positive, mixed. Annotators are instructed to focus on expressed emotions in the PHYS domain. If an additional expressed emotion in the TUT domain is perceived, this is noted in the NOTES pane (e.g. "amused/TUT"). If no expressed emotion is perceived in the PHYS domain, any expressed emotion in the TUT domain is labeled in the EMOa pane, and noted (e.g. "TUT") in the NOTES pane. Domain indecision is also noted (e.g. "TUT/PHYS?") in the NOTES pane.</Paragraph> <Paragraph position="2"> The EMOb pane further specifies the annotations in the EMOa pane, by recording a specific expressed emotion for each turn. Our current list of specific emotions contains those that we believe will be useful for triggering ITSPOKE adaptation. Specific negative emotions are: uncertain, confused, sad, bored, irritated. Specific positive emotions are: confident, enthusiastic. Our manual includes glosses for these specific emotions, formulated using synonyms and/or hyponyms that are currently not distinguished. For example, our gloss for enthusiastic includes interested, pleased, amused. There are also complex labels combining multiple specific emotions within a class (e.g. uncertain+sad, confident+enthusiastic). If the annotator judges a specific emotion that is not listed (or lacks a close substitute), s/he selects the label other, and lists the alternative(s) in the NOTES pane. If the annotator selected mixed (case 1) in the EMOa pane, s/he subdivides the turn into utterances in the EMOb pane and provides a specific emotion label for each utterance. If the annotator selected mixed (case 2) in the EMOa pane, s/he selects the label other in the EMOb pane, and comments on the indecision in the NOTES pane.</Paragraph> <Paragraph position="3"> The NOTES pane records any additional annotator comments concerning their judgment, the annotation, etc.</Paragraph>
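A single turn's annotation, as recorded across the three panes, can be pictured as a small record. The field names below mirror the pane names; the data structure itself is only an illustrative assumption about how such labels might be stored.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class TurnAnnotation:
    """Illustrative record of one student turn's labels (EMOa, EMOb, NOTES)."""
    emo_a: str                   # one of the six classes, e.g. "negative"
    emo_b: List[str]             # specific emotion(s), e.g. ["uncertain", "sad"]
    notes: Optional[str] = None  # free-form comments, e.g. "amused/TUT" or "TUT/PHYS?"

# Example mirroring the annotated student turn in the Figure 1 excerpt:
example = TurnAnnotation(emo_a="negative", emo_b=["uncertain"])
```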
<Paragraph position="4"> Because our annotation is student-, context-, and task-specific, our manual first instructs the annotator to listen to each dialogue at least once before annotating, to secure an intuition of how and with what range emotional expression is displayed. S/he is also instructed not to assume that all dialogues will begin with neutral student turns. S/he is, however, reminded that it is not necessary to assign a non-neutral label to every turn. Finally, s/he is told to ignore correctness when annotating, because a correct answer to a tutor question can express uncertainty, and an incorrect answer can express confidence.</Paragraph> <Paragraph position="5"> Our manual also describes two default conventions for our annotation scheme, which can however be overridden by the annotator's intuitive judgment and/or other extenuating considerations (e.g. irony), as described below (and sketched as code after the list): 1) By definition, a question expresses strong uncertainty or confusion. Thus if a student turn consists only of a question, its default label is negative. However: a) If the turn consists of multiple utterances, one of which is a question, and the other(s) expresses a positive emotion, then the turn should be labeled mixed and sub-divided (e.g. "What directions are the forces acting in? Gravity is only acting in the down direction"). b) The domain must be considered. For example, defaults in one domain can be overridden if the turn expresses a contrasting emotion in the other domain. 2) Many student turns in our dialogues are very short, containing only grounding phrases such as "yeah", "ok", "mm-hm", "uh-huh", etc. By default, such turns are labeled neutral, because groundings serve mainly to encourage another speaker to continue speaking. However: a) Groundings may occasionally strongly express an emotion (e.g. "yeah!", "(sigh) ok"), thereby overriding the default label. b) The semantics of certain groundings (e.g. "right" and "sure") is associated with weakly expressed understanding; such turns default to weak positive. c) Certain phrases are associated with strongly expressed uncertainty or confusion (e.g. "um (silence)"), and default to negative.</Paragraph>
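The sketch below renders these two defaults and their exceptions as a simple rule cascade. The turn representation and helper predicates are hypothetical, and, as stated above, in the actual scheme the annotator's intuitive judgment can always override a default.

```python
# Illustrative rule cascade for the two default conventions above.
# `is_question` and `is_grounding` are assumed helper predicates, not part of
# the published scheme; annotator judgment overrides all of these defaults.

WEAK_POSITIVE_GROUNDINGS = {"right", "sure"}  # convention 2b
UNCERTAIN_PHRASES = {"um (silence)"}          # convention 2c

def default_label(turn_text: str, is_question: bool, is_grounding: bool) -> str:
    text = turn_text.strip().lower()
    if text in UNCERTAIN_PHRASES:
        return "negative"        # strongly expressed uncertainty or confusion
    if is_question:
        return "negative"        # convention 1: a question expresses uncertainty
    if is_grounding:
        if text in WEAK_POSITIVE_GROUNDINGS:
            return "weak positive"
        return "neutral"         # convention 2: groundings default to neutral
    return "neutral"             # no default applies
```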
<Paragraph position="6"> Our annotation manual concludes with 8 examples of annotated student turns (as in Figure 1), with links to corresponding audio files. The variety exemplifies how different students express emotions differently at different points in the dialogue, and the examples cover all 6 emotion labels at least once (there are 2 negatives and 2 positives). Also provided is a lengthy audio-enhanced transcript from a single student tutoring dialogue, to exemplify how student emotion changes throughout a single tutoring session. This transcript is shown in part in Figure 2, Appendix A. The transcript is organized in terms of tutor and student turn start and end times. For each student turn, the four Wavesurfer panes are shown.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 Comparison with Prior Schemes </SectionTitle> <Paragraph position="0"> Studies of actor-read speech often make a large number of emotion distinctions, e.g. the LDC Emotional Prosody corpus distinguishes 15 classes. Our work, like other studies of naturally occurring dialogues, uses a more restricted set of emotions, due to the need to first manually annotate such emotions reliably across annotators. As discussed above, our annotation scheme distinguishes negative, neutral, and positive emotions, as well as weak and mixed classes. Other studies of naturally occurring data have annotated only two emotion classes (e.g. emotional/non-emotional (Batliner et al., 2000), negative/non-negative (Lee et al., 2001)). The study of (Ang et al., 2002) annotates six emotion classes, but collapses most of these for the purposes of emotion prediction. ((Ang et al., 2002) also discusses the use of an uncertainty label, although it did not improve inter-annotator agreement; our weak labels are more similar to an intensity dimension found in studies of elicited speech (see (Cowie et al., 2001)).) In Section 5, we will similarly explore the impact of collapsing some of our 6 distinctions, to produce simpler 3-way (negative/positive/neutral) and 2-way (negative/non-negative and emotional/non-emotional) schemes.</Paragraph> <Paragraph position="1"> In further contrast to (Lee et al., 2001), our annotations are context- and task-relative, because like (Ang et al., 2002; Batliner et al., 2003), we are interested in detecting emotional changes across our dialogues. But unlike (Batliner et al., 2003), we allow annotators to be guided by their intuition rather than a set of expected features, to avoid restricting or otherwise influencing their intuitive understanding of emotion expression, and because such features are not used consistently or unambiguously across speakers. Instead, our manual contains annotated audio-enhanced corpus examples (as in Figures 1-2).</Paragraph> </Section> <SectionTitle> 5 Analysis of the Annotation Scheme </SectionTitle> <Paragraph position="0"> Given our complete annotation scheme in Section 4, we now explore both the reliability of the scheme at three levels of granularity that have been proposed in prior work, and the accuracy of automatically predicting these variations. These analyses give insight into the tradeoff between interannotator reliability, annotation granularity, and predictive accuracy.</Paragraph> <Paragraph position="1"> For the purposes of these analyses, we randomly selected 10 transcribed and turn-annotated dialogues from our human-human tutoring corpus (Section 3), yielding 453 student turns from 9 subjects. The turns were separately annotated by two annotators, using the emotion annotation instructions in Section 4. For our machine-learning experiments we follow the methodology in (Litman and Forbes, 2003), instantiated with the learning method (boosted decision trees) and feature set (acoustic-prosodic, lexical, dialogue and contextual) that has given us our best results in ongoing studies.</Paragraph>
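As a rough picture of this learning setup, the sketch below trains boosted decision trees under repeated cross-validation. It is only a stand-in: the actual experiments used their own learner and feature extraction, whereas this sketch assumes scikit-learn (version 1.2 or later, for the `estimator` parameter) and hypothetical feature/label arrays `X` and `y`.

```python
# Illustrative stand-in for the boosted decision tree experiments; the paper's
# own learner and features differ, and X / y here are hypothetical inputs.
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def evaluate(X: np.ndarray, y: np.ndarray, runs: int = 10) -> float:
    """Approximate 10 x 10 cross-validation: ten 10-fold CVs, averaged."""
    model = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3))
    scores = []
    for seed in range(runs):
        # shuffle differently on each run to vary the fold assignment
        order = np.random.default_rng(seed).permutation(len(y))
        scores.extend(cross_val_score(model, X[order], y[order], cv=10))
    return float(np.mean(scores))
```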
<Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Agreed Student Turns </SectionTitle> <Paragraph position="0"> Conflating Minor and Neutral Classes. For our first analysis, only our three main emotion classes were distinguished: negative, neutral, positive. Our three minor classes, weak negative, mixed, weak positive, were conflated with the neutral class. A confusion matrix summarizing the resulting inter-annotator agreement is shown in Table 1. The rows correspond to the labels assigned by annotator 1, and the columns correspond to the labels assigned by annotator 2. For example, 90 negatives were agreed upon by both annotators, while 6 negatives assigned by annotator 1 were labeled as neutral by annotator 2. [Table 1: confusion matrix over negative, neutral, and positive for the two annotators; 385 of the 453 turns lie on the diagonal.] The two annotators agreed on the annotations of 385/453 turns, achieving 84.99% agreement (Kappa = 0.68 (Carletta, 1996)). Such agreement is expected given the difficulty of the task, and exceeds that of prior studies of emotion annotation in naturally occurring speech; (Ang et al., 2002), for example, achieved agreement of 71% (Kappa 0.47), while (Lee et al., 2001) averaged around 70% agreement.</Paragraph> <Paragraph position="1"> As in (Lee et al., 2001), we next performed a machine learning experiment on the 385 student turns where the two annotators agreed on the emotion label. Our predictive accuracy for this data was 84.75% (using 10 x 10 cross-validation as in (Litman and Forbes, 2003)). Compared to a baseline accuracy of 72.74% achieved by always predicting the majority (neutral) class, our result yields a relative improvement of 44.06%. (Relative improvement is computed as (error(baseline) - error(x)) / error(baseline), where error(x) is 100 - %accuracy(x).)</Paragraph>
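The quantities reported here are easy to reproduce from a confusion matrix. The sketch below computes raw agreement, Cohen's Kappa, and the relative improvement just defined; the matrix argument is hypothetical, but the final call reproduces the 44.06% figure from the reported accuracies.

```python
import numpy as np

def agreement_and_kappa(m: np.ndarray) -> tuple[float, float]:
    """Raw agreement and Cohen's Kappa from a square matrix of label counts."""
    n = m.sum()
    p_observed = np.trace(m) / n
    # chance agreement from the two annotators' marginal label distributions
    p_expected = (m.sum(axis=0) / n) @ (m.sum(axis=1) / n)
    return p_observed, (p_observed - p_expected) / (1 - p_expected)

def relative_improvement(accuracy: float, baseline: float) -> float:
    """Relative error reduction over the majority-class baseline (see above)."""
    error = lambda acc: 100.0 - acc
    return (error(baseline) - error(accuracy)) / error(baseline)

print(relative_improvement(84.75, 72.74))  # ~0.4406, matching the reported 44.06%
```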
<Paragraph position="2"> Conflating Weak and Negative/Positive Classes. In a second analysis, we again distinguished only our three main emotion classes; however, this time weak negative was conflated with negative, and weak positive was conflated with positive. Our mixed class was again conflated with neutral. A confusion matrix summarizing the resulting inter-annotator agreement is shown in Table 2. As shown, although the number of agreed negative and positive turns increased, overall interannotator agreement decreased to 340/453 turns, or 75.06% (Kappa = 0.60).</Paragraph> <Paragraph position="3"> We performed our machine learning experiment on these 340 agreed student turns. The predictive accuracy for this data decreased to 79.29%; however, baseline (majority class) accuracy also decreased to 53.24%; thus relative improvement increased, to 55.71%.</Paragraph> </Section> <Section position="7" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Negative/Non-Negative Classes </SectionTitle> <Paragraph position="0"> As Tables 1-2 indicate, our annotators found the positive class the most difficult to annotate and agree upon, and the positive class was also the least frequent class overall. Not surprisingly, our prior machine learning experiments have also shown that the positive class is the hardest to predict (Litman and Forbes, 2003). We thus next explored a binary analysis where our positive and neutral classes are conflated, yielding a negative/non-negative distinction akin to (Lee et al., 2001). Again, however, we experimented with conflating our minor weak classes with either the neutral class or their main class counterparts (e.g. weak negative → negative).</Paragraph> <Paragraph position="1"> Two confusion matrices summarizing the resulting inter-annotator agreements are shown in Tables 3-4. In Table 3, our three minor classes are conflated with the neutral class. Interannotator agreement in this case rises sharply to 420/453 turns, or 92.72% (Kappa = 0.80). The predictive accuracy for this data increased to 86.83%; however, baseline (majority class) accuracy also increased to 78.57%; thus relative improvement in fact decreased, to 38.54%. In Table 4, our two weak classes are conflated with their main class counterparts. Interannotator agreement only rises to 403/453 turns, or 88.96% (Kappa = 0.74). Predictive accuracy decreases to 82.94%. However, baseline (majority class) accuracy also decreases to 72.21%; thus relative improvement was comparable, at 38.61%.</Paragraph> </Section> <Section position="8" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Emotional/Non-Emotional Classes </SectionTitle> <Paragraph position="0"> We also explored an alternative binary analysis that conflated our positive and negative classes, yielding an emotional/non-emotional distinction, akin to (Batliner et al., 2000). Again we conflated our minor weak classes with either the neutral class or their main class counterparts, as shown in Tables 5-6. In Table 5, our three minor classes are conflated with the neutral class, yielding agreement on 389/453 turns, or 85.87% (Kappa = 0.67). The predictive accuracy was high at 85.07%, while baseline (majority) accuracy was 71.98%; thus relative improvement was 46.72%. In Table 6, weak classes are conflated with their main class counterparts. Interannotator agreement decreases to 350/453 turns, or 77.26% (Kappa = 0.55). Predictive accuracy was high at 86.14%; moreover, baseline (majority) accuracy was the lowest yet seen, 51.71%, and relative improvement was the best yet seen, at 71.30%.</Paragraph> </Section> <Section position="9" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Summary </SectionTitle> <Paragraph position="0"> A summary of our results across analyses of agreed student turns is shown in Table 7. NPN represents analyses distinguishing negative, neutral and positive emotions, NnN represents negative/non-negative analyses, and EnE represents emotional/non-emotional analyses. Column K shows Kappa for each analysis, Acc shows the predictive accuracy achieved by machine learning, Base shows the baseline (majority class) accuracy, and RI shows the relative improvement achieved by learning compared with this baseline.</Paragraph> <Paragraph position="1"> Table 7 (values as reported above):
  Analysis          K     Acc     Base    RI
  NPN (minor→neu)   0.68  84.75%  72.74%  44.06%
  NnN (minor→neu)   0.80  86.83%  78.57%  38.54%
  EnE (minor→neu)   0.67  85.07%  71.98%  46.72%
  NPN (weak→main)   0.60  79.29%  53.24%  55.71%
  NnN (weak→main)   0.74  82.94%  72.21%  38.61%
  EnE (weak→main)   0.55  86.14%  51.71%  71.30%</Paragraph> <Paragraph position="2"> As can be seen, there is no single optimal way to conflate the original 6 classes; optimality depends on whether maximizing Kappa, predictive accuracy, or expressiveness is most important. For example, conflating minor and neutral labels (the first three rows) yields better annotation reliability than conflating weak and main labels (the last three rows); the reverse is true, however, for machine learning performance (measured by relative improvement over the majority class baseline). With respect to expressiveness, only the 3-way NPN distinction can explicitly distinguish positive emotions. With respect to the binary distinctions, annotating negative/non-negative (NnN) can be done most reliably, while predicting emotional/non-emotional (EnE) yields a better relative improvement.</Paragraph> </Section>
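The six analyses in Table 7 amount to two label-collapsing maps composed with three task definitions. A compact sketch follows; the function names are hypothetical, but the mappings restate the conflations described above.

```python
# Sketch of the six analyses: first collapse the six labels
# (minor -> neutral, or weak -> main), then binarize for NnN or EnE.

MINOR_TO_NEUTRAL = {
    "negative": "negative", "weak negative": "neutral", "mixed": "neutral",
    "neutral": "neutral", "weak positive": "neutral", "positive": "positive",
}
WEAK_TO_MAIN = {
    "negative": "negative", "weak negative": "negative", "mixed": "neutral",
    "neutral": "neutral", "weak positive": "positive", "positive": "positive",
}

def npn(label: str, scheme: dict) -> str:  # negative / neutral / positive
    return scheme[label]

def nnn(label: str, scheme: dict) -> str:  # negative / non-negative
    return "negative" if scheme[label] == "negative" else "non-negative"

def ene(label: str, scheme: dict) -> str:  # emotional / non-emotional
    return "non-emotional" if scheme[label] == "neutral" else "emotional"

assert nnn("weak positive", MINOR_TO_NEUTRAL) == "non-negative"
assert ene("weak negative", WEAK_TO_MAIN) == "emotional"
```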
<Section position="10" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Consensus-Labeled Student Turns </SectionTitle> <Paragraph position="0"> Following (Ang et al., 2002), we also explored consensus labeling, both to increase our usable data set for prediction, and to include the more difficult annotation cases. For consensus labeling, the original annotators revisited each originally disagreed case, and through discussion, sought a consensus label. Agreement thus rose across all analyses, to 99.12%; we discarded 8/453 turns for lack of consensus. A summary of the consensus labeling across all 6 analyses discussed above is shown in Table 8. The row and column labels are as above, e.g. the NPN row represents turns consensus-labeled as negative/neutral/positive, first when all three minor classes are conflated with neutral, and second when the weak minor classes are conflated with their main counterparts. [Table 8: consensus-label counts per class, under both the minor → neutral and the weak → main conflations.]</Paragraph> <Paragraph position="1"> We performed our machine learning experiment on the consensus data for all 6 analyses. A summary of our results is shown in Table 9. A comparison of Tables 7-9 shows that for all of our evaluation metrics, our results decrease across all analyses when using consensus data; similar findings were observed in (Ang et al., 2002). While increasing our data set using more difficult examples decreases predictive ability, note that our consensus results are still an improvement over the baseline.</Paragraph> </Section> <Section position="11" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.1 Minor Emotion Classes </SectionTitle> <Paragraph position="0"> Our analyses so far distinguished only our 3 main emotion classes; our 3 minor classes were always conflated with one or the other of the main classes. In part, this is because our minor labels were consistently employed only later in the development of our scheme; in early versions, annotators optionally labeled the minor classes (in the NOTES pane), for the purpose of post-annotation discussion. At present, only the last 5 of our 10 annotated dialogues are consistently labeled with minor classes. Table 10 shows a confusion matrix for the annotation of all 6 emotion classes for these 5 dialogues. [Table 10: confusion matrix over negative, weak negative, neutral, weak positive, positive, and mixed.] Interannotator agreement is 142/211 turns, or 67.30% (Kappa = 0.54).</Paragraph> <Paragraph position="1"> Compared to Section 5, we see that this higher level of granularity yields a lower level of agreement. However, most disagreements fall adjacent to the diagonal, indicating that they are mostly differences in strength rather than differences in polarity. The analyses in Section 5 investigated various means of resolving these differences.</Paragraph> </Section>
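The observation that disagreements cluster next to the diagonal can be checked mechanically. The sketch below splits disagreements into adjacent (strength) and more distant (polarity) differences on the ordered scale; the 5x5 matrix convention is an assumption, and the mixed class is omitted since it has no position on the linear scale.

```python
import numpy as np

# The five scale labels in row/column order (mixed omitted; see lead-in).
SCALE = ["negative", "weak negative", "neutral", "weak positive", "positive"]

def disagreement_breakdown(m: np.ndarray) -> tuple[int, int]:
    """Split off-diagonal counts of a 5x5 confusion matrix into
    adjacent-to-diagonal (strength) and distant (polarity) disagreements."""
    adjacent = sum(int(m[i, j]) for i in range(5) for j in range(5) if abs(i - j) == 1)
    distant = sum(int(m[i, j]) for i in range(5) for j in range(5) if abs(i - j) > 1)
    return adjacent, distant
```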
<Section position="12" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.2 Specific Emotions </SectionTitle> <Paragraph position="0"> Our analyses in Section 5 did not consider the specific emotion annotations in our EMOb pane. This is in part because, as with our minor labels, our specific emotion labels were only consistently employed when annotating the last 5 of our 10 dialogues. If we consider only the 66 turns where both annotators agreed that the turn was negative (weak or strong), and view multiple emotion labels that overlap with single emotions as agreed (e.g. sad+bored agrees with a sad or bored label), interannotator agreement is 45/66 turns, or 68.18% (Kappa = 0.41). The same analysis for the 13 positive turns yields 100% agreement (Kappa = 1).</Paragraph> <Paragraph position="1"> The labels we've included so far are those we've encountered in our human-human tutoring dialogues; we expect to see some differences in the human-computer dialogues, as discussed in Section 6.3, and we continue to employ the other label. In part, the decision about which specific emotions to ultimately recognize in our system depends on what we want the system to adapt to. This in turn requires some understanding of how human tutors adapt to different emotions. For example, perhaps our tutor responds differently to anger, uncertainty, boredom and confusion, but responds the same to most positive emotions. We are currently investigating this in our annotated human-human tutoring dialogues.</Paragraph> </Section> <Section position="13" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 6.3 Human-Computer Corpus </SectionTitle> <Paragraph position="0"> We have just begun annotating our corpus of human-computer spoken tutoring dialogues; to date we have annotated 5 dialogues from 5 different students. We have applied the 6 reliability analyses in this paper to these annotations, and have found again that most disagreements are simply differences in strength rather than differences in polarity. Our best interannotator reliability was found using the NnN, weak → main analysis (contrary to the human-human findings), which gave agreement of 96/115 turns, or 83.48% (Kappa = 0.67).</Paragraph> <Paragraph position="1"> The corpus example in Figure 4 (Appendix A) highlights differences between our human-human and human-computer tutoring dialogues that might impact emotion annotation. First, both the average student turn length in words and the average number of student turns per dialogue are much smaller in the human-computer than in the human-human dialogues. This means that there is less information in the human-computer dialogues to make use of when judging expressed emotions. Second, errors in speech and natural language processing can have a significant effect on the student's emotional state in the human-computer tutoring dialogues. Such emotions concern neither the PHYS domain nor the TUT domain, and suggest that we might want to add a third NLP domain if we want the system to respond to these emotions differently. Relatedly, we already see frequency differences across the human-human and human-computer dialogues with respect to specific emotions, for example an increased use of irritated in the human-computer data. Finally, computer tutors are far less flexible than human tutors. This alone can affect student emotional state, and furthermore it can limit how students express their own emotional states. For example, in the human-human dialogues we see more student initiative, groundings, and references to prior problems.</Paragraph> </Section> </Section> </Paper>