<?xml version="1.0" standalone="yes"?> <Paper uid="H05-1073"> <Title>Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), pages 579-586, Vancouver, October 2005. ©2005 Association for Computational Linguistics. Emotions from text: machine learning for text-based emotion prediction</Title> <Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> Abstract </SectionTitle> <Paragraph position="0"> In addition to information, text contains attitudinal, and more specifically, emotional content. This paper explores the text-based emotion prediction problem empirically, using supervised machine learning with the SNoW learning architecture. The goal is to classify the emotional affinity of sentences in the narrative domain of children's fairy tales, for subsequent use in appropriately expressive text-to-speech synthesis. Initial experiments on a preliminary data set of 22 fairy tales show encouraging results over a naïve baseline and a bag-of-words (BOW) approach for classification of emotional versus non-emotional content, with some dependency on parameter tuning. We also discuss results for a tripartite model which covers emotional valence, as well as feature set alterations. In addition, we present plans for a more cognitively sound sequential model, taking into consideration a larger set of basic emotions.</Paragraph> </Section> <Section position="6" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Text communicates not only informative content but also attitudinal information, including emotional states. This paper reports on an empirical study of text-based emotion prediction.</Paragraph> <Paragraph position="1"> Section 2 gives a brief overview of the intended application area, whereas section 3 summarizes related work. 
Next, section 4 explains the empirical study, including the machine learning model, the corpus, the feature set, and parameter tuning. Section 5 presents experimental results from two classification tasks and feature set modifications. Section 6 describes the agenda for refining the model, before presenting concluding remarks in section 7.</Paragraph> </Section> <Section position="7" start_page="0" end_page="579" type="metho"> <SectionTitle> 2 Application area: Text-to-speech </SectionTitle> <Paragraph position="0"> Narrative text is often especially rich in emotional content. In the literary genre of fairy tales, emotions such as HAPPINESS and ANGER and related cognitive states, e.g. LOVE or HATE, become integral parts of the story plot, and thus are of particular importance. Moreover, the storyteller reading the story interprets emotions in order to orally convey the story in a fashion which makes the story come alive and catches the listeners' attention.</Paragraph> <Paragraph position="1"> In speech, speakers effectively express emotions by modifying prosody, including pitch, intensity, and durational cues in the speech signal. Thus, in order to make text-to-speech synthesis sound as natural and engaging as possible, it is important to convey the emotional stance in the text. However, this implies first having identified the appropriate emotional meaning of the corresponding text passage.</Paragraph> <Paragraph position="2"> Thus, an application for emotional text-to-speech synthesis has to solve two basic problems. First, what emotion or emotions most appropriately describe a certain text passage, and second, given a text passage and a specified emotional mark-up, how to render the prosodic contour in order to convey the emotional content (Cahn, 1990). 
The text-based emotion prediction task (TEP) addresses the first of these two problems.</Paragraph> </Section> <Section position="8" start_page="579" end_page="579" type="metho"> <SectionTitle> 3 Previous work </SectionTitle> <Paragraph position="0"> For a general overview of the field of affective computing, see (Picard, 1997). (Liu, Lieberman and Selker, 2003) is a rare study in text-based inference of sentence-level emotional affinity. The authors adopt the notion of basic emotions, cf. (Ekman, 1993), and use six emotion categories: ANGER, DISGUST, FEAR, HAPPINESS, SADNESS, SURPRISE. They critique statistical NLP for being unsuccessful at the small sentence level, and instead use a database of common-sense knowledge and create affect models which are combined to form a representation of the emotional affinity of a sentence.</Paragraph> <Paragraph position="1"> At its core, the approach remains dependent on an emotion lexicon and hand-crafted rules for conceptual polarity. In order to be effective, emotion recognition must go beyond such resources; the authors themselves note that lexical affinity is fragile. The method was tested on 20 users' preferences for an email client, based on user-composed text emails describing short but colorful events. While the users preferred the emotional client, this evaluation does not reveal emotion classification accuracy, nor how well the model generalizes to a large data set.</Paragraph> <Paragraph position="2"> Whereas work on emotion classification from the point of view of natural speech and human-computer dialogues is fairly extensive, e.g. (Scherer, 2003), (Litman and Forbes-Riley, 2004), this appears not to be the case for text-to-speech synthesis (TTS). A short study by (Sugimoto et al., 2004) addresses sentence-level emotion recognition for Japanese TTS. 
Their model uses a composition assumption: the emotion of a sentence is a function of the emotional affinity of the words in the sentence.</Paragraph> <Paragraph position="3"> They obtain emotional judgements of 73 adjectives and a set of sentences from 15 human subjects and compute words' emotional strength based on how often a word or a sentence was judged to fall into a particular emotion bucket, relative to the number of human subjects. Additionally, they conducted an interactive experiment concerning the acoustic rendering of emotion, using manual tuning of prosodic parameters for Japanese sentences. While the authors actually address the two fundamental problems of emotional TTS, their approach is impractical and most likely cannot scale up to a real corpus. Again, while lexical items with clear emotional meaning, such as happy or sad, matter, emotion classification probably needs to consider additional inference mechanisms. Moreover, a naïve compositional approach to emotion recognition is risky due to simple linguistic facts, such as context-dependent semantics, the prevalence of words with multiple meanings, and emotional negation.</Paragraph> <Paragraph position="4"> Many NLP problems address attitudinal meaning distinctions in text, such as detecting subjective opinion documents or expressions, e.g. (Wiebe et al., 2004), measuring strength of subjective clauses (Wilson, Wiebe and Hwa, 2004), determining word polarity (Hatzivassiloglou and McKeown, 1997) or texts' attitudinal valence, e.g. (Turney, 2002), (Bai, Padman and Airoldi, 2004), (Beineke, Hastie and Vaithyanathan, 2003), (Mullen and Collier, 2003), (Pang and Lee, 2003). Here, it suffices to say that the targets, the domain, and the intended application differ; our goal is to classify emotional text passages in children's stories, and eventually use this information for rendering expressive child-directed storytelling in a text-to-speech application. This can be useful, e.g. 
in therapeutic education of children with communication disorders (van Santen et al., 2003).</Paragraph> </Section> <Section position="9" start_page="579" end_page="582" type="metho"> <SectionTitle> 4 Empirical study </SectionTitle> <Paragraph position="0"> This part covers the experimental study with a formal problem definition, computational implementation, data, features, and a note on parameter tuning.</Paragraph> <Section position="1" start_page="579" end_page="580" type="sub_section"> <SectionTitle> 4.1 Machine learning model </SectionTitle> <Paragraph position="0"> Determining the emotion of a linguistic unit can be cast as a multi-class classification problem. For the flat case, let T denote the text, and s an embedded linguistic unit, such as a sentence, where s ∈ T. Let k be the number of emotion classes E = {em_1, em_2, ..., em_k}, where em_1 denotes the special case of neutrality, or absence of emotion. The goal is to determine a mapping function f : s → em_i, such that we obtain an ordered labeled pair (s, em_i).</Paragraph> <Paragraph position="1"> The mapping is based on F = {f_1, f_2, ..., f_n}, where F contains the features derived from the text.</Paragraph> <Paragraph position="2"> Furthermore, if multiple emotion classes can characterize s, then given E' ⊆ 
E, the target of the mapping function becomes the ordered pair (s, E').</Paragraph> <Paragraph position="3"> Finally, as further discussed in section 6, the hierarchical case of label assignment requires a sequential model that further defines levels of coarse versus fine-grained classifiers, as done by (Li and Roth, 2002) for the question classification problem.</Paragraph> </Section> <Section position="2" start_page="580" end_page="580" type="sub_section"> <SectionTitle> 4.2 Implementation </SectionTitle> <Paragraph position="0"> Although our goal is to predict finer emotional meaning distinctions according to emotional categories in speech, in this study we focus on the basic task of recognizing emotional passages and on determining their valence (i.e. positive versus negative), because we currently do not have enough training data to explore finer-grained distinctions. The goal here is to get a good understanding of the nature of the TEP problem and explore features which may be useful.</Paragraph> <Paragraph position="1"> We explore two cases of flat classification, using a variation of the Winnow update rule implemented in the SNoW learning architecture (Carlson et al., 1999),[1] which learns a linear classifier in feature space, and has been successful in several NLP applications, e.g. semantic role labeling (Koomen, Punyakanok, Roth and Yih, 2005). In the first case, the set of emotion classes E consists of EMOTIONAL versus non-emotional or NEUTRAL, i.e. E = {N, E}. In the second case, E has been extended with emotional distinctions according to the valence, i.e. E = {N, PE, NE}. Experiments used 10-fold cross-validation, with 90% of the data used for training and 10% for testing in each fold.[2]</Paragraph> </Section> <Section position="3" start_page="580" end_page="581" type="sub_section"> <SectionTitle> 4.3 Data </SectionTitle> <Paragraph position="0"> The goal of our current data annotation project is to annotate a corpus of approximately 185 children's stories, including Grimms', H.C. 
Andersen's and B. Potter's stories.</Paragraph> <Paragraph position="1"> So far, the annotation process proceeds as follows: annotators work in pairs on the same stories. They have been trained separately and work independently in order to avoid any annotation bias and get a true understanding of the task difficulty. Each annotator labels each sentence with one of eight primary emotions (see table 1), reflecting an extended set of basic emotions (Ekman, 1993). In order to make the annotation process more focused, emotion is annotated from the point of view of the text, i.e. the feeler in the sentence. While the primary emotions are targets, the sentences are also marked for other affective contents, i.e. background mood, secondary emotions via intensity, feeler, and textual cues. Disagreements in annotations are resolved by a second pass of tie-breaking by the first author, who chooses one of the competing labels. [Footnote fragment: ... results are not included. Overall, Perceptron performed worse.]</Paragraph> <Paragraph position="2"> Eventually, the completed annotations will be made available.</Paragraph> <Paragraph position="3"> Emotion annotation is hard; inter-annotator agreement currently ranges from κ = .24 to .51, with the ratio of observed annotation overlap ranging between 45% and 64%, depending on annotator pair and stories assigned. This is expected, given the subjective nature of the annotation task. The lack of a clear definition for emotion vs. non-emotion is acknowledged across the emotion literature, and contributes to dynamic and shifting annotation targets. Indeed, a common source of confusion is NEUTRAL, i.e. deciding whether a sentence is emotional or non-emotional. Emotion perception also depends on which character's point of view the annotator takes, and on extratextual factors such as the annotator's personality or mood. It is possible that by focusing more on the training of annotator pairs, particularly on joint training, agreement might improve. 
However, that would also introduce a bias, which is probably less desirable than capturing actual perception. Moreover, what agreement levels are needed for successful expressive TTS remains an empirical question.</Paragraph> <Paragraph position="4"> The current data set is a preliminary annotated and tie-broken set of 1580 sentences, or 22 Grimms' tales. The label distribution is in table 2. NEUTRAL was most frequent with 59.94%.</Paragraph> <Paragraph position="5"> Next, for the purpose of this study, all emotional classes, i.e. A, D, F, H, SA, SU+, SU-, were combined into one emotional superclass E for the first experiment, as shown in table 3. For the second experiment, we used two emotional classes, i.e. positive versus negative emotions; PE = {H, SU+} and NE = {A, D, F, SA, SU-}, as seen in table 4.</Paragraph> </Section> <Section position="4" start_page="581" end_page="582" type="sub_section"> <SectionTitle> 4.4 Feature set </SectionTitle> <Paragraph position="0"> The feature extraction was written in Python. SNoW only requires active features as input, which resulted in a typical feature vector size of around 30 features.</Paragraph> <Paragraph position="1"> The features are listed below. They were implemented as boolean values, with continuous values represented by ranges. The ranges generally overlapped, in order to get more generalization coverage.</Paragraph> <Paragraph position="2">
1. First sentence in story
2. Conjunctions of selected features (see below)
3. Direct speech (i.e. whole quote) in sentence
4. Thematic story type (3 top and 15 sub-types)
5. Special punctuation (! and ?)
6. Complete upper-case word
7. Sentence length in words (0-1, 2-3, 4-8, 9-15, 16-25, 26-35, >35)
8. Ranges of story progress (5-100%, 15-100%, 80-100%, 90-100%)
9. Percent of JJ, N, V, RB (0%, 1-100%, 50-100%, 80-100%)
10. V count in sentence, excluding participles (0-1, 0-3, 0-5, 0-7, 0-9, >9)
11. Positive and negative word counts (≥ 1, ≥ 2, ≥ 3, ≥ 4, ≥ 5, ≥ 6)
12. WordNet emotion words
13. Interjections and affective words
14. Content BOW: N, V, JJ, RB words by POS
Feature conjunctions covered pairings of counts of positive and negative words with range of story progress or interjections, respectively.</Paragraph> <Paragraph position="3"> Feature groups 1, 3, 5, 6, 7, 8, 9, 10 and 14 are extracted automatically from the sentences in the stories, with the SNoW POS-tagger used for features 9, 10, and 14. Group 10 reflects how many verbs are active in a sentence. Together with the quotation and punctuation, verb dominance is intended to capture the assumption that emotion is often accompanied by increased action and interaction. Feature group 4 is based on Finnish scholar Antti Aarne's classes of folk-tale types according to their informative thematic contents (Aarne, 1964). The current tales have 3 top story types (ANIMAL TALES, ORDINARY FOLK-TALES, and JOKES AND ANECDOTES), and 15 subtypes (e.g. supernatural helpers is a subtype of the ORDINARY FOLK-TALE). This feature is intended to provide an idea about the story's general affective personality (Picard, 1997), whereas the feature reflecting the story progress is intended to capture the possibility that some emotions are more prevalent in certain sections of the story (e.g. the happy ending).</Paragraph> <Paragraph position="4"> For semantic tasks, words are obviously important. In addition to considering 'content words', we also explored specific word lists. Group 11 uses 2 lists of 1636 positive and 2008 negative words, obtained from (Di Cicco et al., online). Group 12 uses lexical lists extracted from WordNet (Fellbaum, 1998), on the basis of the primary emotion words in their adjectival and nominal forms. For the adjectives, PyWordNet's (Steele et al., 2004) SIMILAR feature was used to retrieve similar items of the primary emotion adjectives, exploring one additional level in the hierarchy (i.e. similar items of all senses of all words in the synset). 
For the nouns and any identical verbal homonyms, synonyms and hyponyms were extracted manually.[3] Feature group 13 used a short list of 22 interjections collected manually by browsing educational ESL sites, whereas the affective word list of 771 words consisted of a combination of the non-neutral words from (Johnson-Laird and Oatley, 1989) and (Siegle, online). Only a subset of the entries in these lexical lists actually occurred.[4] [3] Multi-words were transformed to hyphenated form.</Paragraph> <Paragraph position="5"> [4] At this point, neither stems and bigrams nor a list of onomatopoeic words contribute to accuracy. Intermediate resource processing inserted some feature noise.</Paragraph> <Paragraph position="6"> The above feature set is henceforth referred to as all features, whereas content BOW is just group 14. The content BOW is a more interesting baseline than the naïve one, P(Neutral), i.e. always assigning the most likely NEUTRAL category. Lastly, emotions blend and transform (Liu, Lieberman and Selker, 2003). Thus, emotion and background mood of immediately adjacent sentences, i.e. the sequencing, seems important. At this point, it is not implemented automatically. Instead, it was extracted from the manual emotion and mood annotations. If sequencing proves important, an automatic method using sequential target activation could be added next.</Paragraph> </Section> <Section position="5" start_page="582" end_page="582" type="sub_section"> <SectionTitle> 4.5 Parameter tuning </SectionTitle> <Paragraph position="0"> The Winnow parameters that were tuned included promotional α, demotional β, activation threshold θ, initial weights ω, and the regularization parameter S, which implements a margin between positive and negative examples. 
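</Paragraph> <Paragraph> As a rough illustration of how parameters of this kind interact in a Winnow-style learner, the following Python sketch implements a single target node over sparse boolean features. It is not the SNoW implementation: the class name WinnowNode, its method names, the feature names, and the exact treatment of the margin (an update fires whenever an example lies on the wrong side of, or inside, a band of half-width margin around the threshold) are assumptions made for illustration only. The default values follow the first parameter setting reported later in this section.

```python
from collections import defaultdict


class WinnowNode:
    """One Winnow target node over sparse boolean features (illustrative only)."""

    def __init__(self, alpha=1.1, beta=0.5, theta=5.0, init_weight=1.0, margin=0.5):
        self.alpha = alpha      # promotion factor, greater than 1
        self.beta = beta        # demotion factor, between 0 and 1
        self.theta = theta      # activation threshold
        self.margin = margin    # assumed half-width of the band around theta
        # A feature receives init_weight the first time it is seen.
        self.weights = defaultdict(lambda: init_weight)

    def score(self, active):
        # Only active features contribute, mirroring sparse input.
        return sum(self.weights[f] for f in active)

    def predict(self, active):
        return self.score(active) > self.theta

    def update(self, active, is_positive):
        """Apply one mistake-driven update; return True if weights changed."""
        s = self.score(active)
        if is_positive and self.theta + self.margin >= s:
            factor = self.alpha     # promote: positive example scored too low
        elif not is_positive and s >= self.theta - self.margin:
            factor = self.beta      # demote: negative example scored too high
        else:
            return False            # confidently correct, leave weights alone
        for f in active:
            self.weights[f] *= factor
        return True


node = WinnowNode(alpha=2.0)
sentence = ["has_exclamation", "direct_speech"]   # hypothetical feature names
updates = 0
while node.update(sentence, is_positive=True):
    updates += 1
print(updates, node.predict(sentence))   # prints: 2 True
```

In this toy run the two active features are promoted twice, after which the example clears the threshold plus margin and no further update fires. A larger promotion factor or a lower threshold would reach that point in fewer updates, which is one reason such parameters are tuned jointly.</Paragraph> <Paragraph> 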
Given the currently fairly limited data, results from 2 alternative tuning methods, applied to all features, are reported.</Paragraph> <Paragraph position="1"> * For the condition called sep-tune-eval, 50% of the sentences were randomly selected and set aside to be used for the parameter tuning process only. Of this subset, 10% were subsequently randomly chosen as test set with the remaining 90% used for training during the automatic tuning process, which covered 4356 different parameter combinations. Resulting parameters were: promotion α = 1.1, demotion β = 0.5, threshold θ = 5, initial weight ω = 1.0, S = 0.5. The remaining half of the data was used for training and testing in the 10-fold cross-validation evaluation. (Also, note the slight change for P(Neutral) in table 5, due to randomly splitting the data.)
* Given that the data set is currently small, for the condition named same-tune-eval, tuning was performed automatically on all data using a slightly smaller set of combinations, and then manually adjusted against the 10-fold cross-validation process. Resulting parameters were: α = 1.2, β = 0.9, θ = 4, ω = 1, S = 0.5. All data was used for evaluation.</Paragraph> <Paragraph position="2"> Emotion classification was sensitive to the selected tuning data. Generally, a smaller tuning set resulted in worse parameter settings. The random selection could also make a difference, but this was not explored.</Paragraph> </Section> </Section> </Paper>