<?xml version="1.0" standalone="yes"?>
<Paper uid="N06-3006">
  <Title>Detecting Emotion in Speech: Experiments in Three Domains</Title>
  <Section position="3" start_page="0" end_page="232" type="intro">
    <SectionTitle>
2 Completed Work
</SectionTitle>
    <Paragraph position="0"> This section describes my current research on emotion classification in three domains and forms the foundation of my dissertation. For each domain, I have adopted an experimental design wherein each utterance in a corpus is annotated with one or more emotion labels, features are extracted from these utterances, and machine learning experiments are run to determine emotion prediction accuracy.</Paragraph>
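As a deliberately simplified illustration of this shared design, the sketch below assumes per-utterance feature vectors and uses a scikit-learn decision tree purely as a stand-in classifier; the actual experiments used the toolkits (RIPPER, BoosTexter, WEKA) described in the subsections that follow, and all names here are illustrative.

```python
# Minimal sketch of the shared design: extract a feature vector for each labeled
# utterance, split into train/test, and measure emotion prediction accuracy.
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def run_experiment(utterances, labels, extract_features, test_size=0.1):
    """utterances: raw audio clips; labels: one emotion label per utterance."""
    X = [extract_features(u) for u in utterances]
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=test_size, random_state=0)
    clf = DecisionTreeClassifier().fit(X_train, y_train)   # stand-in learner
    return accuracy_score(y_test, clf.predict(X_test))
```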
    <Section position="1" start_page="0" end_page="231" type="sub_section">
      <SectionTitle>
2.1 EPSaT
</SectionTitle>
      <Paragraph position="0"> The publicly-available Emotional Prosody Speech and Transcription corpus (EPSaT) comprises recordings of professional actors reading short (four-syllable) dates and numbers (e.g., 'two-thousand-four') in different emotional states. I chose a subset of 44 utterances from 4 speakers (2 male, 2 female) from this corpus and conducted a web-based survey to subjectively label each utterance for each of 10 emotions, divided evenly by valence. These emotions included the positive emotion categories confident, encouraging, friendly, happy, and interested, and the negative emotion categories angry, anxious, bored, frustrated, and sad.</Paragraph>
      <Paragraph position="1"> Several features were extracted from each utterance in this corpus, each one designed to capture emotional content. Global acoustic-prosodic information (e.g., speaking rate and minimum, maximum, and mean pitch and intensity) has been known since the 1960s and 1970s to convey emotion to some extent (e.g., Davitz, 1964; Scherer et al., 1972). In addition to these features, I also included linguistically meaningful prosodic information in the form of ToBI labels (Beckman et al., 2005), as well as the spectral tilt of the vowel bearing the nuclear pitch accent in each utterance.</Paragraph>
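A hedged sketch of how the global acoustic-prosodic statistics might be computed is given below; librosa and the pitch range are assumptions for illustration rather than the tools used in the thesis, and the ToBI labels and spectral tilt require additional annotation and spectral analysis not shown here.

```python
# Illustrative extraction of global pitch and intensity statistics per utterance.
import numpy as np
import librosa

def global_prosodic_features(wav_path):
    y, sr = librosa.load(wav_path, sr=None)
    f0, voiced, _ = librosa.pyin(y, fmin=75, fmax=500, sr=sr)  # pitch track (Hz)
    f0 = f0[~np.isnan(f0)]                                     # keep voiced frames only
    rms = librosa.feature.rms(y=y)[0]                          # frame-level intensity
    return {
        "pitch_min": float(f0.min()), "pitch_max": float(f0.max()),
        "pitch_mean": float(f0.mean()),
        "intensity_min": float(rms.min()), "intensity_max": float(rms.max()),
        "intensity_mean": float(rms.mean()),
    }
```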
      <Paragraph position="2"> In order to evaluate the predictive power of each feature extracted from the EPSaT utterances, I ran machine learning experiments using RIPPER, a rule-learning algorithm. The EPSaT corpus was divided into training (90%) and testing (10%) sets. A binary classification scheme was adopted based on the observed ranking distributions from the perception survey: 'not at all' was considered to indicate the absence of emotion x, and all other ranks were considered to indicate the presence of emotion x. Performance accuracy varied with respect to emotion, but on average I observed 75% prediction accuracy for any given emotion, representing an average 22% improvement over chance performance. The most predictive features included the global acoustic-prosodic features, but interesting novel findings emerged as well; most notably, significant correlation was observed between negative emotions and pitch contours ending in a plateau boundary tone, whereas positive emotions correlated with the standard declarative phrasal ending (in ToBI, these would be labeled as /H-L%/ and /L-L%/, respectively). Further discussion of such findings can be found in (Liscombe et al., 2003).</Paragraph>
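The binarization and rule-learning setup could look roughly as follows; the `wittgenstein` RIPPER re-implementation and the scikit-learn split are stand-ins (the experiments used the original RIPPER system), and the rank strings are assumptions.

```python
# Sketch: 'not at all' marks the absence of emotion x, any other survey rank its
# presence; a RIPPER-style rule learner is then trained on a 90%/10% split.
import wittgenstein as lw
from sklearn.model_selection import train_test_split

def binarize(rank):
    return 0 if rank == "not at all" else 1   # presence/absence of emotion x

def ripper_accuracy(feature_df, ranks):
    y = [binarize(r) for r in ranks]
    X_tr, X_te, y_tr, y_te = train_test_split(feature_df, y, test_size=0.1,
                                              random_state=0)  # 90% train, 10% test
    clf = lw.RIPPER()             # re-implementation standing in for RIPPER
    clf.fit(X_tr, y_tr)
    return clf.score(X_te, y_te)  # accuracy on the held-out 10%
```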
    </Section>
    <Section position="2" start_page="231" end_page="231" type="sub_section">
      <SectionTitle>
2.2 HMIHY
</SectionTitle>
      <Paragraph position="0"> How May I Help You (HMIHY) is a natural language human-computer spoken dialogue system developed at AT&amp;T Research Labs. The system enables AT&amp;T customers to interact verbally with an automated agent over the phone. Callers can ask for their account balance, help with AT&amp;T rates and calling plans, explanations of certain bill charges, or identification of numbers. Speech data collected from the deployed system has been assembled into a corpus of human-computer dialogues. The HMIHY corpus contains 5,690 complete human-computer dialogues that collectively contain 20,013 caller turns. Each caller turn in the corpus was annotated with one of seven emotional labels: positive/neutral, somewhat frustrated, very frustrated, somewhat angry, very angry, somewhat other negative, very other negative ('other negative' refers to any emotion that is perceived negatively but is neither anger nor frustration). However, the distribution of the labels was so skewed (73.1% were labeled as positive/neutral) that the emotions were collapsed to negative and non-negative.</Paragraph>
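A minimal sketch of this label collapsing, assuming the seven annotation strings exactly as written above:

```python
# Collapse the seven HMIHY emotion labels into negative vs. non-negative.
NEGATIVE_LABELS = {
    "somewhat frustrated", "very frustrated",
    "somewhat angry", "very angry",
    "somewhat other negative", "very other negative",
}

def collapse(label):
    return "negative" if label in NEGATIVE_LABELS else "non-negative"
```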
      <Paragraph position="1"> In addition to the set of automatic acoustic-prosodic features found to be useful for emotion classification of the EPSaT corpus, the features I examined in the HMIHY corpus were designed to exploit the discourse information available in the domain of spontaneous human-machine conversation. Transcriptive features (lexical items, filled pauses, and non-speech human noises) were recorded as features, as were the dialogue acts of each caller turn. In addition, I included contextual features designed to track the history of the previously mentioned features over the course of the dialogue. Specifically, contextual information included the rate of change of the acoustic-prosodic features over the previous two turns, as well as the transcriptive and pragmatic features of those turns.</Paragraph>
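The contextual features could be folded in roughly as sketched below; the feature names and the simple turn-to-turn differences are illustrative assumptions, not the exact rate-of-change measures used in the thesis.

```python
# For each caller turn, compute the change in selected acoustic-prosodic features
# relative to the previous two turns; missing history defaults to 0.0.
def contextual_features(turn_feats, i, keys=("pitch_mean", "intensity_mean")):
    """turn_feats: list of per-turn feature dicts, in dialogue order; i: turn index."""
    context = {}
    for back in (1, 2):                       # one and two turns back
        prev = i - back
        for k in keys:
            delta = turn_feats[i][k] - turn_feats[prev][k] if prev >= 0 else 0.0
            context[f"delta_{k}_minus{back}"] = delta
    return context
```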
      <Paragraph position="2"> The corpus was divided into training (75%) and testing (25%) sets. The machine learning algorithm employed was BOOSTEXTER, an algorithm that forms a hypothesis by combining the results of several iterations of weak-learner decisions. Classification accuracy using the automatic acoustic-prosodic features was recorded to be approximately 75%. The majority class baseline (always guessing non-negative) was 73%. By adding the other feature sets one by one, prediction accuracy was iteratively improved, as described more fully in (Liscombe et al., 2005b). Using all the features combined (acoustic-prosodic, lexical, pragmatic, and contextual), the resulting classification accuracy was 79%, a healthy 8% relative improvement over baseline performance and a 5% relative improvement over the automatic acoustic-prosodic features alone.</Paragraph>
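The incremental evaluation could be sketched as below: merge one feature set at a time into the representation and re-run the classifier after each addition. The `evaluate` callback stands in for a full BoosTexter train/test run and is an assumption, as is the data layout.

```python
# Add feature sets one at a time (acoustic-prosodic, lexical, pragmatic, contextual)
# and record the accuracy after each addition; `evaluate` is a placeholder for a
# complete train/test run of the boosting classifier on the merged features.
def incremental_accuracy(feature_sets, evaluate):
    """feature_sets: list of (name, {turn_id: feature_dict}) in the order added."""
    merged, results = {}, {}
    for name, feats in feature_sets:
        for turn_id, d in feats.items():
            merged.setdefault(turn_id, {}).update(d)   # fold this feature set in
        results[name] = evaluate(merged)               # accuracy with all sets so far
    return results
```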
    </Section>
    <Section position="3" start_page="231" end_page="232" type="sub_section">
      <SectionTitle>
2.3 ITSpoke
</SectionTitle>
      <Paragraph position="0"> This section describes more recent research I have been conducting with the University of Pittsburgh's Intelligent Tutoring Spoken Dialogue System (ITSpoke) (Litman and Silliman, 2004). The goal of this research is to wed spoken language technology with instructional technology in order to promote learning gains by enhancing communication richness. ITSpoke is built upon the Why2-Atlas tutoring back-end (VanLehn et al., 2002), a text-based Intelligent Tutoring System designed to tutor students in the domain of qualitative physics using natural language interaction. Several corpora have been recorded for the development of ITSpoke, though most of the work presented here involves tutorial dialogues between a student and a human tutor. To date, we have labeled the human-human corpus for anger, frustration, and uncertainty.</Paragraph>
      <Paragraph position="1"> As this work is an extension of previous work, I chose to extract most of the same features I had extracted from the EPSaT and HMIHY corpora. Specifically, I extracted the same set of automatic acoustic-prosodic features, as well as contextual features measuring the rate of change of acoustic-prosodic features over past student turns. A new feature set was introduced as well, which I refer to as the breath-group feature set; it is based on an automatic method for segmenting utterances into intonationally meaningful units by identifying pauses using background noise estimation. The breath-group feature set comprises the number of breath-groups in each turn, the pause time, and global acoustic-prosodic features calculated for the first, last, and longest breath-group in each student turn.</Paragraph>
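The breath-group segmentation could be approximated as in the sketch below: estimate a background noise floor from the quietest frames and treat long runs of frames below it as pauses. The hop size, percentile, threshold margin, and minimum pause length are assumptions, not the thesis's actual parameters.

```python
# Illustrative pause-based segmentation of a student turn into breath-groups.
import numpy as np
import librosa

def breath_groups(wav_path, min_pause=0.25):
    """Return (start, end) times of pause-delimited breath-groups in a turn."""
    y, sr = librosa.load(wav_path, sr=None)
    hop = 256
    rms = librosa.feature.rms(y=y, hop_length=hop)[0]   # frame-level energy
    noise_floor = np.percentile(rms, 10)                # background-noise estimate
    speech = rms > 2.0 * noise_floor                    # frames above the noise floor
    frame_dur = hop / sr
    groups, start, silent = [], None, 0
    for i, is_speech in enumerate(speech):
        t = i * frame_dur
        if is_speech:
            if start is None:
                start = t
            silent = 0
        elif start is not None:
            silent += 1
            if silent * frame_dur >= min_pause:         # pause long enough: close group
                groups.append((start, t - silent * frame_dur))
                start, silent = None, 0
    if start is not None:                               # close a group still open at the end
        groups.append((start, len(speech) * frame_dur))
    return groups
```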
      <Paragraph position="2"> I used the WEKA machine learning software package to classify whether a student answer was perceived to be uncertain, certain, or neutral in the ITSpoke human-human corpus. As a predictor, C4.5, a decision-tree learner, was boosted with AdaBoost, a learning strategy similar to the one presented in Section 2.2. The data were randomly split into a training set (90%) and a testing set (10%). The automatic acoustic-prosodic features performed at 75% accuracy, a relative improvement of 13% over the baseline performance of always guessing neutral. By adding additional feature sets (contextual and breath-group information), I observed an improved prediction accuracy of 77%, indicating that breath-group features are useful. I refer the reader to (Liscombe et al., 2005a) for in-depth implications and further analysis of these results. In the immediate future, I will extract the features previously mentioned in Section 2.2 as well as the exploratory features I will discuss in the following section.</Paragraph>
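WEKA's boosted-C4.5 setup might look roughly like the following with scikit-learn stand-ins (a CART decision tree boosted with AdaBoost; the `estimator` keyword assumes scikit-learn 1.2 or later), classifying the three certainty labels on a random 90%/10% split.

```python
# Stand-in for boosting a C4.5-style decision tree with AdaBoost on the three-way
# certain / uncertain / neutral task.
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

def certainty_accuracy(X, y, rounds=50):
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.1, random_state=0)
    clf = AdaBoostClassifier(estimator=DecisionTreeClassifier(max_depth=3),
                             n_estimators=rounds)       # estimator= needs sklearn >= 1.2
    clf.fit(X_tr, y_tr)
    return accuracy_score(y_te, clf.predict(X_te))
```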
    </Section>
  </Section>
</Paper>