<?xml version="1.0" standalone="yes"?>
<Paper uid="P96-1038">
  <Title>A Prosodic Analysis of Discourse Segments in Direction- Giving Monologues</Title>
  <Section position="6" start_page="287" end_page="290" type="evalu">
    <SectionTitle>
4 Results
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="287" end_page="288" type="sub_section">
      <SectionTitle>
4.1 Discourse Segmentation
</SectionTitle>
      <Paragraph position="0"> Labels on which all labelers in the group agreed are termed the CONSENSUS LABELS. The consensus labels for segment-initial (SBEG), segment-final (SF), and segment-medial (SCONT, defined as neither SBEG nor SF) phrase labels are given in Table 1. (The SBEG and SF percentages do not necessarily sum to the total consensus agreement percentage, since a phrase is both segment-initial and segment-final when it makes up a segment by itself.) Note that group T and group S segmentations differ significantly, in contrast to earlier findings of Hirschberg and Grosz (1992) on a corpus of read-aloud news stories and in support of informal findings of Swerts (1995). Table 1 shows that group S produced significantly more consensus boundaries than did group T for both read (p&lt;.001, χ²=58.8, df=1) and spontaneous (p&lt;.001, χ²=55.4, df=1) speech. When the read and spontaneous data are pooled, group S agreed upon significantly more SBEG boundaries (p&lt;.05, χ²=4.7, df=1) as well as SF boundaries (p&lt;.05, χ²=4.4, df=1) than did group T. Further, it is not the case that text-alone segmenters simply chose to place fewer boundaries in the discourse; if this were so, we would expect a high percentage of SCONT consensus labels where no SBEGs or SFs were identified. Instead, we find that the number of consensus SCONTs was significantly higher for text-and-speech labelings than for text-alone (p&lt;.001, χ²=49.1, df=1). It appears that the speech signal can help disambiguate among alternate segmentations of the same text. Finally, the data in Table 1 show that spontaneous speech can be segmented as reliably as its read counterpart, contrary to Ayers's results (1992).</Paragraph>
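The chi-square comparisons above (all with df=1) can be sketched as a Pearson chi-square test of independence on a 2x2 table. This is a minimal illustration, not the paper's analysis; the contingency counts below are hypothetical placeholders.

```python
# Pearson chi-square statistic for a 2x2 contingency table (df = 1),
# as used to compare consensus-boundary counts between the text-and-speech
# group (S) and the text-alone group (T). Counts are HYPOTHETICAL.

def chi_square_2x2(table):
    """Return the Pearson chi-square statistic for a 2x2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    chi2 = 0.0
    for i, obs_row in enumerate(table):
        for j, obs in enumerate(obs_row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (obs - expected) ** 2 / expected
    return chi2

# rows: group S, group T; columns: consensus boundary, no consensus
table = [[120, 80], [70, 130]]  # hypothetical counts
print(round(chi_square_2x2(table), 2))  # 25.06
```

A statistic this large with df=1 corresponds to p far below .001, which is the kind of comparison reported for the pooled boundary counts.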
      <Paragraph position="1">  Comparisons of inter-labeler reliability, that is, the reproducibility of a coding scheme given multiple labelers, provide another perspective on the segmentation data. How best to measure inter-labeler reliability for discourse segmentation tasks, especially for hierarchical segmentation, is an open research question (Passonneau and Litman, to appear; Carletta, 1995; Flammia and Zue, 1995; Rotondo, 1984; Swerts, 1995). For comparative purposes, we explored several measures proposed in the literature, namely, COCHRAN'S Q and the KAPPA (κ) COEFFICIENT (Siegel and Castellan, 1988). Cochran's Q, originally proposed in (Hirschberg and Grosz, 1992) to measure the likelihood that similarity among labelings was due to chance, was not useful in the current study; all tests of similarity using this metric (pairwise, or comparing all labelers) gave probabilities near zero. We concluded that this statistic did not serve, for example, to capture the differences observed between labelings from text alone versus labelings from text and speech.</Paragraph>
      <Paragraph position="2"> Recent discourse annotation studies (Isard and Carletta, 1995; Flammia and Zue, 1995) have measured reliability using the κ coefficient, which factors out chance agreement by taking the expected distribution of categories into account. This coefficient is defined as κ = (P_O - P_E) / (1 - P_E), where P_O represents the observed agreement and P_E represents the expected agreement. Typically, values of .7 or higher for this measure provide evidence of good reliability, while values of .8 or greater indicate high reliability. Isard and Carletta (1995) report pairwise κ scores ranging from .43 to .68 in a study of naive and expert classifications of types of 'moves' in the Map Task dialogues. For theory-neutral discourse segmentations of information-seeking dialogues, Flammia (Flammia and Zue, 1995) reports an average pairwise κ of .45 for five labelers and of .68 for the three most similar labelers.</Paragraph>
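The κ definition above is a one-liner; this minimal sketch simply plugs observed and expected agreement into κ = (P_O - P_E) / (1 - P_E).

```python
# Chance-corrected agreement: kappa = (P_O - P_E) / (1 - P_E).
# 1.0 is perfect agreement; 0.0 is agreement expected by chance alone.

def kappa(p_observed, p_expected):
    return (p_observed - p_expected) / (1.0 - p_expected)

# e.g. 90% observed agreement against 50% agreement expected by chance
print(round(kappa(0.90, 0.50), 2))  # 0.8
```

On the scale quoted above, a value of .8 would indicate high reliability.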
      <Paragraph position="3"> An important issue in applying the κ coefficient is how one calculates the expected agreement from prior distributions of categories. We first calculated the prior probabilities for our data based simply on the distribution of SBEG versus non-SBEG labels for all labelers on one of the nine direction-giving tasks in this study, with separate calculations for the read and spontaneous versions. This task, which represented about 8% of the data for both speaking styles, was chosen because it was midway in planning complexity and in length among all the tasks. Using these distributions, we calculated κ coefficients for each pair of labelers in each condition for the remaining eight tasks in our corpus. The observed percentage of SBEG labels, the prior distribution for SBEG, the average of the pairwise κ scores, and the standard deviations of those scores are presented in Table 2. The average κ scores for group T segmenters indicate weak inter-labeler reliability. In contrast, average κ scores for group S are .8 or better, indicating a high degree of inter-labeler reliability. Thus, application of this somewhat stricter reliability metric confirms that the availability of speech critically influences how listeners perceive discourse structure. The calculation of reliability for SBEG versus non-SBEG labeling in effect tests the similarity of linearized segmentations and does not speak to the issue of how similar the labelings are in terms of hierarchical structure. Flammia has proposed a method for generalizing the use of the κ coefficient to hierarchical segmentation that gives an upper-bound estimate on inter-labeler agreement. 5 We applied this metric to our segmentation data, calculating weighted averages for pairwise κ scores averaged for each task. Results for each condition, together with the lowest and highest average κ scores over the tasks, are presented in Table 3.</Paragraph>
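A pairwise κ for binary SBEG / non-SBEG labelings, with P_E derived from a prior SBEG probability estimated on a held-out task, can be sketched as below. The label sequences and the prior are hypothetical; the expected-agreement formula assumes two labelers who each independently assign SBEG with the prior probability.

```python
# Pairwise kappa for binary SBEG (1) vs. non-SBEG (0) labelings.
# p_sbeg is the prior probability of SBEG, estimated on a held-out task.

def pairwise_kappa(labels_a, labels_b, p_sbeg):
    assert len(labels_a) == len(labels_b)
    matches = sum(1 for x, y in zip(labels_a, labels_b) if x == y)
    p_obs = matches / len(labels_a)
    # chance agreement: both say SBEG, or both say non-SBEG
    p_exp = p_sbeg ** 2 + (1.0 - p_sbeg) ** 2
    return (p_obs - p_exp) / (1.0 - p_exp)

# hypothetical SBEG labelings of ten phrases by two labelers
a = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
b = [1, 0, 0, 0, 0, 0, 0, 1, 0, 1]
print(pairwise_kappa(a, b, 0.2))
```

Averaging such pairwise scores within each condition parallels the per-condition averages reported for the two groups.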
      <Paragraph position="4"> 5Flammia uses a flexible definition of segment match to calculate pairwise observed agreement: roughly, a segment in one segmentation is matched if both its SBEG and SF correspond to segment boundary locations in the other segmentation.</Paragraph>
    </Section>
    <Section position="2" start_page="288" end_page="290" type="sub_section">
      <SectionTitle>
4.2 Intonational Features of Segments
</SectionTitle>
      <Paragraph position="0"> For purposes of intonational analysis, we take advantage of the high degree of agreement among our discourse labelers and include in each segment boundary class (SBEG, SF, and SCONT) only the phrases whose classification all subjects agreed upon. We term these the CONSENSUS-LABELED PHRASES, and compare their features to those of all phrases not in the relevant class (i.e., non-consensus-labeled phrases and consensus-labeled phrases of the other types). Note that there were one-third fewer consensus-labeled phrases for text-alone labelings than for text-and-speech (see Table 1). We examined the following acoustic and prosodic features of SBEG, SCONT, and SF consensus-labeled phrases: f0 maximum and f0 average; 6 rms (energy) maximum and rms average; speaking rate (measured in syllables per second); and duration of preceding and subsequent silent pauses. As for the segmentation analyses, we compared intonational correlates of segment boundary types not only for group S versus group T, but also for spontaneous versus read speech. While correlates have been identified in read speech, they have been observed in spontaneous speech only rarely and descriptively.</Paragraph>
      <Paragraph position="1"> 6We calculated f0 maximum in two ways: as simple f0 peak within the intermediate phrase and also as f0 maximum measured at the rms maximum of the sonorant portion of the nuclear-accented syllable in the intermediate phrase (HIGH F0 in the ToBI framework (Pitrelli, Beckman, and Hirschberg, 1994)). The latter measure proved more robust, so we report results based on this metric. The same applies to measurement of rms maximum. Average f0 and rms were calculated over the entire intermediate phrase.</Paragraph>
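The per-phrase features listed above (f0 maximum and average, rms maximum and average, speaking rate, and surrounding pause durations) can be sketched as a simple extraction routine. The data structure and field names are illustrative assumptions, not the paper's tooling, and the simple track maxima stand in for the more robust nuclear-accent-anchored measures described in the footnote.

```python
# Hypothetical per-phrase acoustic-prosodic feature extraction.
from dataclasses import dataclass

@dataclass
class Phrase:
    f0_track: list         # f0 samples (Hz) over the intermediate phrase
    rms_track: list        # rms energy samples over the phrase
    n_syllables: int
    duration_sec: float
    pause_before_sec: float
    pause_after_sec: float

def features(p):
    return {
        "f0_max": max(p.f0_track),
        "f0_avg": sum(p.f0_track) / len(p.f0_track),
        "rms_max": max(p.rms_track),
        "rms_avg": sum(p.rms_track) / len(p.rms_track),
        "rate_syl_per_sec": p.n_syllables / p.duration_sec,
        "pause_before": p.pause_before_sec,
        "pause_after": p.pause_after_sec,
    }

p = Phrase([180.0, 220.0, 200.0], [0.3, 0.5, 0.4], 6, 2.0, 0.8, 0.1)
print(features(p)["rate_syl_per_sec"])  # 3.0
```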
      <Paragraph position="2"> [The extracted entries of Tables 4 and 5, giving the direction of each acoustic-prosodic feature (higher/lower, longer/shorter, faster/slower) by boundary type, are not recoverable here.] We found strong correlations for consensus SBEG, SCONT, and SF phrases for all conditions. Results for group T are given in Table 4, and for group S, in Table 5. Consensus SBEG phrases in all conditions possess significantly higher maximum and average f0, higher maximum and average rms, shorter subsequent pauses, and longer preceding pauses. For consensus SCONT phrases, we found some differences between read and spontaneous speech for both labeling methods. Features for group T included significantly lower f0 maximum and average and lower rms maximum and average for read speech, but only lower f0 maximum for the spontaneous condition.</Paragraph>
      <Paragraph position="3"> Group S features for SCONT were identical to group T except for the absence of a correlation for maximum rms. While SCONT phrases for both speaking styles exhibited significantly shorter preceding and subsequent pauses than other phrases, only the spontaneous condition showed a significantly slower rate. For consensus SF phrases, we again found similar patterns for both speaking styles and both labeling methods, namely lower f0 maximum and average, lower rms maximum and average, faster speaking rate, shorter preceding pauses, and longer subsequent pauses.</Paragraph>
      <Paragraph position="4"> 7T-tests were used to test for statistical significance of differences in the means of two classes of phrases. Results reported are significant at the .005 level or better, except where '*' indicates significance at the .03 level or better. Results were calculated using one-tailed t-tests, except where '†' indicates a two-tailed test.</Paragraph>
      <Paragraph position="5"> While it may appear somewhat surprising that results for both labeling methods match so closely, in fact the correlations for text-and-speech labels presented in Table 5 were almost invariably statistically stronger than those for the text-alone labels presented in Table 4. These more robust results for text-and-speech labelings occur even though the data set of consensus text-and-speech labels is considerably larger than the data set of consensus text-alone labels.</Paragraph>
      <Paragraph position="6">  With a view toward automatically segmenting a spoken discourse, we would like to directly classify phrases of all three discourse categories. But SCONT and SF phrases exhibit similar prominence features and appear distinct from each other only in terms of timing differences. A second issue is whether such classification can be done 'on-line.' To address both of these issues, we made pairwise comparisons of consensus-labeled phrase groups using measures of relative change in acoustic-prosodic parameters over a local window of two consecutive phrases. Table 6 presents significant findings on relative changes in f0, loudness (measured in decibels), and speaking rate, from prior to current intermediate phrase. 8 First, note that SBEG is distinguished from both SCONT and SF in terms of f0 change and db change from the prior phrase; that is, while SBEG phrases are distinguished on a variety of measures from all other phrases (including non-consensus-labeled phrases) in Table 5, this table shows that SBEGs are also distinguishable directly from each of the other consensus-labeled categories. Second, while SCONT and SF appear to share prominence features in Table 5, Table 6 reveals differences between SCONT and SF in the amount of f0 and db change. Thus, in addition to lending themselves to on-line processing, local measures may also capture valuable prominence cues to distinguish between segment-medial and segment-final phrases.</Paragraph>
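The local, on-line measures above can be sketched as relative changes computed over a two-phrase window. The representation and values are hypothetical; the dB change is taken as the standard 20·log10 of the rms ratio, which is an assumption about how loudness change would be derived.

```python
# Relative change in f0, loudness (dB), and speaking rate from the prior
# intermediate phrase to the current one (a two-phrase window).
import math

def local_changes(prev, cur):
    """prev/cur: dicts with 'f0' (Hz), 'rms' (linear energy), 'rate' (syl/sec)."""
    return {
        "f0_change_hz": cur["f0"] - prev["f0"],
        # convert the rms ratio to decibels
        "db_change": 20.0 * math.log10(cur["rms"] / prev["rms"]),
        "rate_change": cur["rate"] - prev["rate"],
    }

prev = {"f0": 180.0, "rms": 0.40, "rate": 4.0}   # hypothetical prior phrase
cur = {"f0": 230.0, "rms": 0.50, "rate": 3.6}    # hypothetical current phrase
print(local_changes(prev, cur))
```

Because each measure depends only on the current and immediately prior phrase, it can be computed incrementally as speech arrives, which is what makes it suitable for on-line classification.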
    </Section>
  </Section>
</Paper>