<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2318">
  <Title>Prosodic Cues to Discourse Segment Boundaries in Human-Computer Dialogue</Title>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Data Collection
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Speech System Description
</SectionTitle>
      <Paragraph position="0"> The utterances used in the analysis of discourse segmentation for human-computer interaction were drawn from approximately 60 hours of user interactions in a field trial of the Sun Microsystems SpeechActs system (Yankelovich et al., 1995). SpeechActs provides a voice-only interface to common desktop applications such as e-mail, calendar, weather reports, stock quotes, and time and currency conversions. The system allows conversational interactions, with speech recognition provided by BBN's HARK recognizer and speech synthesis through Centigram's TruVoice system. A sample interaction appears below.1
SpeechActs: You have 14 new messages and 47 old messages. The first new message is from Eric Baatz regarding &amp;quot;report completed.&amp;quot;
User: Let me hear it.</Paragraph>
      <Paragraph position="1"> SpeechActs: &amp;quot;The first draft is ready for your comments. Eric.&amp;quot;
User: Reply and include current message.</Paragraph>
      <Paragraph position="2"> SpeechActs: Preparing message to Eric Baatz.</Paragraph>
      <Paragraph position="3"> Begin recording after the tone.</Paragraph>
      <Paragraph position="4"> When finished, pause several seconds.</Paragraph>
      <Paragraph position="5"> User: Eric, I'll get those to you this afternoon.</Paragraph>
      <Paragraph position="6"> SpeechActs: Say cancel, send, or review.</Paragraph>
      <Paragraph position="7"> User: Send.</Paragraph>
      <Paragraph position="8"> SpeechActs: Message sent. What now?
User: Next message.</Paragraph>
      <Paragraph position="9"> SpeechActs: New message two from Stuart Adams, subject &amp;quot;Can we meet at 3:00 today?&amp;quot;
User: Switch to calendar...</Paragraph>
      <Paragraph position="10"> The field trial involved a group of nineteen subjects. Four of the participants were members of the system development staff, fourteen were volunteers drawn from Sun Microsystems' staff, and a final class of subjects consisted of one-time guest users. There were three female and fifteen male regular user subjects.</Paragraph>
      <Paragraph position="11"> All interactions with the system were recorded during the conversation and digitized in standard telephone-quality audio format: 8 kHz sampling with 8-bit mu-law encoding. In addition, speech recognition results, parser results, and synthesized responses were logged. A paid assistant then produced a correct verbatim transcript of all user utterances. Overall, 7,752 user utterances were recorded.</Paragraph>
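As a minimal sketch (not part of the original system), a single 8-bit G.711 mu-law code of the kind described above can be decoded to a 16-bit linear PCM sample as follows; this is the same conversion needed to obtain a 16-bit linear version of the audio for alignment:

```python
def mulaw_to_linear(byte: int) -> int:
    """Decode one 8-bit G.711 mu-law code to a 16-bit linear PCM sample."""
    b = ~byte & 0xFF          # mu-law bytes are stored bit-complemented
    sign = b & 0x80           # high bit carries the sign
    exponent = (b >> 4) & 0x07
    mantissa = b & 0x0F
    # Standard G.711 expansion: rebuild the magnitude from segment and step.
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude
```

For example, the code 0xFF decodes to silence (0), and the codes 0x80 and 0x00 decode to the extreme positive and negative sample values.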
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Data Coding and Extraction
</SectionTitle>
      <Paragraph position="0"> Consistent discourse segmentation can be difficult even for trained experts (Nakatani et al., 1995; Swerts, 1997; Passonneau and Litman, 1997), and differences in depth of nesting for discourse structure appear to be the most problematic. As a result, we chose to examine utterances whose segment- and topic-initiating status would be relatively unambiguous. Since the SpeechActs system consists of 6 different applications, we chose to focus on changes from application to application as reliable indicators of topic initiation. These commands are either simply the name of the desired application, as in &amp;quot;Mail&amp;quot; or &amp;quot;Calendar&amp;quot;, possibly with an optional politeness term, or a switch command, such as &amp;quot;Switch to&amp;quot; followed by the name of the application. Approximately 1400 such utterances occurred during the field trial data collection.</Paragraph>
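A minimal sketch of how such application-switch commands could be matched automatically. The application names are drawn from the SpeechActs description; the politeness wording and the exact regular expression are illustrative assumptions, not the authors' actual extraction procedure:

```python
import re

# Application names from the SpeechActs description; the "please"
# politeness term is an illustrative assumption.
APPS = r"(?:mail|calendar|weather|stock quotes|conversions)"
SWITCH = re.compile(
    rf"^(?:please\s+)?(?:switch\s+to\s+)?{APPS}(?:\s+please)?\s*[.!?]*$",
    re.IGNORECASE,
)

def is_topic_initiating(utterance: str) -> bool:
    """True if the utterance looks like an application-switch command."""
    return SWITCH.match(utterance.strip()) is not None
```

Under this sketch, bare application names ("Mail") and explicit switches ("Switch to calendar") count as topic-initiating, while in-application commands ("Next message") do not.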
      <Paragraph position="1"> We performed an automatic forced alignment in order to identify and extract the relevant utterances from the digitized audio. Using the full sequence of synthesized computer utterances and manually transcribed user utterances, we applied the align function of the Sonic speech recognizer, provided as part of the University of Colorado (CU) Communicator system, to a 16-bit linear version of the original audio recording. The 473 utterances that were correctly aligned by this automatic process were used for the current analysis.</Paragraph>
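Once alignment produces start and end times for an utterance, pulling out its samples is a simple slice. This is a generic sketch assuming 8 kHz audio and hypothetical timestamps, not the Sonic aligner's own API:

```python
def extract_region(samples, start_s, end_s, rate=8000):
    """Return the audio samples between two alignment timestamps (seconds)."""
    return samples[int(start_s * rate):int(end_s * rate)]
```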
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Acoustic Feature Extraction
</SectionTitle>
    <Paragraph position="0"> Based on prior results for monologue, we selected pitch and amplitude features for consideration. Although silence duration is often a good cue to discourse segment boundary position in narrative, we excluded it from the current study because of the awkward pace of the SpeechActs human-computer interactions: users had to wait for a tone before speaking, and inter-turn silences were as long as six seconds.</Paragraph>
    <Paragraph position="1"> We used the &amp;quot;To Pitch...&amp;quot; and &amp;quot;To Intensity&amp;quot; functions in Praat (Boersma, 2001), a freely available acoustic-phonetic analysis package, to automatically extract the pitch (in Hertz) and amplitude (in decibels) for the interaction. To smooth out local jitter and noise in the pitch and amplitude contours, we applied a 5-point median filter. Finally, in order to provide overall comparability across male and female subjects and across different channel characteristics for different sessions2, we performed per-speaker, per-session normalization of pitch and amplitude values, computed as z-scores: (value - mean) / (standard deviation), with the mean and standard deviation estimated per speaker and session. The resulting pitch and amplitude values within the time regions identified for each utterance by forced alignment were used for subsequent analysis. 2Since the interface was accessed over a regular analog telephone line from a wide variety of locations - including noisy international airports - the recording quality and level varied widely.</Paragraph>
    <Paragraph position="2"> [Figure: per-utterance pitch and intensity. Light grey: segment-initial; dark grey: segment-final.]</Paragraph>
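The smoothing and normalization steps described above can be sketched as follows. This is a minimal illustration under the stated assumptions (5-point window, per-track z-scores), not the authors' code, and Praat's extraction itself is not reproduced:

```python
import statistics

def median_filter5(track):
    """5-point median filter to smooth local jitter in a pitch/amplitude contour."""
    out = []
    for i in range(len(track)):
        window = track[max(0, i - 2):i + 3]  # window shrinks at the edges
        out.append(statistics.median(window))
    return out

def zscore(track):
    """Per-speaker, per-session normalization: (value - mean) / std deviation."""
    mu = statistics.mean(track)
    sigma = statistics.pstdev(track)
    return [(v - mu) / sigma for v in track]
```

In use, the median filter removes isolated pitch-tracker spikes while leaving level stretches untouched, and z-scoring puts male and female speakers, and sessions with different channel gains, on a common scale.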
  </Section>
  <Section position="7" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Prosodic Analysis
</SectionTitle>
    <Paragraph position="0"> For both pitch and amplitude we computed summary scalar measures for each utterance. Mean pitch and intensity are intended to capture overall increases or decreases. Maximum and minimum pitch and maximum amplitude serve to describe increases in range that might not affect the overall average. We compared the segment-initial &amp;quot;application change&amp;quot; utterances with their immediately preceding segment-final utterances.3 We find significant increases in the maximum pitch, mean pitch, and mean amplitude of segment-initial utterances relative to segment-final cases, and a highly significant decrease in minimum pitch for segment-final utterances relative to segment-initial utterances. Changes in maximum amplitude did not reach significance. Figure 5 illustrates these changes.</Paragraph>
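The per-utterance summary scalar measures named above can be sketched as a small feature extractor. The function and field names here are hypothetical; only the set of measures (mean, maximum, and minimum pitch; mean and maximum amplitude) comes from the text:

```python
def prosodic_summary(pitch, amplitude):
    """Summary scalar prosodic measures for one utterance's contours."""
    return {
        "pitch_mean": sum(pitch) / len(pitch),
        "pitch_max": max(pitch),
        "pitch_min": min(pitch),
        "amp_mean": sum(amplitude) / len(amplitude),
        "amp_max": max(amplitude),
    }
```

Each segment-initial utterance's summary would then be compared against the summary of the immediately preceding segment-final utterance.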
  </Section>
  <Section position="8" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Discussion
</SectionTitle>
    <Paragraph position="0"> The significant increases in maximum and mean pitch for segment-initial utterances, coupled with a decrease in pitch minimum for segment-final utterances, suggest a contrastive use of pitch range across the segment boundary. For amplitude, there is a global increase in intensity. These basic features of discourse segment-initial versus discourse segment-final utterances are consistent with the prior findings for monologue. 3For consistency, we excluded utterances that participated in error spirals, as well as segment-final utterances that were also segment-initial.</Paragraph>
    <Paragraph position="1"> It is interesting to note that, in spite of the less-than-fluent style of interaction imposed on users by the prototype system, cues to discourse segment structure remain robust and consistent. We also observe that the contrasts across discourse segment boundaries are based on the speaker's own baseline prosodic behavior, rather than the conversational partner's, at least in this largely user-initiative system.</Paragraph>
  </Section>
</Paper>