<?xml version="1.0" standalone="yes"?>
<Paper uid="N03-3007">
  <Title>Word Fragment Identification Using Acoustic-Prosodic Features in Conversational Speech</Title>
  <Section position="3" start_page="0" end_page="0" type="intro">
    <SectionTitle>
2 Acoustic and Prosodic Features
</SectionTitle>
    <Paragraph position="0"> Our hypothesis is that when the speaker suddenly stops in the middle of a word, some prosodic cues and voice quality characteristics exist at the boundary of word fragments; hence, our approach is to extract a variety of acoustic and prosodic features, and build a classifier using these features for the automatic identification of word fragments.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Prosodic Features
</SectionTitle>
      <Paragraph position="0"> Recently, prosodic information has gained more importance in speech processing (Shriberg and Stolcke, 2002).</Paragraph>
      <Paragraph position="1"> Prosody, the &amp;quot;rhythm&amp;quot; and &amp;quot;melody&amp;quot; of speech, is important for extracting structural information and automating rich transcriptions. Past research results suggest that speakers use prosody to impose structure on both spontaneous and read speech. Such prosodic indicators include pause duration, change in pitch range and amplitude, global pitch declination, and speaking rate variation. Since these features provide information complementary to the word sequence, they provide a potentially valuable source of additional information. Furthermore, prosodic cues by their nature are relatively unaffected by word identity, and thus may provide a robust knowledge source when the speech recognition error rate is high.</Paragraph>
      <Paragraph position="2"> In the following we describe some of the prosodic features we have investigated for the word fragment detection task. These prosodic features have been employed previously for the task of detecting structural information in spontaneous speech such as sentence boundary, disfluencies, and dialog act. Experiments have shown that prosody model yields a performance improvement when combined with lexical information over using word level information alone (Shriberg and Stolcke, 2002).</Paragraph>
      <Paragraph position="3"> We used three main types of prosodic features -- duration, pitch and energy. Duration features were extracted from the alignments obtained from the speech recognizer. Examples of duration features are word duration, pause duration, and duration of the last rhyme in the word. Duration features are normalized in different ways such as by using the overall phone duration statistics, and speaker-specific duration statistics.</Paragraph>
      <Paragraph position="4"> To obtain F0 features, pitch tracks were extracted from the speech signal and then post-processed by using a lognormal tied mixture model and a median filter (Sonmez et al., 1997), which computes a set of speaker-specific pitch range parameters. Pitch contours were then stylized, fit by a piecewise linear model. Examples of pitch features computed from the stylized F0 contours are the distance from the average pitch in the word to the speaker's baseline F0 value, the pitch slope of the word before the boundary, and the difference of the stylized pitch across word boundary.</Paragraph>
      <Paragraph position="5"> For energy features, we first computed the frame-level energy values of the speech signal, then similarly to the approach used for F0 features, we post-processed the raw energy values to get the stylized energy.</Paragraph>
      <Paragraph position="6"> In addition to these prosodic features, we also included features to represent some ancillary information, such as the gender of the speaker, the position of the current word in the turn3, and whether there is a turn change. We included these non-prosodic features to account for the possible interactions between them and the other prosodic features.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Voice Quality Measures
</SectionTitle>
      <Paragraph position="0"> Human speech sounds are commonly considered to result from a combination of a sound energy source modulated by a transfer (filter) function determined by the shape of the vocal tract. As the vocal cords open and close, puffs of air flow through glottal opening. The frequency of these pulses determines the fundamental frequency of the laryngeal source and contributes to the perceived pitch of the produced sound.</Paragraph>
      <Paragraph position="1"> The voice source is an important factor affecting the voice quality, and thus much investigation focuses on the voice source characteristics. The analysis of voice source has been done by inverse filtering the speech waveform, analyzing the spectrum, or by directly measuring the airflow at the mouth for non-pathological speech. A widely used model for voice source is the Liljencrants-Fant (LF) model (Fant et al., 1985; Fant, 1995). Research has shown that the intensity of the produced acoustic wave depends more on the derivative of the glottal flow signal than the amplitude of the flow itself.</Paragraph>
      <Paragraph position="2"> An important representation of the glottal flow is given by the Open Quotient (OQ). OQ is defined as the ratio of the time in which the vocal folds are open to the total length of the glottal cycle. From the spectral domain, it can be formulated empirically as (Fant, 1997):</Paragraph>
      <Paragraph position="4"> where a10a11a12 and a10a11a14 are the amplitudes of the first and the second harmonics of the spectrum.</Paragraph>
      <Paragraph position="5"> Different phonation types, namely, modal voicing, creaking voicing and breathy voicing, differ in the 3In discourse analysis, all the contiguous utterances made by a speaker before the next speaker begins is referred to as a conversational turn.</Paragraph>
      <Paragraph position="6"> amount of time that the vocal folds are open during each glottal cycle. In modal voicing, the vocal folds are closed during half of each glottal cycle; In creaky voicing, the vocal folds are held together loosely resulting in a short open quotient; In breathy voicing, the vocal folds vibrate without much contact thus the glottis is open for a relatively long portion of each glottal cycle.</Paragraph>
      <Paragraph position="7"> For our word fragment detection task, we investigate the following voice quality related features.</Paragraph>
      <Paragraph position="8"> a22 Jitter is a measure of perturbation in the pitch period that has been used by speech pathologists to identify pathological speech (Rosenberg, 1970); a value of 0.01 represents a jitter of one percent, a lower bound for abnormal speech.</Paragraph>
      <Paragraph position="9"> The value of jitter is obtained from the speech analysis tool praat (Boersma and Wennik, 1996). The pitch analysis of a sound is converted to a point process, which represents a sequence of time points, in this case the times associated with the pitch pulses.</Paragraph>
      <Paragraph position="10"> The periodic jitter value is defined as the relative mean absolute third-order difference of the point process.</Paragraph>
      <Paragraph position="12"> (2) where a34a31 is the a24th interval and N is the number of the intervals of the point process. If no sequence of three intervals can be found whose durations are between the shortest period and the longest period, the result is undefined (Boersma and Wennik, 1996). a22 Spectral tilt is the overall slope of the spectrum of a speech or instrument signal. For speech, it is, among others, responsible for the prosodic features of accent, in that a speaker modifies the tilt (raising the slope) of the spectrum of a vowel, to put stress on a syllable. In breathy voice, the amplitudes of the harmonics in the spectrum drop off more quickly as the frequency increases than do in the modal or creaky spectra, i.e. breathy voice has a greater slope than creaky voice. Spectral tilt is measured in decibels per octave. We use a linear approximation of the spectral envelope to measure spectral tilt. The average, minimum, and maximum value of the spectral tilt for the word, and a window before the word boundary are included in the feature set.</Paragraph>
      <Paragraph position="13"> a22 OQ is defined in Equation (1), derived from the difference of the amplitude of the first and the second harmonics of the spectral envelope of the speech data. Studies have shown that the difference between these two harmonics (and thus the OQ) is a reliable way to measure the relative breathiness or creakiness of phonation (Blankenship, 1997).</Paragraph>
      <Paragraph position="14"> Breathy voice has a larger OQ than creaky voice. As an approximation, we used F0 and 2*F0 for the first and the second harmonics in the spectrum. Similar to the spectral tilt, we also computed the average, minimum, and maximum OQ value for a word duration or a window before the boundary.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML