<?xml version="1.0" standalone="yes"?>
<Paper uid="H89-2003">
  <Title>TIMING MODELS FOR PROSODY AND CROSS-WORD COARTICULATION IN CONNECTED SPEECH</Title>
  <Section position="2" start_page="0" end_page="13" type="intro">
    <SectionTitle>
INTRODUCTION
</SectionTitle>
    <Paragraph position="0"> Variation in timing is one of the most pervasive features of speech. It plays a role at all levels. A particular pattern of vowel lengthening, for example, can cue the segmental contrast between [æ] and [ɛ] in 'bad' versus 'bed' and between the following [d] and [t] in 'bad' versus 'bat' (e.g., Nooteboom 1973; Klatt 1976; Raphael 1972). In speech synthesis, manipulating the timing pattern by changing the lengths of acoustic segments can also alter the perceived stress pattern or intonational phrasing of an utterance (e.g., Fry 1958; Klatt 1979; Scott 1982). It is hardly surprising, therefore, that knowledge of segment durations can improve speech recognition. For example, Deng, Lennig, and Mermelstein (1989) have shown that information about vowel interval durations dramatically increases recognition rates in a Hidden Markov Model isolated-word recognition system. Similarly, Lieberman (1960) showed that vowel-interval durations augmented by rudimentary RMS amplitude measures can identify stressed syllables. Using interval durations to parse the stress pattern in this way can drastically reduce the search space in large-vocabulary isolated-word recognition systems (Waibel 1988). Knowing the stress pattern should prove even more crucial to recognition of connected utterances, because of the way that stress interacts phonologically with the phrasing to cue the prosodic organization of the utterance into words and larger phonological units (Nespor and Vogel 1986; Beckman, de Jong, and Edwards 1987). An accurate prediction of assimilations, deletions, and other lenition rules across word boundaries also depends on the phonological phrasing (Nespor and Vogel 1982; Zec and Inkelas 1987).</Paragraph>
    <Paragraph position="1"> If knowledge just of acoustic interval durations can aid recognition in both isolated words and connected speech, what if we were to use finer measures of timing? There are many indications that knowledge of the temporal structure within acoustic segments could improve recognition even more. For example, in addition to being longer and having a lower first formant, [i] (as in 'beat') differs from [ɪ] (as in 'bit') in having a faster, shorter second formant transition that starts later in the syllable (Nearey and Assmann 1986). Other tense-lax vowel pairs also show this difference in spectral kinematics.</Paragraph>
    <Paragraph position="2"> Similarly, in addition to being shorter in overall duration before a word-final voiceless obstruent, vowels tend to have shorter, faster first-formant transitions (Summers 1987). A better understanding of the control of such timing patterns in speech production could lead to more accurate accounts of the kinematic differences and to more wieldy predictions of interactions among the many factors that influence segment-interval duration.</Paragraph>
    <Paragraph position="3"> In the last decade, we have made tremendous advances toward a better understanding of timing control by looking in detail at the kinematics of the articulatory gestures involved in producing speech. Following a proposal by Fowler et al. (1980), speech scientists have worked at applying a general model of motor control originally developed to account for such things as the coordination of flexor and extensor muscles in maintaining gait across different terrains and speeds or the coordination of shoulder and elbow joints in different reaching tasks (e.g., Ostry, Keller, and Parush 1983; Kelso et al. 1985; Saltzman 1986).</Paragraph>
    <Paragraph position="5"> Two recent results of this work seem particularly relevant to achieving better recognition models. One is Browman and Goldstein's (1987) application of their task-dynamic model to explain many common lenitions across word boundaries in casual or fast speech. The other is Beckman, Edwards, and Fletcher's (1989) application of the model in understanding the control of three different lengthening effects associated with slow tempo, phrase-final position, and nuclear sentence stress. In the next two sections, I will describe these two results and their implications for speech recognition in more detail.</Paragraph>
  </Section>
</Paper>