<?xml version="1.0" standalone="yes"?> <Paper uid="H93-1065"> <Title>QUANTITATIVE MODELING OF SEGMENTAL DURATION</Title>
<Section position="1" start_page="0" end_page="0" type="metho"> <SectionTitle> QUANTITATIVE MODELING OF SEGMENTAL DURATION </SectionTitle> <Paragraph position="0"> Jan P. H. van Santen</Paragraph> </Section>
<Section position="2" start_page="0" end_page="0" type="metho"> <SectionTitle> AT&T Bell Laboratories </SectionTitle> <Paragraph position="0"/> </Section>
<Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> ABSTRACT </SectionTitle> <Paragraph position="0"> In natural speech, durations of phonetic segments are strongly dependent on contextual factors. Quantitative descriptions of these contextual effects have applications in text-to-speech synthesis and in automatic speech recognition. In this paper, we describe a speaker-dependent system for predicting segmental duration from text, with emphasis on the statistical methods used for its construction. We also report results of a subjective listening experiment evaluating an implementation of this system for text-to-speech synthesis purposes.</Paragraph> </Section>
<Section position="4" start_page="0" end_page="323" type="metho"> <SectionTitle> 1. INTRODUCTION </SectionTitle>
<Paragraph position="0"> This paper describes a system for prediction of segmental duration from text. In most text-to-speech synthesizer architectures, a duration prediction system is embedded in a sequence of modules, where it is preceded by modules that compute various linguistic features (Footnote 1) from text. For example, the word &quot;unit&quot; might be represented as a sequence of five feature vectors: (< /y/, word-initial, monosyllabic, ... >) · · · (< /t/, word-final, monosyllabic, ... >). In automatic speech recognition, a (hypothesized) phone is usually annotated only in terms of the preceding and following phones. If some form of lexical access is performed, more complete contextual feature vectors can be computed.</Paragraph>
<Paragraph position="1"> [Footnote 1: We define a factor, F_i, to be a partition of mutually exclusive and exhaustive possibilities such as {1-stressed, 2-stressed, unstressed}. A feature is a &quot;level&quot; on a factor, such as 1-stressed. The feature space F is the product space of all factors: F_1 × · · · × F_n. Because of phonotactic and other constraints, only a small fraction of this space can actually occur in a language; we call this the linguistic space.]</Paragraph>
<Paragraph position="2"> Broadly speaking, construction of duration prediction systems has been approached in two ways. One is to use general-purpose statistical methods such as CART (Classification and Regression Trees [1]) or neural nets. In CART, for example, a tree is constructed by making binary splits on factors that minimize the variance of the durations in the two subsets defined by the split [2]. These methods are called &quot;general purpose&quot; because they can be used across a variety of substantive domains.</Paragraph>
<Paragraph position="3"> There also exists an older tradition, exemplified by Klatt [3, 4, 5] and others [6, 7, 8, 9], where duration is computed with duration models, i.e., simple arithmetic models specifically designed for segmental duration. For example, in Klatt's model the duration for feature vector f ∈ F is given by DUR(f) = s_{1,1}(f_1) × · · · × s_{1,n}(f_n) + s_{2,n}(f_n). Here, f_j is the j-th component of the vector f, the second subscript (j) in s_{i,j} likewise refers to this component, and the first subscript (i) refers to the fact that the model consists of two product terms, numbered 1 and 2. The parameters s_{i,j} are called factor scales. For example, s_{1,1}(stressed) = 1.40.</Paragraph>
<Paragraph position="4"> All current duration models have in common that they (1) use factor scales, and (2) combine the effects of multiple factors using only the addition and multiplication operations. The general class of models defined by these two characteristics, sums-of-products models, has been found to have useful mathematical and statistical properties [10].</Paragraph>
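<Paragraph> To make the notation concrete, the following minimal Python sketch (ours, not the paper's; the function name and all scale values are invented for illustration, except s_{1,1}(stressed) = 1.40, which is quoted above) evaluates a two-term sums-of-products model in which the second term covers only the segment-identity factor, playing the role of a Klatt-style minimum duration:
```python
from math import prod

def duration(f, terms):
    """f: tuple of features; terms: list of {factor_index: scale_table}.
    Each term contributes the product of its factor scales."""
    return sum(prod(term[j][f[j]] for j in term) for term in terms)

# Feature vector f = (stress, segment identity).
terms = [
    {0: {"stressed": 1.40, "unstressed": 1.00},  # s_{1,1}: stress scale
     1: {"/ae/": 80.0, "/ih/": 50.0}},           # s_{1,2}: inherent part (ms)
    {1: {"/ae/": 60.0, "/ih/": 40.0}},           # s_{2,2}: minimum part (ms)
]

print(duration(("stressed", "/ae/"), terms))    # 1.40 * 80 + 60 = 172.0 ms
print(duration(("unstressed", "/ih/"), terms))  # 1.00 * 50 + 40 = 90.0 ms
```
The additive and multiplicative models discussed below are the special cases with, respectively, n one-factor terms or a single n-factor term.
</Paragraph>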
<Paragraph position="5"> Briefly, here is how these two standard approaches compare with ours. We share with general-purpose statistical methods the emphasis on formal data analysis methods, and with the older tradition the usage of sums-of-products models. Our approach differs in the following respects. First, although we concur with the modeling tradition that segmental duration data - and in particular the types of interactions one often finds in these data - can be accurately described by sums-of-products models, this class of models is extremely large, so that one has to put considerable effort into searching for the most appropriate model. The few models that this tradition has generated make up a vanishingly small portion of a vast space of possibilities, and because they have not been systematically tested against these other possibilities [11], we should consider the search for better models completely open. Second, in contrast with the general-purpose methods approach, the process by which we construct our prediction system is not a one-step procedure but a multi-step process, with an important role played by various forms of exploratory data analysis.</Paragraph> </Section>
<Section position="5" start_page="323" end_page="325" type="metho"> <SectionTitle> 2. PROPERTIES OF SEGMENTAL DURATION DATA </SectionTitle> <Paragraph position="0"> In this section, we first discuss properties of segmental duration data that pose serious obstacles for prediction, and then properties that may help in overcoming these obstacles.</Paragraph>
<Section position="1" start_page="323" end_page="324" type="sub_section"> <SectionTitle> 2.1. Interactions between contextual factors </SectionTitle>
<Paragraph position="0"> A first reason duration prediction is difficult is that segmental duration is affected by many interacting factors. In a recent study, we found eight factors to have large effects on vowel duration [12], and if one were to search the literature for all factors that at least one study found to have statistically significant effects, the result would be a list of two dozen or more factors [13, 14, 15].</Paragraph>
<Paragraph position="1"> [Table 1: Durations of intervocalic consonants in two stress conditions: unstressed/stressed and stressed/unstressed.]</Paragraph>
<Paragraph position="2"> These factors interact in the quantitative sense that the magnitude of an effect (in ms or in percent) is affected by other factors. Table 1 shows durations of intervocalic consonants in two contexts defined by syllabic stress: preceding vowel unstressed and following vowel stressed (/f/ in &quot;before&quot;), and preceding vowel stressed and following vowel unstressed (/f/ in &quot;buffer&quot;; /t/ is usually flapped in this context). The table shows that the effects of stress are much larger for some consonants than for others: a consonant × stress interaction. Other examples of interactions include postvocalic consonant × phrasal position and syllabic stress × pitch accent [12].</Paragraph>
<Paragraph position="3"> These interactions imply that segmental duration can be described neither by the additive model [9] (because the differences vary) nor by the multiplicative model [7] (because the percentages vary). [Footnote 5: In the additive model, DUR(f) = s_{1,1}(f_1) + · · · + s_{1,n}(f_n); in the multiplicative model, DUR(f) = s_{1,1}(f_1) × · · · × s_{1,n}(f_n).] In contrast, the Klatt model was specifically constructed to describe certain interactions, in particular the postvocalic consonant × phrasal position interaction. However, in an effort to use the Klatt model for text-to-speech synthesis, it became clear that this model needed significant modifications to describe interactions involving other factors [5]. Recent tests further confirmed systematic violations of the model [11].</Paragraph>
<Paragraph position="4"> Thus, the existence of large interactions is undeniable, but current sums-of-products models have not succeeded in capturing these interactions. General-purpose prediction systems such as CART, of course, can handle arbitrarily intricate interactions [16].</Paragraph>
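<Paragraph> The following small numerical check (ours; the durations are invented, merely shaped like Table 1) shows why a quantitative interaction defeats both models of the footnote above: the least-squares additive fit to a two-way table is row mean + column mean - grand mean, whatever it cannot absorb is interaction, and applying the same test to log durations checks the multiplicative model.
```python
import numpy as np

# Rows: consonants (/t/, /f/); columns: stress contexts
# (unstressed->stressed, stressed->unstressed). Values are invented.
D = np.array([[120.0, 60.0],   # /t/: large stress effect
              [ 70.0, 55.0]])  # /f/: small stress effect

def interaction_residual(M):
    # Least-squares additive fit a(row) + b(col) equals
    # row mean + column mean - grand mean; the remainder is interaction.
    fit = M.mean(axis=1, keepdims=True) + M.mean(axis=0, keepdims=True) - M.mean()
    return M - fit

print(interaction_residual(D))          # nonzero: additive model fails
print(interaction_residual(np.log(D)))  # nonzero: multiplicative model fails
```
</Paragraph>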
<Paragraph position="5"> 2.2. Lopsided sparsity Because there are many factors - several of which have more than two values - the feature space is quite large. The statistical distribution of the feature vectors exhibits an unpleasant property that we shall call &quot;lopsided sparsity&quot;: the number of very rare vectors is so large that even in small text samples one is virtually assured of encountering at least one of them.</Paragraph>
<Paragraph position="6"> [Table 2: Counts of contextual vectors for various sample sizes.]</Paragraph>
<Paragraph position="7"> Table 2 illustrates the concept. We analyzed 797,524 sentences, names, and addresses (total word token count: 5,868,172; total segment count: 22,249,882) by computing for each segment the feature vector characterizing those aspects of the context that we found to be relevant for segmental duration. This characterization is relatively coarse and leaves out many distinctions (such as - for vowel duration - the place of articulation of postvocalic consonants). Nevertheless, the total feature vector type count was 17,547. Of these 17,547 types, about 10 percent occurred only once in the entire data base and 40 percent occurred less than once in a million.</Paragraph>
<Paragraph position="8"> Two aspects of the table are of interest. The second column shows that once the sample size exceeds 5,000, the type count increases linearly with the logarithm of the sample size, with no signs of deceleration. In other words, although the linguistic space is certainly much smaller than the feature space, it is unknown whether its size is 20,000, 30,000, or significantly larger than that. The third column shows that even in samples as small as 320 segments (the equivalent of a small paragraph) one can be certain to encounter feature vectors that occur only once in a million segment tokens.</Paragraph>
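<Paragraph> The bookkeeping behind a table like Table 2 is easy to reproduce once segments have been mapped to contextual feature vectors. Here is a sketch (ours; the Zipf-like synthetic stream merely stands in for real vectors, whose distribution the paper estimates from 22 million segments) that counts vector types, singleton types, and type growth with sample size:
```python
from collections import Counter
import random

random.seed(0)
# Synthetic stream of feature-vector types with a skewed, Zipf-like
# distribution, standing in for the contextual vectors of real segments.
types = [f"v{i}" for i in range(20_000)]
weights = [1.0 / (i + 1) for i in range(len(types))]
stream = random.choices(types, weights=weights, k=1_000_000)

counts = Counter(stream)
singletons = sum(1 for c in counts.values() if c == 1)
print(f"{len(counts)} types, {singletons / len(counts):.0%} seen exactly once")

# For skewed sources the type count grows roughly linearly in log(sample size).
for n in (320, 5_000, 80_000, 1_000_000):
    print(f"{n:>9} segments -> {len(set(stream[:n]))} types")
```
</Paragraph>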
<Paragraph position="9"> It is often suspected that general-purpose prediction systems can have serious problems with frequency imbalance in the training set, in particular when many feature vectors are outright missing. Experiments performed with CART confirmed this suspicion. In a three-factor, 36-element feature space, with artificial durations generated by the Klatt model, we found that removing 66 percent of the feature vectors from the training set produced a CART tree that performed quite poorly on test data. Similarly, neural nets can have the property that decision boundaries are sensitive to the relative frequencies of feature vectors in the training sample (e.g., [17]), thereby leading to poor performance on infrequent vectors.</Paragraph>
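<Paragraph> The experiment just described can be approximated in a few lines. The sketch below (ours) uses scikit-learn's DecisionTreeRegressor as a stand-in for CART - the paper predates this library, and every factor size and scale value here is invented - generating durations from a Klatt-style model over a 36-vector space, withholding 66 percent of the vectors from training, and scoring the tree on the withheld vectors:
```python
import itertools
import random
import numpy as np
from sklearn.tree import DecisionTreeRegressor

random.seed(1)

# Three-factor, 36-element feature space (3 x 3 x 4), integer-coded.
space = list(itertools.product(range(3), range(3), range(4)))

# Klatt-style artificial durations: one product term over all factors
# plus a segment-dependent minimum (all scale values invented).
s1, s2, s3 = [1.0, 1.3, 1.6], [1.0, 1.2, 1.5], [40, 55, 70, 90]
dur = {f: s1[f[0]] * s2[f[1]] * s3[f[2]] + 0.5 * s3[f[2]] for f in space}

# Remove 66 percent of the feature vectors from the training set.
train = random.sample(space, k=len(space) // 3)
test = [f for f in space if f not in train]

tree = DecisionTreeRegressor(random_state=0).fit(
    np.array(train), [dur[f] for f in train])
errors = np.abs(tree.predict(np.array(test)) - np.array([dur[f] for f in test]))
print(f"mean absolute error on unseen vectors: {errors.mean():.1f} ms")
```
</Paragraph>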
<Paragraph position="10"> The key reason for these difficulties is that the ability to accurately predict durations for feature vectors for which the training set provides few or no data points is a form of interpolation, which in turn requires assumptions about the general form of the mapping from the feature space onto durations (the response surface). Precisely because they are general-purpose, these methods make minimal assumptions about the response surface, which in practice often means that the duration assigned to a missing feature vector is left to chance. For example, in CART an infinitesimal disturbance can have a major impact on the tree branching pattern. Even when this has little effect on the fit of the tree to the training data, it can have large effects on which duration is assigned to a missing feature vector. In subsection 2.4, we will argue that the response surface for segmental duration can be described particularly well by sums-of-products models, so that these models are able to generate accurate durations for (near-)missing feature vectors.</Paragraph>
<Paragraph position="11"> It should be noted that for certain applications, in particular automatic speech recognition, poor performance on infrequent feature vectors need not be critical, because lexical access can make up for errors. Current implementations of text-to-speech synthesis systems, however, do not have error correction mechanisms. Having a seriously flawed segmental duration every few sentences is not acceptable.</Paragraph> </Section>
<Section position="2" start_page="324" end_page="325" type="sub_section"> <SectionTitle> 2.3. Text-independent variability </SectionTitle>
<Paragraph position="0"> A final complicating aspect of segmental duration is that, given the same input text, the same speaker (speaking at the same speed, and with the same speaking instructions) produces durations that are quite variable. For example, we found that vowel duration had a residual standard deviation of 21.4 ms, representing about 15 percent of average duration.</Paragraph>
<Paragraph position="1"> This means that one needs either multiple observations for each feature vector, so that statistically stable mean values can be computed, or data analysis techniques that are relatively insensitive to statistical noise.</Paragraph>
<Paragraph position="2"> In large linguistic spaces, text-independent variability implies that training data may require tens of thousands of sentences, even if one uses text selection techniques that maximize coverage, such as greedy algorithms [20]. And even such texts will still contain serious frequency imbalances.</Paragraph>
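<Paragraph> Greedy coverage maximization of the kind cited above can be sketched in a few lines (our own minimal version; sentence identifiers and type sets are hypothetical): at each step, select the sentence contributing the most feature-vector types not yet covered.
```python
def greedy_cover(sentences, budget):
    """sentences: dict mapping sentence id -> set of feature-vector types."""
    covered, chosen = set(), []
    for _ in range(budget):
        best = max(sentences, key=lambda s: len(sentences[s] - covered))
        gain = sentences[best] - covered
        if not gain:      # nothing new anywhere: stop early
            break
        chosen.append(best)
        covered |= gain
    return chosen, covered

# Toy example: three sentences over five feature-vector types.
sents = {"s1": {"a", "b"}, "s2": {"b", "c", "d"}, "s3": {"d", "e"}}
chosen, covered = greedy_cover(sents, budget=2)
print(chosen)  # ['s2', 's1']: s2 adds three new types, then s1 adds 'a'
```
Even with such a selection, rare vector types force the required sample size up quickly, which is the point made above.
</Paragraph>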
<Paragraph position="3"> 2.4. Ordinal patterns in data A closer look at the interactions in Table 1 reveals that they are, in fact, quite well-behaved, as shown by the following patterns:</Paragraph>
<Paragraph position="4"> Pattern 1. The durations in the first column are always larger than those in the second column.</Paragraph>
<Paragraph position="5"> Pattern 2. The effects of stress - whether measured as differences or as percentages - are always larger for alveolars than for labials in the same consonant class (i.e., having the same manner of production and voicing feature).</Paragraph>
<Paragraph position="6"> Pattern 3. Within alveolars and labials, the effects of stress (measured as differences) have the same order over consonant classes (voiceless stop bursts largest, voiced stop bursts smallest).</Paragraph>
<Paragraph position="7"> Pattern 4. However, the order of the durations of the consonants is not the same in the two stress conditions. For example, /t/ is longer than /n/ in the first column, but much shorter in the second column.</Paragraph>
<Paragraph position="8"> This pattern of reversals and non-reversals, or ordinal pattern, can be captured by the following sums-of-products model: DUR(C, P, S) = s_{1,1}(C) × s_{1,2}(P) × s_{1,3}(S) + s_{2,1}(C) × s_{2,2}(P). Here, C is consonant class, P is place of articulation, and S is stress condition; it is assumed that factor scales have positive values only. It is easy to show that this model implies Patterns 1-3 (for differences). Pattern 4 is not in any way implied by the model, but can be accommodated by appropriate selection of factor scale values. This accommodation would not be possible if the second term were absent.</Paragraph>
<Paragraph position="9"> There are many other factors that exhibit similarly regular ordinal patterns [11, 12, 18]. In general, factors often interact, but the interactions tend to be well-behaved, so that the response surface can be described by simple sums-of-products models.</Paragraph>
<Paragraph position="10"> Now, showing that an ordinal pattern can be captured by a sums-of-products model does not imply that there aren't many other types of models that can accomplish the same. Intuitively, it would appear that ordinal patterns are not terribly constraining. However, there exist powerful mathematical results that show this intuition to be wrong [19]. For example, there are results showing that if data exhibit a certain ordinal pattern, then we can be assured that the additive model will fit. Similar results have been shown for certain classes of sums-of-products models (see [19], Ch. 7). Taken together, these results make it quite plausible that when data exhibit the types of ordinal patterns often observed in segmental duration, some sums-of-products model will fit the data.</Paragraph>
<Paragraph position="11"> To really make the case for the importance of ordinal patterns, we must make the further key assumption that the ordinal patterns of the response surface discovered in the training data base can be found in the language in general (restricted to the same speaker and speaking mode). This assumption is based on the belief that the structure discovered in the data is the result of stable properties of the speech production apparatus. For example, the non-reversal of the syllabic stress factor can be linked to the supposition that stressed syllables are pronounced with more subglottal pressure, increased tension of the vocal cords, and larger articulatory excursions than unstressed syllables. A systematic by-product of these differences would be a difference in timing.</Paragraph> </Section> </Section>
<Section position="6" start_page="325" end_page="325" type="metho"> <SectionTitle> 3. SYSTEM CONSTRUCTION </SectionTitle> <Paragraph position="0"> We now describe construction of a duration prediction system based on sums-of-products models.</Paragraph>
<Section position="1" start_page="325" end_page="325" type="sub_section"> <SectionTitle> 3.1. Training data </SectionTitle>
<Paragraph position="0"> The data base is described in detail elsewhere [12]. A male American English speaker read 2,162 isolated, short, meaningful sentences. The utterances contained 41,588 segments covering 5,073 feature vector types. Utterances were screened for disfluencies and re-recorded until none were observed. The database was segmented manually, aided by software that displays the speech wave, spectrogram, and other acoustic representations. Manual segmentation was highly reliable, as shown by an average error of only 3 ms (obtained by having four segmentors independently segment a set of 38 utterances).</Paragraph>
<Paragraph position="1"> 3.2. Category structure First, we have to realize that modeling segmental duration for the entire linguistic space with a single sums-of-products model is a lost cause, because of the tremendous heterogeneity of this space in terms of articulatory properties and phonetic and prosodic environments. For example, the factor &quot;stress of the surrounding vowels&quot; was shown to be a major factor affecting durations of intervocalic consonants; however, this factor is largely irrelevant for the - barely existing - class of intervocalic vowels. Thus, we have to construct a category structure, or tree, that divides the linguistic space into categories, and develop separate sums-of-products models for these categories. In our system, we first distinguish between vowels and consonants. Next, for consonants, we distinguish between intervocalic and non-intervocalic consonants. Non-intervocalic consonants are further divided into consonants occurring in syllable onsets vs. non-phrase-final syllable codas vs. phrase-final syllable codas. Finally, all of these are split up by consonant class. Note that construction of this category structure is not based on statistical analysis but on standard phonetic and phonological distinctions.</Paragraph>
<Paragraph position="2"> 3.3. Factor relevance and distinctions For each category (e.g., non-intervocalic voiceless stop bursts in syllable onsets), we perform a preliminary statistical analysis to decide which factors are relevant and which distinctions to make on these factors (see [12] for details).</Paragraph>
<Paragraph position="3"> 3.4. Model selection We already hinted that the number of distinct sums-of-products models increases sharply with the number of factors; for example, for five factors there are more than 2 billion sums-of-products models, and for the eight factors we used for modeling vowel duration there are more than 10^76 models. Thus, in cases with more than three or four factors, it is computationally unattractive to fit all possible models and select the one that fits best. Fortunately, there are methods that allow one to find the best model with far less computational effort [10, 11] - requiring only 31 analyses (each the computational equivalent of an analysis of variance) for five factors. These methods are &quot;diagnostic&quot; because they can detect trends in the data that eliminate entire classes of sums-of-products models from consideration.</Paragraph> </Section>
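<Paragraph> The counts just quoted are reproducible under one natural reading of the definition in [10] (our reading; this paper does not spell it out): identify a sums-of-products model with a family of product terms, i.e., nonempty subsets of the n factors whose union covers all factors, and count such families by inclusion-exclusion over the factors left uncovered.
```python
from math import comb

def n_sop_models(n: int) -> int:
    """Number of families of nonempty subsets of {1..n} whose union is
    {1..n}; each such family defines one sums-of-products model."""
    return sum((-1) ** j * comb(n, j) * 2 ** (2 ** (n - j) - 1)
               for j in range(n + 1))

print(n_sop_models(5))           # 2147321017 -> "more than 2 billion"
print(f"{n_sop_models(8):.2e}")  # ~5.79e76   -> "more than 10^76"
```
</Paragraph>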
<Section position="2" start_page="325" end_page="325" type="sub_section"> <SectionTitle> 3.5. Parameter estimation </SectionTitle> <Paragraph position="0"> Once a sums-of-products model is selected, its parameters are estimated with a weighted least-squares method using a simple parameter-wise gradient technique.</Paragraph> </Section> </Section> </Paper>