<?xml version="1.0" standalone="yes"?>
<Paper uid="W97-1201">
  <Title>Probabilistic Model of Acoustic/Prosody/Concept Relationships for Speech Synthesis</Title>
  <Section position="3" start_page="2" end_page="2" type="intro">
    <SectionTitle>
2 Probabilistic Mapping of Acoustic/Prosody/Concept Relationships
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.1 Model Formalism
</SectionTitle>
      <Paragraph position="0"> If the task of automatic speech synthesis can be framed as that of selecting the most probable acoustic production associated with a text string annotated for intended meaning, it can be achieved by finding ar gmaxxp( x\]meaning ) . (1) That is, by finding the acoustic sequence x that maximizes the joint probability of the acoustic sequence given the meaning or concept the speaker wishes to convey. For this work, the word string itself is assumed to have been generated in a prior step. Therefore, the acoustic sequence would be a sequence of prosody related supra-segmental features. The term meaning is quite open to interpretation. For the purposes of this work, meaning will be assumed to be straightforwardly encoded in the information structure, that is, the syntax and semantic structure of the utterance. For the present, information structure (denoted by I in the equations below) represents some aspects of the underlying concept that are covered by theory (such as syntax) or description (such as an instantiation of a focussed constituent).</Paragraph>
      <Paragraph position="1"> This structure will be represented as annotations on the text string, which can serve as input to a speech synthesis system. What features, beyond syntax, are relevant and can be reliably annotated is, of course, an open research question, that will probably be answered over time. Some suggestions for features and a method for incorporating them in this probabilistic model are described in Section 3. So, returning to the model derivation, we see that we wish to find argmaxxp(x\]I), (2) that is, the acoustic sequence x that maximizes the probability of the acoustics given the informational annotation. Inserting prosody as an intermediary representation, we can re-write this conditional probability of the acoustics given the information structure in terms of the sequence of abstract prosodic labels a.</Paragraph>
      <Paragraph position="3"> This sequence of abstract prosodic labels describe an ordered set of prosodic events, such as prominence and phrasing.</Paragraph>
      <Paragraph position="4"> In order to capture these prosodic events in a computational model, this work uses prosodic labels based on the ToBI transcription convention (--Tones and Break Indices, see (Silverman et.al., 1992)). The ToBI system labels prosodic prominence with a pitch accent type (Tone), from a subset of Pierrehumbert's inventory (Pierrehumbert80, 1980). In previous work, data with this level of detail was not available so the simple label of +- prominence on each syllable was used. Prosodic phrasing is captured by placing a break index (BI: 0 to 4) at each word juncture, to indicate the level of de-coupling between the words. For example, the juncture between two words in a clitic group would be labeled with a 0 break index. At the other end of the spectrum, the junction between two words separated by a major prosodic phrase would be labeled with a 4 break index. Moreover, the ToBI system allows one to label prosodic events that are conspicuous in spontaneous speech. For example, significant lengthening of a final syllable, without the comcommittant intonational cues associated with prosodic phrasing can be labeled with a diacritic. If these events serve a communicative purpose in informal speech (marking focus, holding the floor) then future models may make use of these labels.</Paragraph>
      <Paragraph position="5"> Applying Bayes' Law to Equation 3:</Paragraph>
      <Paragraph position="7"> This form of the equation more clearly reflects the use of prosody as an intermediate representation, relating the acoustic sequence to the information structure by modeling p(xlI ) in terms of the conditional probability of acoustic sequence given the prosodic structure (p(xla)) and the conditional probability of prosody given the information structure (p(alI)). The model parameters can be estimated using statistical methods: in this work, using an automatically derived binary decision trees.</Paragraph>
      <Paragraph position="8"> In speech synthesis applications, each prosodic event can be realized by manipulating the acoustic signal according to a set of context-based f0 and duration rules (vanSanten, 1993). For example, a syllable labeled with a high pitch accent (H* or simply +prominence from this work), would be given a rise/fall f0 contour, adjusted according to e.g. the duration of the vowel, etc. In such applications, p(x/a), the probability of the acoustic parameters given the prosodic labels, is fully determined by these rules alone. Although a stochastic model of prosody and acoustic features could be useful in more specifically determining the correlates of prosody in f0, duration and vowel quality, it is left to future work to incorporate a probabilistic acoustic model of prosody in the synthesis algorithm. Therefore, with this simplification, the model equation becomes: null</Paragraph>
      <Paragraph position="10"> assuming a Markov dependency.</Paragraph>
      <Paragraph position="11"> Here \[al ...an\] is the sequence of n prosodic labels in the utterance. The task remains, therefore, to find the sequence of prosodic labels with the highest probability given the underlying concept to be conveyed I.</Paragraph>
    </Section>
    <Section position="2" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.2 Decoding Algorithms
</SectionTitle>
      <Paragraph position="0"> Several decoding algorithms have been proposed to find the best sequence a. If prosodic labels are assumed to be independent, as in Equation 6, the highest probability sequence will be the sequence of highest probability elements p(ai\]I).</Paragraph>
      <Paragraph position="1"> Note that, until this point, no assumptions have been made about the dependence or independence of the sequences of acoustic, prosodic or informational events. Such assumptions are certainly incorrect. For example, both Prevost(Prevost, 1996) and Selkirk(Selkirk, 1997) propose prosodic structure that involves a combination of prominence and phrase boundary placement to cue meaning-specific speech renditions. Furthermore, though useful to simplify the decoding problem, independence assumptions are probably not viable once demands on spoken language systems become more sophisticated. null Several prosodic models have been developed that relax this assumption. For example, work in predicting prominence by (Ross et.al., 1992) makes use of a Markov assumption. Also, a hierarchical model (Ostendorf and Veilleux, 1993), makes use of the strict layer hypothesis of prosodic phrase structure, assuming that a well-formed utterance is comprised of major phrases, which are in turn comprised of minor phrases. In that work, a dynamic programming algorithm (see Figure 2.2) proposes a major phrase within an utterance, uses hypothetical minor phrases within that major phrase to estimate its probability and then, finally, choses the most likely sequence of major phrases. The most likely hypothesis of minor phrases within each major phrase is determined by using probability estimates from a previously derived binary decision tree. As we will see in the next section, the binary decision tree provides estimates of (p(ak\[T(Wi,mprev))), that is, the probability of a particular prosodic event (phrase break indices here), given the word sequence and the previous minor phrase.</Paragraph>
      <Paragraph position="2"> Perceptual experiments have been performed to try to investigate whether the hierarchical model described above improves the intelligibility of synthesized speech. Evaluations of this sort are notably difficult, but necessary. Instead of trying to evaluate the &amp;quot;naturalness&amp;quot; of synthetic speech, Silverman et.al. (Silverman, 1993; Boogaartand Silverman, 1992) have suggested a transcription or responsetype task to evaluate comprehensibility. Improved comprehensibility should manifest itself as improved transcription performance. However, in the perceptual experiment used to evaluate the hierarchical model, subjects performed similarly well on a transcription task designed to compare three different prosodic phrase break models on an AT&amp;T TTS system (the AT&amp;T default, the hierarchical model and a random generator)2(.62-.67 correct, a 2 ,~ 0.5), including randomly placed breaks. Informal discussion with human subjects revealed that the task was considered very difficult. Although several subjects claimed to have understood the sentences, they said that they didn't have enough time to transcribe the sentence.</Paragraph>
      <Paragraph position="3"> In addition to transcribing the synthetic sentences, subjects were also asked to check which, of five adjectives (&amp;quot;choppy&amp;quot;, &amp;quot;okay&amp;quot;, had &amp;quot;not enough pauses&amp;quot;, or &amp;quot;unnatural&amp;quot;), best described the phrasing of the sentences. The results of this experiment for twenty subjects are tabulated on Table 1. Overall, more hierarchical model sentences were judged to be &amp;quot;okay&amp;quot;. However, hierarchical model sentences were judged to be &amp;quot;choppy&amp;quot; more often than AT&amp;T</Paragraph>
      <Paragraph position="5"> Save pointers to best previous break location s.</Paragraph>
      <Paragraph position="6"> To find the most likely sequence, p(Ui\[)/~)i, Ui_l ) --- maxn logpt, (U/I,---, Uin\[~4)i, Ui-1 ) + log q(nlli ). Here, the probability q(n\[li) provides a length constraint.</Paragraph>
      <Paragraph position="7"> sentences, which in turn were more often considered to have &amp;quot;not enough pauses&amp;quot;. Interestingly, the hierarchical model sentences were not considered more choppy than the random model's sentences, despite the minor phrase breaks generated by the hierarchical model in addition to major breaks. Moreover, the hierarchical model was given significantly more &amp;quot;Okay&amp;quot; ratings than the other two models.</Paragraph>
      <Paragraph position="8"> Based on other evaluation metrics, the performance of the hierarchical model is believed to be of higher quality in general text-to-speech tasks. Although there are specific syntactic structures that are consistently problematic (e.g. particles and prepositions), improved POS labeling and additional rules in text-processing can easily reduce these problems. Furthermore, this work did not take into consideration discourse or non-syntax semantic information, and did not alter the AT&amp;T defaults for placing prominences (pitch accents).</Paragraph>
    </Section>
    <Section position="3" start_page="2" end_page="2" type="sub_section">
      <SectionTitle>
2.3 Parameter Estimation Using Binary Decision Trees
</SectionTitle>
      <Paragraph position="0"> For the hierarchical model, or any future model based on the more general acoustic/prosody/syntax formalism presented here, we need to have estimates for p(ai/Z) in order to decode the most probable prosodic sequence. Binary decision trees are used in this work to estimate p(ajZ) for several reasons, and are a main feature supporting the generality and adaptability of the overall model. First, binary trees can be used to map heterogeneous features to prosodic labels. Features can be continuous (e.g. degree of syntactic bracketing) or discrete (e.g.</Paragraph>
      <Paragraph position="1"> classes of function word types). Furthermore, the features can be inter-dependent, such as location of the last predicted phrase boundary or prominence.</Paragraph>
      <Paragraph position="2"> The automatic algorithm which determines the tree structure typically chooses features to minimize mis-classification of prosodic labels, and as such, can indicate how relevant a feature is in the choice of a particular prosodic label. Finally, and most importantly from a theoretical standpoint, a decision tree presents a model of the relationship between one domain, e.g. information structure (or alternatively acoustics) and another domain, e.g. prosody. As shown below, the path from root node to leaf is unique for each token in the input sequence and describes a prosodic label as a function of the tree features. (Let T(Z) represent a function T of the information structure Z). This model therefore allows us to map information structure onto prosodic structure, or, alternatively, to map acoustic parameters onto prosodic structure.</Paragraph>
      <Paragraph position="3"> Binary decision trees, like the one shown in Figure 1, are a series of binary questions about features (text-derived syntactic, grammatical and pragmatic properties in this case). Each data token (a data token would be a word pair in trees designed to predict prosodic phrase breaks and a single syllable to predict :i: prominence), is &amp;quot;dropped&amp;quot; through the tree, starting with the root question. In the example shown here, a binary decision tree has been grown to predict the phrase break index between each word pair. The root node represents the question Is the left word a content word and the right word a function word? If the answer is no, the word pair is dropped to the lower right node, and examined in the light of that node's question (What function word class is the right word?). The process is repeated until the data token reaches a leaf (terminal node). In some cases (Wang and Hirschberg, 1992) the leaf node would be associated with a specific prediction, e.g. phrase break index 1, and the word pairs shunted to this leaf node would be pre- null /~acb data token (a word pair) is shunted along a path from the root to a leaf node. Features include syntaztic derivations (e.g. dom_Ift means &amp;quot;what is the highest syntactic constituency that dominates the left word but not the right word&amp;quot;), word class type, degree of syntactic bracketing, relative position in the syntax tree and relative position in the sentence. The word pairs from the sentence The men won over their enemies are shown under their destination leaves.</Paragraph>
      <Paragraph position="4"> Table h Results for subjective judgments by 20 human subjects for nine sentences synthesized using each of three prosodic phrase break prediction algorithms (AT~T default, hierarchical model, and randomly placed bi'eaks).</Paragraph>
      <Paragraph position="5">  dicted to be separated by a typical intra-phrase word break. However, the approach taken here differs in that trees are not used directly to predict phrase breaks or prominences. Instead the trees are used to generate conditional probability distributions (in this example, p(break indexi\[T(syntax))). The distributions are then used in the computational model described above to find the most likely sequence of prosodic labels. Again, the advantage of decoding the entire sequence is that one is able to explicitly make use of the inter-dependencies in the assignment of specific prosodic labels as in (Ostendorf and Veilleux, 1993). Furthermore, focus might be marked by a confluence of prominence and phrasing (Selkirk, 1997), requiring a relaxation of the assumption that these events are independent.</Paragraph>
      <Paragraph position="6"> In order to see how a decision tree is used as a conditional probability estimator rather than as a predictor, recall how such a tree is originally constructed. Binary decision trees, like most statistical models, are derived using labeled training data. In this case, the training data would be hand-labeled prosodically (ToBI) and would be associated with features automatically extracted for each data token. The tree in the example above was derived to estimate the conditional probability of the prosodic  To generate this p(bi\[T(syntax)) tree, data tokens (word pairs) were hand-labeled for prosodic break indices, analyzed by an automatic parser and other feature extraction programs equipped with e.g.</Paragraph>
      <Paragraph position="7"> function word tables and dictionaries, to produce a labeled database. A similar tree was also generated to estimate the probability of =k prominence, given similar syntactic/pragmatic features. To estimate the conditional probabilities between prosodic labels and acoustic signal, trees have also been derived using acoustic features such as normalized f0, duration, and vowel quality (Wightman, 1991).</Paragraph>
      <Paragraph position="8"> After the database is fully labeled, an automatic tree-growing algorithm partitions this training data based on one of these extracted features (in Figure 1 the cw-fw feature) into two subsets, each more homogeneous with respect to break indices than the whole set 3. The partitioning process is repeated on each of the subsets, using one of the extracted features and each time producing children subsets that are more homogeneous than their parents. Ideally, the final subsets, which share a particular syntactic/ pragmatic context, would contain data where all word pairs had been labeled with the same break index, prompting only a single prediction. However, this is unlikely, since all factors that determine prosody are either not yet understood, such as focus structure, or can not be known from text, such as speaking rate, and thus are not used in the partitioning process. Instead, each of the final subsets has a distribution of prosodic labels that can be used to estimate a probability distribution p(a\[leafi ). In this example, each leaf represents a unique path which serves to describe a distribution of prosodic labels as a function of the syntax and related features. Therefore p(alleaf i ) = p(alW(syntax)).</Paragraph>
      <Paragraph position="9"> One interesting observation from the automatic design of the decision tree shown in Figure 1 is the selection of the cw-fw feature at the root. Sorin (Sorin et.al, 1987) has shown that prosodic phrase breaks in French tend to correspond with just cw-fw junctions. Although this rule over-generates phrase boundaries in English, the choice of this feature in the decision tree indicates a persistent correlation.</Paragraph>
      <Paragraph position="10"> ~In this case, words pairs spanning a content-function word boundary have been found to be associated with intonational phrase breaks (indices 3 and 4) and the data set that contains word pairs that are cw-~T pairs has a higher percentage of 3/4 break index labels than the whole data set.</Paragraph>
      <Paragraph position="11"> 3 Use of semantic features in the prosody/concept mapping Previous work has made use of this probabilistic model of the relationships between prosody, the acoustic signal and information structure (Veilleux, 1996) but only insofar as information structure could be captured using syntax and related features. From the examples above, it is clear that prosody, even prosodic phrase structure, is not constrained by syntax alone. What remains to be investigated and incorporated are other factors that constrain prosody and are related to the concept the speaker intends to convey.</Paragraph>
      <Paragraph position="12"> One such feature of the information structure is the placement of focussed constituents in an utterance. The literature presents a variety of definitions of semantic focus, some describing focus in terms of semantic intent (Rooth, 1994; Gussenhoven, 1994) and others more directly in relationship to givenhess (Schwarzschild, 1997). Furthermore, some definitions of focus overlap with theme (Prevost, 1996), while others do not. In any case, focus is generally agreed to be linked to pitch accent placement (e.g.</Paragraph>
      <Paragraph position="13"> (Selkirk, 1984)) and probably to phrase break placement as well (Selkirk, 1997)).</Paragraph>
      <Paragraph position="14"> This focus/prosody relationship presents an opportunity to generate synthetic speech that has a more appropriately assigned prosodic structure, reflecting the underlying meaning to be conveyed. As the mapping between prosody and focus is investigated more fully, results can be incorporated into the computational model presented here by simply representing focus markings as labeled features in the binary tree.</Paragraph>
      <Paragraph position="15"> Some promising work by Selkirk (Selkirk, 1997) describes the choice of prosodic phrase structure to be the outcome of an ordering of competing factors, including focus, syntax and pragmatic constraints.</Paragraph>
      <Paragraph position="16"> While some constraints may be violable (such as the alignment of major prosodic breaks with syntactic boundaries), the outcome is optimal, i.e. it conforms to the constraints of the highest ranked factor (e.g.</Paragraph>
      <Paragraph position="17"> the alignment of a major prosodic phrase boundary with the right edge of a focussed constituent).</Paragraph>
      <Paragraph position="18"> Previous use of the acoustic/prosody/syntax model has already established the function of syntactic edges to predict prosodic phrasing (note the use of e.g. dom..lft features in the tree given in Figure 1). Labeling the right edges of focussed constituents in training data and growing a binary decision tree based on this additional feature, will generate probability distributions as functions of the focus as well as syntactic and pragmatic structure. If the supposition about the relationship between focussed constituents and prosodic boundaries is represented in data, such a feature should be selected as useful in decreasing the mis-classification of prosodic phrase break indices between two words in the automatic design of a binary tree. Moreover, a ranking implies an interaction of factors, each of which can be encoded as binary tree features. The order in which the features are selected in the tree structure, as well as their co-occurance (or lack of) on a root-leaf path, can indicate potential areas of interaction or redundancy. In this way, binary decision trees not only generate conditional probabilities for synthesis models, but also test hypotheses about the relative use of a feature in predicting a prosodic label.</Paragraph>
      <Paragraph position="19"> Work by Prevost explicitly addresses the relationship between theme-rheme and prosodic prominence and phrase placement in cases of explicit contrast.</Paragraph>
      <Paragraph position="20"> This work significantly extends previous heuristics concerning newness and pitch accent placement.</Paragraph>
      <Paragraph position="21"> Again, training data labeled with theme-rheme notation, and devising a feature for the decision tree growing algorithm to select, would incorporate this rule in the estimation of probabilities of prosodic structure.</Paragraph>
      <Paragraph position="22"> Another active and related area of research that addresses the relationship between higher order linguistic structure and prosodic structure has been explored by (Terken and Hirshberg, 1994) and (Nakatani, 1993). The latter work examines the placement of accents, as constrained by the interaction of discourse, surface structure and lexical form. Pitch accent placement on pronouns as well as on explicit forms in the subject position motivate theory that describes new and givenness in terms of a hierarchical discourse structure (Grosz and Sidner 1986). Again, the implications of this theoretical framework can be extracted as features for generating conditional probabilities of prosodic events, with reference to the theory. One such feature could be an annotation of discourse segmentation in the input text. Using this annotation as a feature in the binary tree would also serve to allow the tree to chose limits on how far back in the history list to look for an antecedent. If the pronoun represents a new (re-introduced) item within this window, it may be more likely to be accented. Again, as a feature in a decision tree, this property would be a candidate for selection to minimize mis-classification error and generate conditional probabilities that are functions of the discourse environment.</Paragraph>
      <Paragraph position="23"> In summary, the formalism for incorporating emerging linguistic theory in a joint model of the acoustic/prosody/concept relationships is described here. It makes use of binary decision trees to estimate model parameters, the conditional probabilities. The binary decision trees themselves, make use of explicit linguistic infdrmation to partition data into more homogeneous prosodic contexts. In doing so, the model remains general, and can accommodate the results of our evolving understanding of the interaction between factors that determine prosody.</Paragraph>
      <Paragraph position="24"> \Y=hile this model has been successful in both speech synthesis and analysis applications, it has made use of syntactic and pragmatic information alone. Extension of this model to map prosodic structure to other higher order linguistic structures that more fully describe the meaning that an utterance is to convey is straightforward. As hypotheses are developed in the ranking of competing constraints, including focus structure, and in the role of discourse history, they can be integrated into the model as features in the binary decision tree.</Paragraph>
    </Section>
  </Section>
</Paper>