<?xml version="1.0" standalone="yes"?>
<Paper uid="H94-1035">
  <Title>Information Based Intonation Synthesis*</Title>
  <Section position="4" start_page="0" end_page="195" type="intro">
    <SectionTitle>
2. Motivation
</SectionTitle>
    <Paragraph position="0"> Meaning-to-speech systems differ from text-to-speech systems in the manner in which semantic and pragmatic information is exploited for assigning intonational features. Text-to-speech systems for unrestricted text are forced to rely on crude syntactic analyses and word classifications in making judgements about the accentability of words in an utterance, often using the strategy of previous mention whereby a word is de-accented if it (or perhaps its root) has previously occurred in some restricted segment of the text (cf. [10], [15]). The text can be divided into such meaningful discourse segments on the basis of cue phrases and paragraph boundaries.</Paragraph>
    <Paragraph position="1"> Meaning-to-speech systems, on the other hand, have been employed in applications with limited, well-defined domains where semantic and discourse level knowledge is available. For these systems, the effectiveness of the previous mention strategy can be improved by considering semantic givenness in addition to lexical givenness when deciding if a word should be de-accented.</Paragraph>
    <Paragraph position="2"> Such enhanced previous-mention heuristics, while proving quite effective in practice, have exhibited several deficiencies that have been noted by their proponents. Foremost among these is the inability of such strategies to model the seemingly contrastive nature of many accentual patterns in spoken language ([10]). In some cases, contrastive stress errors may sound unnatural and in the worst case may actually mislead the hearer. Another problem that has been attributed to previous-mention strategies is the tendency to include too many accents ([15]), potentially leaving the hearer unable to determine the most important aspects of the speaker's intended message. The remainder of this section addresses these two problems and proposes explicitly modeling contrast in meaning-to-speech systems as a potential solution.</Paragraph>
    <Paragraph position="3"> A previous-mention strategy might work as follows:</Paragraph>
    <Paragraph position="4">
      * Assign accents to open-class items (e.g. nouns, verbs, other content words).
      * Do not assign accents to closed-class items (e.g. function words).
      * De-accent any words that were already mentioned in the local discourse segment.
    </Paragraph>
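The three rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the closed-class word list, the function name previous_mention_accents, and the whitespace tokenisation are hypothetical and do not come from the paper.

```python
# Hypothetical closed-class list; a real system would consult a lexicon or tagger.
CLOSED_CLASS = {"a", "an", "the", "i", "you", "to", "for", "this", "in", "your", "and"}


def previous_mention_accents(words, mentioned):
    """Pair each word with True (accented) or False (de-accented).

    `mentioned` holds the lower-cased words already seen in the local
    discourse segment and is updated in place.
    """
    result = []
    for word in words:
        key = word.lower()
        # Rules 1 and 2: only open-class items are candidates for an accent.
        # Rule 3: a previously mentioned word is de-accented.
        accented = key not in CLOSED_CLASS and key not in mentioned
        result.append((word, accented))
        mentioned.add(key)
    return result


segment = set()
previous_mention_accents("consider a thoracostomy procedure".split(), segment)
second = previous_mention_accents("I propose doing a left thoracostomy".split(), segment)
```

In this sketch the final two entries of second mark left as accented and thoracostomy as de-accented, because thoracostomy is already in the segment. Note that the sketch also accents the verbs, in line with the over-accentuation tendency attributed to such strategies.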
    <Paragraph position="5"> Now consider a hypothetical application in a medical domain that produces the type of output shown in (1) when a physician fails to include a recommended procedure in a plan for treating a specific patient (see footnote 1). (1) a. You seem to have neglected to consider a THORACOSTOMY procedure for this patient.</Paragraph>
    <Paragraph position="6"> b. I propose doing a LEFT thoracostomy.</Paragraph>
    <Paragraph position="7"> Using a previous-mention algorithm like the one above will produce the appropriate accentual pattern on the NP a left thoracostomy in (1)b because thoracostomy is explicitly mentioned in the previous sentence.</Paragraph>
    <Paragraph position="8"> Now suppose the physician inadvertently includes the wrong procedure in the treatment plan, say a left thoracotomy rather than the intended left thoracostomy. Example (2) shows the possible output from the system.</Paragraph>
    <Paragraph position="9"> (2) a. You seem to have confused the THORACOTOMY and THORACOSTOMY procedures in your plan for this patient.</Paragraph>
    <Paragraph position="10"> b. I propose doing a left THORACOSTOMY.</Paragraph>
    <Paragraph position="11"> b'. I propose doing a LEFT THORACOSTOMY.</Paragraph>
    <Paragraph position="12"> b''. I propose doing a LEFT thoracostomy.</Paragraph>
    <Paragraph position="13"> b'''. I propose doing a left thoracostomy.</Paragraph>
    <Paragraph position="14"> The four accentual possibilities for the NP a left thoracostomy in the second sentence are given in (2)b-b'''. Examples (2)b and b' are both acceptable because they correctly accent the contrastive thoracostomy. Based on the contents of the first sentence, however, the previous-mention strategy would produce the accentual pattern illustrated in (2)b'', which is clearly inappropriate. In fact, such an intonation may cause the hearer to infer that the program's objection was to performing the procedure on the wrong side.
    Footnote 1: The examples used throughout the paper are based on the domain of TraumAID, which is currently under development at the University of Pennsylvania ([25]). The morbid nature of the examples, for which we apologize, is due entirely to the special nature of the trauma domain. The lay reader may be interested to know that a thoracostomy is the insertion of a tube into the chest, and a thoracotomy is a surgical incision of the chest wall. In the examples, accented words are shown in small capitals.</Paragraph>
    <Paragraph position="15"> Finally, if one considers the terms left and thoracostomy to be given prior to the utterance because of their inclusion in the physician's plan, the previous-mention strategy would attempt to de-accent both terms as in (2)b'''. Since the NP clearly requires some form of accentuation, alternative strategies are necessary in such a case. Other plausible previous-mention strategies exhibit similar problems for equally simple examples.</Paragraph>
    <Paragraph position="16"> We believe that some of the problems associated with the previous-mention strategy in meaning-to-speech systems can be rectified by explicitly modeling contrastive stress. For the example above, the program initially knows that the physician's plan includes a left thoracotomy and that the program's plan includes a left thoracostomy. Hence, the program can construct an explicit set of alternative procedures from which accentual patterns can be determined. By noting that the alternatives differ not in the side on which they are to be performed, but in the actual type of procedure, the program can easily decide to stress thoracostomy rather than left. The precise algorithm for contrastive stress assignment is explained in more detail in [18].</Paragraph>
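The comparison of the two plans can be pictured with a small Python sketch. It is an illustration only, not the algorithm of [18]: the dictionary representation of a procedure and the names contrastive_attributes and render are assumptions made for the example.

```python
def contrastive_attributes(intended, alternatives):
    """Return the attribute names whose value differs in some alternative."""
    return [
        name
        for name, value in intended.items()
        if any(alt.get(name) != value for alt in alternatives)
    ]


def render(intended, stressed):
    """Upper-case the values of stressed attributes, mimicking small capitals."""
    return " ".join(
        value.upper() if name in stressed else value
        for name, value in intended.items()
    )


# Hypothetical encodings of the two plans discussed in the text.
physician_plan = {"side": "left", "procedure": "thoracotomy"}
program_plan = {"side": "left", "procedure": "thoracostomy"}

stressed = contrastive_attributes(program_plan, [physician_plan])
phrase = render(program_plan, stressed)
```

Because the two plans agree on the side and differ only in the procedure, the sketch stresses the procedure name and leaves left unaccented, as in (2)b.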
    <Paragraph position="17"> The contrastive stress approach can also avoid the over-accentuation problem of the previous-mention strategy. Consider a patient with two chest wounds: a right lateral wound and a right anterior wound. At some point our hypothetical system may need to address one of these wounds in the following manner (see footnote 2). (3) You need to address the right lateral chest wound in your treatment plan.</Paragraph>
    <Paragraph position="18"> Using the previous-mention strategy would lead to the following output if the wound had not been mentioned previously.</Paragraph>
    <Paragraph position="19"> (4) You need to address the RIGHT LATERAL CHEST WOUND in your treatment plan.</Paragraph>
    <Paragraph position="20"> The contrastive stress algorithm is able to recognize the crucial distinction between the lateral and anterior properties of the patient's two wounds and assign stress accordingly, producing: (5) You need to address the right LATERAL chest wound in your treatment plan.</Paragraph>
    <Paragraph position="21"> Footnote 2: A closely related issue is how the system decides which modifiers are necessary in the description ([20]).
    3. The Implementation</Paragraph>
    <Paragraph position="22"> The present paper describes an implemented system (IBIS) that applies the CCG theory of prosody outlined in [22, 17, 18] to the task of specifying contextually appropriate intonation for spoken messages concerning the medical expert system TraumAID, developed independently at Penn (cf. [25]). Our examples below are taken from this domain, in which we eventually intend to deploy the generation system in critiquing mode in a surgical situation, as an output device for the expert system. For the present purpose of illustrating the workings of the generation system, we have chosen a simpler (but sociologically rather unrealistic) database query application.</Paragraph>
    <Paragraph position="23"> The architecture of the system (shown in Figure 1) identifies the key modules of the system, their relationships to the database and the underlying grammar, and the dependencies among their inputs and outputs. The process begins with a fully segmented and prosodically annotated representation of a spoken query, as shown in (6). In example (6), capitals indicate stress and brackets informally indicate the intonational phrasing. The intonation contour is indicated more formally using a version of Pierrehumbert's notation ([2]). In this notation, L+H* and H* are different high pitch accents. LH% (and its relative LH$) and L (and its relatives LL% and LL$) are rising and low boundaries respectively. The difference between members of sets like L, LL% and LL$ boundaries embodies Pierrehumbert and Beckman's ([2]) distinction between intermediate phrase boundaries, intonational phrase boundaries, and utterance boundaries.
    Footnote 3: We stress that we do not start with a speech wave, but with a representation that one might obtain from a hypothetical system that translates such a wave into strings of words with Pierrehumbert-style intonation markings.</Paragraph>
    <Paragraph position="24"> Since utterance boundaries always coincide with an intonational phrase boundary, this distinction is often left implicit in the literature, both being written with % boundaries. For purposes of synthesis, however, the distinction is important since utterance boundaries must be accompanied by a greater degree of lengthening and pausing.</Paragraph>
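The boundary inventory described above can be written down as a small lookup table. The following Python sketch is illustrative only: the table layout and the helper name is_utterance_boundary are assumptions, and the classification simply restates the distinctions given in the text.

```python
# Boundary symbol -> (tone, phrase level), as described in the text.
BOUNDARIES = {
    "L": ("low", "intermediate phrase"),
    "LL%": ("low", "intonational phrase"),
    "LL$": ("low", "utterance"),
    "LH%": ("rising", "intonational phrase"),
    "LH$": ("rising", "utterance"),
}


def is_utterance_boundary(symbol):
    """True for boundaries that call for extra lengthening and pausing."""
    _tone, level = BOUNDARIES[symbol]
    return level == "utterance"
```

A synthesis component could consult such a table to decide where the greater degree of lengthening and pausing applies, for example by checking is_utterance_boundary("LL$").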
    <Paragraph position="25"> The intonational tunes L+H* LH(%/$) and H* L(L%/$) shown in example (6) convey two distinct kinds of discourse information. First, both H* and L+H* pitch accents mark the word that they occur on (or rather, some element of its interpretation) for focus, which in this task implies contrast of some kind. Second, the tunes as a whole mark the constituent that bears them (or rather, its interpretation) as having a particular function in the discourse. We have argued at length elsewhere that, at least in this restricted class of dialogues, the function of the L+H* LH% and L+H* LH$ tunes is to mark the theme - that is, "what the participants have agreed to talk about". The H* L(L%/$) tune marks the rheme - that is, "what the speaker has to say" about the theme. We employ a simple bottom-up shift-reduce parser, making direct use of the combinatory prosody theory described in [22, 17, 18], to identify the semantics of the question. The inclusion of prosodic categories in the grammar allows the parser to identify the information structure within the question as well, dividing it into theme and rheme, and marking focused items with * as shown in (7). For the moment, unmarked themes are handled by taking the longest unmarked constituent permitted by the syntax.</Paragraph>
    <Paragraph position="27"> The content generation module, which has the task of determining the semantics and information structure of the response, relies on several simplifying assumptions.</Paragraph>
    <Paragraph position="28"> Foremost among these is the notion that the rheme of the question is the sole determinant of the theme of the response, including the specification of focus (although the type of pitch accent that eventually marks the focus will be different in the response). The overall semantic structure of the response can be determined by instantiating the variable in the lambda expression corresponding to the wh-question with a simple Prolog query. Given the syntactic and focus-marked semantic representation for the response, along with the syntactic and focus-marked semantic representation for the theme of the response, a representation for the rheme of the response can be worked out from the grammar rules. The assignment of focus for the rheme of the response (i.e. the instantiated variable) must be worked out from scratch, using techniques for assigning contrastive stress.</Paragraph>
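The instantiation step can be pictured as a lookup against a fact base. The sketch below uses Python rather than Prolog and is purely illustrative: the FACTS list, the tuple encoding, and the name answer_wh_question are assumptions, and the second fact is an invented placeholder.

```python
# Hypothetical fact base standing in for the Prolog database.
FACTS = [
    ("addresses", "urinalysis", "hematuria"),
    ("addresses", "procedure_b", "condition_b"),  # invented placeholder fact
]


def answer_wh_question(predicate, subject, facts=FACTS):
    """Instantiate x in (lambda x. predicate(subject, x)) against the facts."""
    return [
        obj
        for pred, subj, obj in facts
        if pred == predicate and subj == subject
    ]


answers = answer_wh_question("addresses", "urinalysis")
```

Here answers holds the values that can fill the open variable of the question; in the paper's example the instantiated variable corresponds to hematuria.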
    <Paragraph position="29"> The algorithm for assigning contrastive stress works as follows. For a given object x in the rheme of the response, we associate a set of properties which are essential for constructing an expression that uniquely refers to x, as well as a set of objects (and their referring properties) which might be considered alternatives to x with respect to the database under consideration. The set of alternatives is initially restricted by properties or objects explicitly mentioned in the theme of the question. For each property of x in turn, we restrict the set of alternatives to include only those objects having the given property. When imposing the restriction decreases the number of alternatives, we conclude that the given property serves to distinguish x from its alternatives, suggesting that the corresponding linguistic material should be stressed.</Paragraph>
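The restriction procedure just described can be sketched as follows. This is a minimal illustration, not the authors' implementation: properties are represented as plain strings, alternatives as sets of such strings, and the function name contrastive_stress is hypothetical. The initial restriction by the theme of the question is assumed to have been applied already.

```python
def contrastive_stress(properties, alternatives):
    """Return the properties of x that distinguish it from its alternatives.

    `properties` is the ordered list of referring properties of x.
    `alternatives` is a list of property sets, one per alternative object.
    """
    stressed = []
    remaining = list(alternatives)
    for prop in properties:
        # Keep only the alternatives that share this property with x.
        restricted = [alt for alt in remaining if prop in alt]
        # A drop in the number of alternatives means the property is contrastive.
        if len(remaining) > len(restricted):
            stressed.append(prop)
        remaining = restricted
    return stressed


# The two chest wounds from the earlier example.
target = ["right", "lateral", "chest", "wound"]
others = [{"right", "anterior", "chest", "wound"}]
result = contrastive_stress(target, others)
```

For the wound example only lateral eliminates the competing wound, so the sketch selects it alone for stress, in agreement with (5).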
    <Paragraph position="30"> For example, for the question given in (6), the content generator produces the following representation, because the theme is "What urinalysis addresses", the rheme is "hematuria", and the context includes alternative conditions and treatments:</Paragraph>
    <Paragraph position="32"> From the output of the content generator, the CCG generation module produces a string of words and Pierrehumbert-style markings representing the response, as shown in (9).
    (9) urinalysis@lhstar addresses@lhb hematuria@hstarllb
    The final aspect of generation involves translating such a string into a form usable by a suitable speech synthesizer. The current implementation uses the Bell Laboratories TTS system [14] as a post-processor to synthesize the speech wave.</Paragraph>
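As a first step toward such a translation, the annotated string can be split into words and their markings. The Python sketch below is an assumption-laden illustration: it supposes that the @ character separates each word from its marking, as in hematuria@hstarllb, and the function name parse_annotated is hypothetical. It does not reproduce the input format of any particular synthesizer.

```python
def parse_annotated(text):
    """Split 'word@marking' tokens into (word, marking) pairs.

    Words without a marking are paired with None.
    """
    tokens = []
    for item in text.split():
        word, _sep, marking = item.partition("@")
        tokens.append((word, marking or None))
    return tokens


pairs = parse_annotated("urinalysis@lhstar addresses@lhb hematuria@hstarllb")
```

Each resulting pair keeps the marking as an opaque label, so a later stage can map labels such as lhstar or hstarllb onto whatever accent and boundary commands the chosen synthesizer expects.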
  </Section>
</Paper>