<?xml version="1.0" standalone="yes"?> <Paper uid="W97-1202"> <Title>Message-to-Speech: high quality speech generation for messaging and dialogue systems</Title> <Section position="4" start_page="0" end_page="11" type="metho"> <SectionTitle> 2 Prosody Transplantation </SectionTitle> <Paragraph position="0"> The idea behind Prosody Transplantation is to copy intonation and duration values from a recorded donor message (human speech) to the phonetic transcription of the same message. The Enriched Phonetic Transcription (EPT) obtained in this manner can be fed to a TTS system in which the normal linguistic and prosodic modules (based on general models) are bypassed (Phonetics-to-Speech, PTS). Only the segmental synthesis and the synthesiser modules are used.</Paragraph> <Paragraph position="2"> An example of an EPT is provided by figure 1. [Figure 1: EPT of the sentence &quot;Thank you for your attention&quot;]</Paragraph> <Paragraph position="3"> The first value between square brackets is the phoneme duration (in ms), optionally followed by one or more intonation breakpoints. Each breakpoint consists of a location value (in ms) relative to the beginning of the phoneme, followed by a pitch value (in ST/4; reference 50 Hz).</Paragraph> <Paragraph position="4"> A major asset of Prosody Transplantation is the combination of natural sounding speech with a low bit rate for storage (less than 300 bits per second). In addition, only the prosody and not the timbre of the speaker is retained. New donor messages can therefore be recorded by new speakers and seamlessly integrated in existing applications. Specific tools have been developed to speed up the prosody transplantation process (Van Coile et al., 1994). 
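To make the breakpoint notation described above concrete, the following Python sketch decodes the bracketed values of one phoneme and converts its pitch values to Hz. It is an illustration only: it assumes that ST/4 denotes quarter-semitone steps (48 per octave) above the 50 Hz reference, and the function names and the flat list layout of the values are hypothetical rather than part of the system described here.

```python
REFERENCE_HZ = 50.0      # reference frequency stated in the text
STEPS_PER_OCTAVE = 48    # assumption: ST/4 = quarter semitones, 12 * 4 per octave


def pitch_to_hz(pitch_st4):
    """Convert a pitch value in quarter semitones above 50 Hz to Hz."""
    return REFERENCE_HZ * 2.0 ** (pitch_st4 / STEPS_PER_OCTAVE)


def decode_phoneme_values(values):
    """Split the bracketed numbers of one phoneme into duration and breakpoints.

    values: [duration_ms, location_ms, pitch_st4, location_ms, pitch_st4, ...]
    Returns (duration_ms, [(location_ms, pitch_hz), ...]).
    """
    duration_ms = values[0]
    rest = values[1:]
    if len(rest) % 2 != 0:
        raise ValueError("breakpoints must come in (location, pitch) pairs")
    breakpoints = [
        (rest[i], pitch_to_hz(rest[i + 1])) for i in range(0, len(rest), 2)
    ]
    return duration_ms, breakpoints


# Hypothetical phoneme: 80 ms long, one breakpoint at 20 ms with pitch value 48
# (48 quarter semitones is one octave above the 50 Hz reference).
duration, points = decode_phoneme_values([80, 20, 48])
print(duration, points)
```

Under the stated assumption, a pitch value of 0 maps to the 50 Hz reference and every additional 48 steps doubles the frequency, which keeps the stored values small integers.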
Although the EPTs as such do not support linguistic variation, the combination of PTS with a template-driven system provides linguistic flexibility as well as natural prosody.</Paragraph> </Section> <Section position="5" start_page="11" end_page="13" type="metho"> <SectionTitle> 3 The Message-to-Speech System </SectionTitle> <Paragraph position="0"> The message-to-speech system described in this section takes as input a message specification and outputs synthetic speech with highly natural prosody.</Paragraph> <Paragraph position="1"> Below, we first define the key concepts and then focus on two main modules of the system: the generation module and the prosodic module.</Paragraph> <Section position="1" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 3.1 Key Concepts </SectionTitle> <Paragraph position="0"> A message can be seen as a complete sentence. It is specified as a concatenation of message units (building blocks that constitute prosodic units). The flexibility of a message unit is guaranteed by the presence of slots. A slot is a placeholder that can take an argument. A carrier is a template containing the enriched phonetic transcription of the canned text part, transplanted from an appropriate donor, and zero or more slots. For each slot, the carrier contains morpho-syntactic and prosodic information. By filling out a slot of a carrier with different arguments, several variants can be derived from the same message unit at run-time.</Paragraph> <Paragraph position="1"> Figure 2 shows the wave and the prosody corresponding to the donor &quot;in four miles&quot;. 
In order to obtain a flexible carrier, &quot;four&quot; is cut away and replaced by a slot in which any argument of the type /number/ (see figure 3) can be filled out at run-time.</Paragraph> <Paragraph position="2"> Figure 3 illustrates that the message &quot;In four miles, bear left.&quot; is realised as a concatenation of two message units: &quot;in /number/ mile(s)&quot; and &quot;bear left&quot;. The message unit &quot;in /number/ mile(s)&quot; has one slot in which a numeric argument is to be filled out. The message unit &quot;bear left&quot; has no slots.</Paragraph> <Paragraph position="3"> [Figure 3: message units and carriers for a message (&quot;In four miles, bear left&quot;)]</Paragraph> </Section> <Section position="2" start_page="11" end_page="12" type="sub_section"> <SectionTitle> 3.2 Message-to-Speech Generation Module </SectionTitle> <Paragraph position="0"> The generation module translates each message unit of the message specification into a carrier with optional arguments. This translation is guided by a two-fold mechanism: * argument dependent carrier selection consists in selecting a carrier as a function of (a characteristic of) an argument. Figure 4 shows that the message unit &quot;in /number/ mile(s)&quot; can be realised by one of two carriers, depending on the numeric argument that is filled out. If the argument is &quot;1&quot;, the message unit is mapped onto carrier 1a. In all other cases, the message unit is mapped onto carrier 1b.</Paragraph> <Paragraph position="1"> * carrier dependent argument realisation consists in determining the correct surface realisation of an argument, depending on properties of the slot in which it is inserted. Figure 5 illustrates that the argument &quot;1&quot; has a different surface realisation (&quot;a&quot; versus &quot;an&quot;) depending on the phonetic onset of the word to the right of the slot.</Paragraph> <Paragraph position="4"> For the arguments filled out in the slot of a carrier, prosody is calculated at run-time (see section 3.3). As prosody derived from human recordings is preferred over prosody calculated at run-time, we try to keep the number of slots in a carrier as small as possible. Therefore, the possibility is offered to delete arguments during the translation of message units into carriers. This functionality is exploited when the number of possible slot fillers is restricted. Figure 6 shows a message unit with one slot that is translated into one of four carriers without a slot, depending on the message unit argument.</Paragraph> </Section> <Section position="3" start_page="12" end_page="13" type="sub_section"> <SectionTitle> 3.3 Message-to-Speech Prosodic Integration Module </SectionTitle> <Paragraph position="0"> The purpose of the prosodic integration module is to calculate appropriate prosody for the arguments that are filled out in a slot of a carrier. For this purpose, a phonetic transcription of the argument needs to be available. This transcription can be obtained by a dictionary look-up or by using a grapheme-to-phoneme conversion routine.</Paragraph> <Paragraph position="1"> In a first step, a duration is calculated for each of the phonemes in the argument. In a second step, an appropriate intonation contour is calculated.</Paragraph> <Paragraph position="2"> The input of the duration module is a phonetic transcription in which primary and secondary stress are indicated. The duration module has access to one or more duration models in order to produce a duration value for each phoneme in a phonetic transcription. 
A duration model is a rule-based system calculating durations, taking into account parameters such as lexical stress, position of phonemes (word initial, word medial, word final, sentence final), length of the argument, phonetic context of phonemes (left/right neighbour, consonant cluster), etc. As speech rate can vary from one message to another, a slot specific speech rate coefficient, provided by the carrier, is also taken into account.</Paragraph> <Paragraph position="3"> Two major strategies with respect to duration modelling can be distinguished: * As the most natural prosody is the one derived from human speech, the possibility is offered to feed the duration module with phonetic transcriptions enriched with duration information copied from natural speech. When customising the MTS system, an argument dictionary containing this information can be built off-line by making use of the prosody transplantation tools (see section 2). If transplanted durations are available in the argument, they are taken over by the duration module and only modified in specific cases, e.g. a duration may be changed in order to cope with a phenomenon such as final lengthening.</Paragraph> <Paragraph position="4"> * For arguments without transplanted durations, a general purpose duration module is activated.</Paragraph> <Paragraph position="5"> It consists of a cascade of different duration models, each less specific than the previous one.</Paragraph> <Paragraph position="6"> Specific duration models exist for particular arguments such as numbers or date and time indications. The general purpose model is only used if a more specific model is not available.</Paragraph> <Paragraph position="7"> Special tools have been developed to speed up the creation of general and special purpose duration models.</Paragraph> <Paragraph position="8"> The input of the intonation module is a phonetic transcription enriched with phoneme duration information. 
The output is a phonetic transcription describing both duration and intonation. After taking care of assimilation, this enriched phonetic transcription can be inserted without further action into the carrier.</Paragraph> <Paragraph position="9"> There are two ways to model the intonation on arguments: * The most natural intonation is obtained by transplanting part of an intonation contour as observed in a donor sentence onto the argument that is to be filled out in a carrier. It is indeed possible to reuse the intonation as realised on &quot;4.6&quot; in the donor phrase &quot;in 4.6 miles&quot; for the argument &quot;9.5&quot; that is to be inserted in the carrier &quot;in /number/ miles&quot;.</Paragraph> <Paragraph position="10"> * If no appropriate donor contour is available, the intonation module calculates a piecewise linear intonation contour based on slot specific intonation models. The slot specific intonation parameters that are taken into account include, among others, the begin pitch, the end pitch, the declination rate and the intonation context (final fall, continuation rise, etc.).</Paragraph> </Section> </Section> <Section position="6" start_page="13" end_page="14" type="metho"> <SectionTitle> 4 Discussion </SectionTitle> <Paragraph position="0"> The Message-to-Speech system is designed to generate high quality speech output with the flexibility desired for spoken dialogue and message generating systems. It produces high quality speech while morpho-syntactic variations are taken into account.</Paragraph> <Paragraph position="1"> More specifically, as the message units and underlying carriers can take arguments, it is possible to generate several variants of the same basic message.</Paragraph> <Paragraph position="2"> * variations on the level of a carrier slot can be paradigmatic: a message ranges over all the elements belonging to a certain semantic category (e.g. 
product name, cardinality, direction; see figure 6) but the actual message is not known beforehand.</Paragraph> <Paragraph position="3"> * variations on the level of a carrier slot can be syntagmatic: agreement of all kinds, liaison, contraction, etc. (see figures 4 and 5).</Paragraph> <Paragraph position="4"> * variations on the level of the message units can be semantic: new combinations of message units lead to the creation of new messages. E.g. the message unit &quot;in /number/ mile(s)&quot; can be combined not only with a message unit &quot;drive /slowly_fast/&quot; but also with the message unit &quot;bear /left_right/&quot;.</Paragraph> <Paragraph position="5"> Highly natural prosody for the carriers is obtained thanks to the prosody transplantation technique.</Paragraph> <Paragraph position="6"> The prosody transplantation technique can be used for the slot arguments as well. However, if no donor prosody is available for an argument, prosody is calculated at run-time on the basis of specific duration and intonation models.</Paragraph> </Section> <Section position="7" start_page="14" end_page="14" type="metho"> <SectionTitle> 5 Related Research </SectionTitle> <Paragraph position="0"> In what follows we try to relate the MTS system to the levels that are generally recognised to form part of an NLG system. A well known architectural scheme outlining the three basic levels of an NLG system has been proposed by Reiter (Reiter, 1994, p. 164): 1. content determination and text planning: The content of the message to be communicated is mapped onto a semantic form, possibly annotated with rhetorical relations. On this level, reasoning takes place about the communicative goals of the text or message and the rhetorical relations between these goals.</Paragraph> <Paragraph position="1"> 2. sentence planning: The information of the semantic form is distributed over sentences and paragraphs. The sentences are linked together. 3. 
surface generation: The abstract specification of the linguistic structure is mapped to a surface form that communicates the information, while syntactic and morphological processing is done in order to generate a grammatically correct surface form.</Paragraph> <Paragraph position="2"> If we compare our strategy with the classification proposed above, the mapping of a message unit onto carriers is to be situated on the surface generation level. The result of the mapping stage is a complete surface form (represented by an EPT): the precise wording of a message has been fixed in accordance with syntactic and morphological restrictions. The prosodic integration phase has no explicit place in Reiter's architecture since he only studied text generation systems.</Paragraph> <Paragraph position="3"> The content determination, text planning, and sentence planning levels are not provided by the MTS system. In a number of practical message generating systems, the content of a message corresponds in a straightforward manner to the message units, which can therefore easily be generated by the back-end application.</Paragraph> </Section> </Paper>