<?xml version="1.0" standalone="yes"?> <Paper uid="A88-1016"> <Title>I \] \[ Generation of LPC frames \] \['Advertising-Style ProsodyJ I LPC Frames</Title> <Section position="2" start_page="0" end_page="115" type="metho"> <SectionTitle> 1. PRESENTATION OF MULTIVOC </SectionTitle> <Paragraph position="0"> The MULTIVOC text-to-speech system is the result of a technology transfer from a research institute (CNET Lannion, France), which developed the basis of the system, to an industrial company (Cap Sogeti Innovation, France), which made the system a commercial product. MULTIVOC generates Linear Prediction Coding (LPC) frames from ordinary text written in French; its goal is to give any standard application the ability to produce low-cost, high-quality speech output in real time.</Paragraph> <Paragraph position="1"> MULTIVOC is shipped as a complete software system that provides a sophisticated driver through which applications can directly send French text to be spoken. The software package consists of the kernel of the driver itself and a set of dictionaries used by it. Several tools in the package allow an advanced user to tailor the MULTIVOC driver to a specific usage. Besides this static configuration facility, MULTIVOC also provides several run-time features. By submitting specific requests, an application can change the following parameters: * The sampling frequency for generated frames.</Paragraph> <Paragraph position="2"> Three different frequencies are available: 8 kHz, 10 kHz and 16 kHz. This parameter determines the quality of the output voice, a frequency of 16 kHz providing the best results.</Paragraph> <Paragraph position="3"> * The tone of the output voice can be adjusted in the range 50-350 Hz.</Paragraph> <Paragraph position="4"> * The speech speed may be set from 1 to 10 syllables per second.</Paragraph> <Paragraph position="5"> * Two styles of prosody are provided. 
The &quot;reading-style&quot; corresponds to the usual way of reading a text, while the &quot;advertising-style&quot; is dedicated to short commercial messages like jingles. * One can also choose between a female and a male voice.</Paragraph> <Paragraph position="6"> The synthesis method produces Linear Prediction Coding (LPC) frames generated from a diphone dictionary. Such a dictionary is specific to the sampling frequency used (8, 10 or 16 kHz) and also to the type of voice (female or male). MULTIVOC therefore provides six different diphone dictionaries, one for each combination of the three frequencies and the two voices.</Paragraph> <Paragraph position="7"> The overall processing is organized as a pipelined set of transformations applied to the input text. At the highest level, one can distinguish the following functions: The pre-processing (or lexical processing) is a text-to-text transformation that expands non-word terms such as numbers (1987 --> &quot;Mille Neuf Cent Quatre-Vingt-Sept&quot;), administrative numbers (A4/B5 --> &quot;A Quatre B Cinq&quot;) or acronyms (CSINN. --> &quot;Cap Sogeti Innovation&quot;). The phonetization process transforms the pre-processed text into phonemes according to pre-defined rules stored in a user-modifiable base. The prosody marking process scans the phonetized text and generates appropriate marks to reflect the prosody of the text, using built-in rules based on the different punctuation signs and the grammatical type of words.</Paragraph> <Paragraph position="8"> The rhythm marking process computes the duration associated with each phoneme.</Paragraph> <Paragraph position="9"> Last, the frame generation process produces the LPC frames which correspond to the input text according to the different parameters specified and can be sent directly to the output device.</Paragraph> <Paragraph position="10"> In this overall processing, we have deliberately avoided a time-consuming syntax analysis, to enable MULTIVOC to run in real time. 
This choice has made MULTIVOC a commercially viable product that provides high-quality speech at low cost and has been sold as a basic component for several industrial applications. MULTIVOC is available on IBM-PC based systems.</Paragraph> </Section> <Section position="3" start_page="115" end_page="115" type="metho"> <SectionTitle> 2. THE MULTIVOC PROCESS </SectionTitle> <Paragraph position="0"> As explained in the previous section, the input text provided by an application is processed in a &quot;pipe-line&quot; through five processes (see figure 1). Each process takes as input the result of the preceding one and fills specific attributes of the objects composing the internal representation of the text. The final result, a list of LPC frames, is then sent to the LPC interpreter of a speech synthesis device (not described here).</Paragraph> <Paragraph position="2"> The main purpose of this first step is to decompose the input sentences into a list of words and to set the lexical attributes of each word. To allow ordinarily written text to be processed correctly, some patterns are translated into a sequence of words: * numbers are expanded according to the rules of the French language. The words generated are tagged to permit a correct prosody marking for numbers.</Paragraph> <Paragraph position="3"> * digital dates and time templates (the list is not exhaustive) are matched against corresponding patterns in a set of rules which define the transformation to be applied. Patterns corresponding to the matching part and the transformation format are expressed using a UNIX scanf/printf-like syntax.</Paragraph> <Paragraph position="4"> * abbreviations and acronyms are translated according to a user-defined lexicon. The translation part associated with each entry of the lexicon can be: - empty to specify that the recognized word is to be spelled ex: 'MIT.' --> . 
(which will produce 'M I T' [EM EE TAY in French]) - a full text string which will replace the matching word ex: 'MIT.' --> 'Massachusetts Institute of Technology' (in French!...) - a phonetic string if the pronunciation is very different from the lexical form. This function is particularly useful for company or product names ex: 'MIT.' --> 'AI&quot;MAYTI'. (better) * mathematical symbols are also translated. The process then checks whether each word can be pronounced, according to a dictionary of the pronounceable letter sequences of French; if it cannot, the word is spelled.</Paragraph> <Paragraph position="5"> Finally, an attribute describing the grammatical nature of the word (pronoun, determiner, preposition, ...) is associated with each word. The dictionary used for this is rather small (300 entries) and does not contain most verbs but does contain the usual auxiliaries.</Paragraph> <Paragraph position="6"> A complete analysis of the sentences would provide a better prosody but, due to the size of the corresponding dictionary, could not be performed in real time. 
The resulting prosody is nevertheless judged very natural, although in a few cases it sounds somewhat strange.</Paragraph> <Paragraph position="7"> &lt;LC&gt; and &lt;RC&gt; are the respective Left and Right Contexts of the Matching Sequence, and &lt;PS&gt; is the sequence of Phonetic Symbols to be generated. The rule has the meaning: &quot;Replace &lt;MS&gt; by &lt;PS&gt; if &lt;MS&gt; is preceded by &lt;LC&gt; and followed by &lt;RC&gt;&quot;.</Paragraph> <Paragraph position="8"> Each context specification (&lt;LC&gt; and &lt;RC&gt;) can be empty, in which case the rule is applicable with no conditions, or can be expressed as a logical combination of elementary contexts:</Paragraph> <Paragraph position="10"> An elementary context is either a sequence of characters or a class of sequences of characters (e.g.</Paragraph> <Paragraph position="11"> consonants or vowels).</Paragraph> <Paragraph position="12"> During interpretation, if several rules are applicable, the one containing the longest Matching Sequence is chosen: thus, the interpreter goes from the particular case to the general case. If more than one rule satisfies this criterion, the first one is chosen; if no rule is applicable, a character is popped from the input and pushed to the output before the process starts again.</Paragraph> <Paragraph position="13"> - the character '_' (underscore) denotes a blank character - the character '|' denotes the logical operator OR - the character '&amp;' denotes the logical operator AND. One of the sets of rules is dedicated to the determination of the correct liaisons between words.</Paragraph> </Section> <Section position="4" start_page="115" end_page="115" type="metho"> <SectionTitle> * PHONETIZATION </SectionTitle> <Paragraph position="0"> This process transforms the sentences into a sequence of phonetic symbols. This transformation is carried out by five sets of rules. 
The sets are applied successively to the input text.</Paragraph> <Paragraph position="1"> Each rule has the following form: [&lt;LC&gt;] &lt;MS&gt; [&lt;RC&gt;] --> &lt;PS&gt;.</Paragraph> <Paragraph position="2"> where &lt;MS&gt; is the Matching Sequence of characters in the input text</Paragraph> </Section> <Section position="5" start_page="115" end_page="117" type="metho"> <SectionTitle> * PROSODY MARKING </SectionTitle> <Paragraph position="0"> The synthetic speech produced by mere concatenation of diphones is comprehensible but not very natural. To give it an acceptable quality, prosody processing is necessary. Prosodic facts are of two kinds (Emerard, 1977), (Guidini, 1981), (Sorin, 1984): * macro-prosody, related to the syntactic and semantic structure of the sentence, * micro-prosody, treating the interaction between two consecutive phonemes.</Paragraph> <Paragraph position="1"> A study of a set of phrases and of the diversity of voice &quot;styles&quot; (reading, advertising, ...) has led to an automatic prosody generation system (Aggoun, 1987). In the first step, this process decomposes the sentences into a set of so-called prosody-groups and assigns a group category to each of them. In the second step, each word within a group is marked and a pause may be associated with it.</Paragraph> <Paragraph position="2"> A prosody-group is formed by consecutive words. A set of rules determines the boundaries of a group and its associated category. The main criteria involved in this decomposition are: * the punctuation marks (including the end of a sentence), each of them defining a different category * the grammatical natures of two consecutive words. For example, a group ends after a lexical word (noun, non-auxiliary verbal form) followed by a grammatical word (determiner, preposition, ...). 
In that case, the category of the group depends on the second word.</Paragraph> <Paragraph position="3"> The resulting sequence of groups is then processed in order to adjust their categories. Here again, the process is governed by rules based on the following information: * the length of the group (the number of words it contains), * the number of syllables of each word within the group, * the number and the length of non-lexical words, * the category of the adjacent groups. An example of such a rule: IF there exists a sequence (S) containing 3 groups of category '5' without a pause already established for one of them, AND one of them (G) begins with one of the following determiners ('AU' or 'AUX'), THEN give category '4' to G and give it a short pause unless its pause is already long.</Paragraph> <Paragraph position="4"> Fifty rules of this kind allow a complete categorization of the groups.</Paragraph> <Paragraph position="5"> [Note: some of them are simpler!]</Paragraph> <Section position="1" start_page="117" end_page="117" type="sub_section"> <SectionTitle> Word Marking </SectionTitle> <Paragraph position="0"> According to the category of the group it belongs to, its length and its grammatical nature, each word of a group is then marked and, possibly, a pause is placed at the end of the word.</Paragraph> <Paragraph position="1"> For example: IF the group contains exactly 2 consecutive non-lexical words, AND the first one has one syllable AND the second more than one, THEN give the first word the mark '6+' and give the second the mark '4-'. It should be noted that the set of rules used depends on the style of prosody required by the application ('reading' or 'advertising').</Paragraph> <Paragraph position="2"> Although some attempts have been made to express the prosody-marking rules in a declarative way based on the logic paradigm (Sorin, 1984), (Aggoun, 1987), the efficiency criteria and the real-time objective we have defined for this product led 
us to represent them in a procedural way rather than in a production-rule form.</Paragraph> <Paragraph position="3"> At the end of this process, some words remain unmarked. In the next processes, we consider a sequence of unmarked words terminated by a marked one (a prosody-word) as the basic entity to deal with.</Paragraph> </Section> </Section> <Section position="6" start_page="117" end_page="118" type="metho"> <SectionTitle> * RHYTHM COMPUTATION </SectionTitle> <Paragraph position="0"> The third process involved in MULTIVOC computes the duration to associate with each phoneme. This duration is computed by a set of rules from the different attributes attached to each word and to each phoneme, which are: * the kind of phoneme (plosive [bang], fricative [french], liquid [long]), * the mark associated with the word, * the number of syllables of the word, * the position of the phoneme within the word. As an example of such rules: IF the last phoneme of the word is a vowel AND the mark of the word is '5' OR if a pause is associated with the word, THEN give a duration of '1.4' to this phoneme. [Note: the default duration of every phoneme is '1.0'.]</Paragraph> </Section> <Section position="7" start_page="118" end_page="118" type="metho"> <SectionTitle> * PROSODY GENERATION </SectionTitle> <Paragraph position="0"> To every word-mark corresponds a macro-melody schema. This schema enables us to determine the variation of the pitch along the word.</Paragraph> <Paragraph position="1"> Three basic functions are used to express the pitch variation: * constant: the pitch remains unchanged * linear interpolation * exponential variation, namely $F(t) = F(t_0)\, e^{-p(t - t_0)}$, where $F(t)$ denotes the value of the pitch at time $t$, $t_0$ is the initial time and $p$ is a constant ($p = 0.68$). Every macro-melody schema begins at Fdeb, the fundamental frequency of the speaker. 
Fdeb is set to 240 Hz for a female voice and 120 Hz for a male voice. This fundamental is adjusted if the word has a micro-mark '+' or '-'.</Paragraph> <Paragraph position="2"> Then a set of rules determines when these functions should be applied to a word.</Paragraph> <Paragraph position="3"> As an example: For words with mark '1' and containing more than four syllables: - apply constant from the beginning until the middle of the second vowel, - apply exponential with p/2 until the beginning of the first voiced phoneme of the last syllable (point A), - apply constant Fdeb/2 from the end of the last vowel (point B) to the end of the word, - interpolate from A to B. Then a set of micro-prosody rules is applied on the vowels ('fine tuning').</Paragraph> <Paragraph position="4"> Example: IF a vowel is not in the last syllable of a word AND is followed by an unvoiced consonant THEN the pitch of the last LPC frames of the vowel is adjusted in the following manner:</Paragraph> <Paragraph position="6"> At this step in the process, all needed information has been computed (pitch, duration) and MULTIVOC generates an LPC structure after having accessed a dictionary of diphones to get the coefficients of the lattice filter for each phoneme.</Paragraph> </Section> <Section position="8" start_page="118" end_page="118" type="metho"> <SectionTitle> 3. IMPLEMENTATION OF MULTIVOC </SectionTitle> <Paragraph position="0"> The MULTIVOC software was developed in C on MS-DOS 3.2 and is compatible with UNIX BSD 4.2. This product is sold either as a running package (binary form) for IBM-PC compatible computers or as an adaptable package (source form) for specific usage.</Paragraph> <Paragraph position="1"> On the IBM-PC, the speech synthesis device used comes from the OROS Company (France) and takes the form of an IBM-PC plug-in board (OROS-AU20) based on a Texas Instruments TMS320/20 processor. 
The MULTIVOC driver is implemented as a memory-resident program which applications can address using an interrupt mechanism. In this way, any application can very easily send text to be pronounced in real time.</Paragraph> <Paragraph position="2"> A Microsoft Windows application has been developed to demonstrate the facilities offered by MULTIVOC. Users can enter text using a built-in editor and can send all of the text, or a mouse-selected part of it, to MULTIVOC. A form (Dialogue-Box) allows the different parameters of MULTIVOC to be set to user-specified values.</Paragraph> <Paragraph position="3"> MULTIVOC has also been successfully ported to UNIX BSD 4.2 on a SUN-3, but the driver-specific aspects have not yet been developed because of the lack of speech synthesis devices for such machines.</Paragraph> </Section> <Section position="9" start_page="118" end_page="119" type="metho"> <SectionTitle> 4. APPLICATIONS OF MULTIVOC </SectionTitle> <Paragraph position="0"> We give below three examples of real-world applications of MULTIVOC in an industrial context: * The first one was to use MULTIVOC to pronounce TELEX-style messages. This has been achieved by defining an appropriate lexicon for the numerous abbreviations and acronyms used in such messages. The sources of MULTIVOC have not been modified.</Paragraph> <Paragraph position="1"> * The second application, or class of applications, is to adapt MULTIVOC to low-cost, small home computers in order to develop a new generation of products for this market (computer-aided education software, for example). This is being done by two customers who bought the sources of MULTIVOC and are now producing a restricted version of the product. * The third application is to use MULTIVOC as a basic component in a sophisticated application. We are now running a project for the French Telecommunications (DGT) to develop phone-based mail services. 
Using a standard French phone, any user will be able to call the mailing service and dial commands to hear the different messages he has received. Several user-friendly features will allow the user to hear part or all of a message again or to change MULTIVOC-like parameters (deeper voice, slower, ...). For the purposes of this project MULTIVOC will not be changed.</Paragraph> </Section> </Paper>