<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-1436">
  <Title>ROMVOX - EXPERIMENTS REGARDING UNRESTRICTED TEXT-TO-SPEECH SYNTHESIS FOR THE ROMANIAN LANGUAGE</Title>
  <Section position="1" start_page="0" end_page="0" type="metho">
    <SectionTitle>
ROMVOX - EXPERIMENTS REGARDING UNRESTRICTED TEXT-TO-SPEECH
SYNTHESIS FOR THE ROMANIAN LANGUAGE
ATTILA FERENCZ*, TEODORA RATIU*, MARIA FERENCZ*,
TONDE-CSILLA KOVACS*, ISTVAN NAGY*, DIANA ZAIU**
* Software ITC, 109 Republicii street, 3400 Cluj-Napoca, Romania,
</SectionTitle>
    <Paragraph position="0"> tel: +40-64-197681, fax: +40-64-196787, e-mail: Attila.Ferencz@sitcl.dntcj.ro ** Technical University of Cluj-Napoca, 26 Gh. Baritiu street, 3400 Cluj-Napoca, Romania Abstract. The ROMVOX Text-to-Speech synthesis system developed by our team is the first to allow the synthesis of any unrestricted Romanian text with intonation facilities on IBM-PC compatible computers. Over the last years of research, several versions of the text-to-speech system were developed, each extending its facilities. This paper describes the present stage of our experiments aimed at improving the naturalness of the generated voice.</Paragraph>
  </Section>
  <Section position="2" start_page="0" end_page="304" type="metho">
    <SectionTitle>
1. Introduction
</SectionTitle>
    <Paragraph position="0"> Speech synthesis systems are expected to play important roles in advanced user-friendly human-machine interfaces. Aiming to build the best possible text-to-speech system for the Romanian language, our research started with the development of software for monotonous speech synthesis, which simply concatenated the elements of the speech database. Prosody requires a corresponding modification of the synthesized speech signal; this modification was performed in the second version, based on LPC. The experimental results using the classical LPC synthesis method showed that the quality of the synthesized signal is limited and cannot be considerably improved by raising the prediction order, the sampling frequency or the refresh rate of the parameters. The following chapters present the language-specific aspects of the ROMVOX system and our latest approach to the synthesis technique.</Paragraph>
    <Paragraph position="1">  the incoming orthography into some linguistically reasonable standard forms. Many phenomena are encountered in normal orthography, such as underlining, the occurrence of capitals, abbreviations containing periods, abbreviations containing no vowels, numbers, fractions, Roman numerals, dates, times, formulas and a wide variety of punctuation, including periods, commas, question marks, parentheses, quotation marks and hyphens. In our system the abbreviations are stored in a vocabulary, which can be extended by the user, so field-specific abbreviations can be built into the system.</Paragraph>
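    <Paragraph> The user-extendable abbreviation vocabulary described above could be sketched as follows. This is a minimal illustration in Python, not the ROMVOX implementation: the class name, the starter entries and the whitespace tokenization are all assumptions made for the example.
```python
# Hypothetical starter vocabulary; the entries are illustrative only
# and are not taken from the ROMVOX system.
DEFAULT_ABBREVIATIONS = {
    "dr.": "doctor",
    "str.": "strada",
}


class AbbreviationVocabulary:
    """User-extendable table mapping abbreviations to full forms."""

    def __init__(self, entries=None):
        self.entries = dict(DEFAULT_ABBREVIATIONS)
        if entries:
            self.entries.update(entries)

    def add(self, abbreviation, expansion):
        # Keys are stored in lower case so that lookup ignores case.
        self.entries[abbreviation.lower()] = expansion

    def expand(self, text):
        # Assumption: tokens are separated by whitespace; a token that
        # is not in the table is passed through unchanged.
        tokens = text.split()
        return " ".join(self.entries.get(t.lower(), t) for t in tokens)
```
A field-specific entry is then added with a call such as vocabulary.add("ing.", "inginer") before the text is expanded.</Paragraph>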
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Speech sound set
</SectionTitle>
      <Paragraph position="0"> We used a set of 31 phonemes for the Romanian language. As internal representation for special Romanian sounds we used the following symbols: g1 (in ge, gi), g (in ghe, ghi), c (in ce, ci), k (in che, chi), a1 (for the Romanian letter ă), i1 (for î and â), s1 (for ş), t1 (for ţ).</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="304" type="sub_section">
      <SectionTitle>
2.3 Conversion of graphemes
</SectionTitle>
      <Paragraph position="0"> In our system the grapheme-to-phoneme conversion rules are alphabetized according to the first letter of the sequence. Each letter of the alphabet represents a separate rule block in the table. Within a block the longest rule is at the top and the shortest rule at the bottom; i.e., the last rule consists of only one letter.</Paragraph>
      <Paragraph position="3"> Examples: The e sound rule block: es1ti=_j1es1tj1_, este=_j1este_, exa=egza, eio=ej1o, ea=j1a_, el=_j1el_, ei=ej1_, e=e.</Paragraph>
      <Paragraph position="4"> The c sound rule block: coop=ko_op, cea=ca, cio=co, chi=ki, che=ke, ci=ci, ce=ce, c=k.</Paragraph>
      <Paragraph position="5"> Here _ means pause and j1 means special short i. As a result of the grapheme-to-phoneme conversion algorithm, the desired string of diphones is obtained. For example, the string corresponding to the word 'floare' is: _f fl lo oa ar re e_.</Paragraph>
      <Paragraph position="6"> In future versions of ROMVOX, a second level of processing of the sound codes will be tested, so that timing modifications can be made according to the rules of the prosody preparation module.</Paragraph>
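      <Paragraph> The rule-block lookup and the diphone string can be illustrated with a short Python sketch. It is an assumption-laden example rather than the ROMVOX code: only part of the rule blocks is reproduced (the word-boundary rules containing pauses are left out), letters without a rule are assumed to map to themselves, and the function names are hypothetical. Multi-character symbols such as j1 would have to be passed to to_diphones as separate list items.
```python
# Rule blocks keyed by first letter, longest rule first. Only a subset
# of the "c" and "e" blocks is listed; this is not the full rule table.
RULE_BLOCKS = {
    "c": [("coop", "ko_op"), ("cea", "ca"), ("cio", "co"), ("chi", "ki"),
          ("che", "ke"), ("ci", "ci"), ("ce", "ce"), ("c", "k")],
    "e": [("exa", "egza"), ("e", "e")],
}


def to_phonemes(word, blocks=RULE_BLOCKS):
    """Apply the first (that is, longest) matching rule at each position."""
    out = []
    i = 0
    while len(word) > i:
        for grapheme, phonemes in blocks.get(word[i], []):
            if word.startswith(grapheme, i):
                out.append(phonemes)
                i += len(grapheme)
                break
        else:
            # Assumption: a letter without a matching rule maps to itself.
            out.append(word[i])
            i += 1
    return "".join(out)


def to_diphones(symbols):
    """Pair neighbouring symbols, with a pause "_" at both ends."""
    padded = ["_"] + list(symbols) + ["_"]
    return [a + b for a, b in zip(padded, padded[1:])]
```
Because each block is ordered longest rule first, taking the first match is enough to obtain the longest match without any extra sorting.</Paragraph>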
      <Paragraph position="7"> 2.4 Word accent. In Romanian the word accent is free: it usually falls on one of the last two syllables of the word, but there are many words accented elsewhere. Moreover, semantically different words can have the same orthography and differ only in accent. For example: cúrele (cure - plural) vs. curéle (belt - plural); vésele (gay - feminine, plural) vs. veséle (dishes). We are considering ways to formalize these kinds of problems. 2.5 Intonation. To obtain acceptable intonation for unrestricted texts, a set of rules has to be formulated that produces natural-sounding pitch contours for utterances that may never have been spoken. In sentence intonation, one serious problem is to find rules that make the monotonous speech more natural, so that listening to long texts is not uncomfortable. We studied experimentally the pitch contour for different kinds of sentences (declaratives, questions, and exclamations). For declarative sentences, the fundamental frequency rises on the first word (from 100% to 140% of its value, then drops to 125% over the last part of this word) and continues to fall until the end of the sentence, except on the last word, where it falls to 70% and remains constant.</Paragraph>
      <Paragraph position="8"> Questions can be with a Q-word (a specific interrogative word) or without. For the former, the fundamental frequency rises on this word from 100% to 160% and comes back down to 100%. For the latter type of question we adopted a conventional pitch contour, but very subtle intonation effects cannot be handled.</Paragraph>
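      <Paragraph> The percentages given above can be turned into contour breakpoints, as in the Python sketch below. Several details are assumptions of this example and are not stated in the text: the position of the peak inside a word (taken as the middle), the level reached just before the last word (the pre_final parameter), and the function names themselves.
```python
def declarative_contour(n_words, base_hz, pre_final=1.0):
    """Return (word_index, position_in_word, f0_hz) breakpoints.

    position_in_word runs from 0.0 (word start) to 1.0 (word end).
    pre_final is an assumed level at the end of the gradual fall.
    """
    if 2 > n_words:
        raise ValueError("this sketch needs at least two words")
    last = n_words - 1
    points = [
        (0, 0.0, 1.00),  # start of the first word: 100%
        (0, 0.5, 1.40),  # rise to 140% (peak position is an assumption)
        (0, 1.0, 1.25),  # 125% at the end of the first word
    ]
    if n_words > 2:
        points.append((last - 1, 1.0, pre_final))  # end of the fall
    points += [(last, 0.0, 0.70), (last, 1.0, 0.70)]  # last word: 70%
    return [(w, p, round(base_hz * m, 2)) for w, p, m in points]


def q_word_contour(base_hz):
    """Breakpoints (position_in_word, f0_hz) on the Q-word itself."""
    # 100% to 160% and back to 100%; the peak position is an assumption.
    shape = [(0.0, 1.00), (0.5, 1.60), (1.0, 1.00)]
    return [(p, round(base_hz * m, 2)) for p, m in shape]
```
Values between two breakpoints would be obtained by interpolation; the interpolation law is also left open by the text.</Paragraph>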
    </Section>
  </Section>
  <Section position="3" start_page="304" end_page="306" type="metho">
    <SectionTitle>
3. Signal processing
</SectionTitle>
    <Paragraph position="0"> Our latest experiments to improve the quality of the synthesized signal are based on a hybrid time-domain/LPC approach. This approach takes into account the behavior of the glottal pulse (for voiced sounds), which can be described using the Liljencrants-Fant (LF) model [Veldhuis 96].</Paragraph>
    <Paragraph position="1"> Figure 1.a presents the time-domain waveform of the Romanian vowel o, and Figure 1.b the corresponding source signal. As can be seen, during the open phase of the glottis, in which the source signal takes nonzero values (both positive and negative), the source signal excites the filter, producing a waveform that depends on the resonance characteristics of the vocal tract. During the closed phase of the glottis (no pressure wave) the vocal tract, that is the filter, no longer receives energy, so in this phase the generated waveform is a combination of damped oscillations. [Figure 1: Romanian vowel o, uttered by a male (3 pitch periods).]</Paragraph>
    <Paragraph position="2"> If the source signal consisted of a single open phase of the glottis followed by a long closed phase, the generated waveform would be damped, ending with no oscillations. Because in reality the next open phase follows immediately after a relatively short closed phase, the generated waveform contains both the effect of the previous state and the effect of the new excitation. Since the above model is linear, the two effects combine by simple addition, in accordance with the superposition theorem. This is equivalent to considering that the source signal consists of a few individual signals (waveforms c, d, and e), each corresponding to an individual open-closed phase of the glottis, and that each such individual source signal excites the filter, producing an individual output signal (waveforms f, g, and h).</Paragraph>
    <Paragraph position="3"> The superposition of these output signals yields the initial, whole output signal. The waveforms in Figure 1 show such a case for three pitch periods.</Paragraph>
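    <Paragraph> The superposition argument can be checked numerically with a small Python sketch. It assumes a generic all-pole filter with zero initial state and fixed-length pitch periods; the filter coefficients used in any call are arbitrary illustration values, not parameters measured for ROMVOX, and the function names are hypothetical.
```python
def all_pole_filter(excitation, coeffs):
    """y[n] = x[n] + sum_k coeffs[k] * y[n-1-k], zero initial state."""
    y = []
    for n, x in enumerate(excitation):
        acc = x
        for k, a in enumerate(coeffs):
            if n - 1 - k >= 0:
                acc += a * y[n - 1 - k]
        y.append(acc)
    return y


def split_periods(excitation, period):
    """One source signal per pitch period, zero outside that period."""
    pieces = []
    for start in range(0, len(excitation), period):
        piece = [0.0] * len(excitation)
        piece[start:start + period] = excitation[start:start + period]
        pieces.append(piece)
    return pieces


def superposition_error(excitation, period, coeffs):
    """Largest gap between filtering the whole source and summing parts."""
    whole = all_pole_filter(excitation, coeffs)
    parts = [all_pole_filter(p, coeffs)
             for p in split_periods(excitation, period)]
    summed = [sum(samples) for samples in zip(*parts)]
    return max(abs(a - b) for a, b in zip(whole, summed))
```
For a stable filter the returned error stays at the level of floating-point rounding, which is the numerical counterpart of adding the individual output signals.</Paragraph>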
    <Paragraph position="4"> Pitch modification means changing the distance between two consecutive open-closed cycles, so that the effect of the previous cycle is combined with the effect of the new excitation in a different manner, but still exactly in accordance with the superposition theorem. This requires that, in a prior analysis phase, the original signal be decomposed into pitch-synchronous individual signals such as signals f, g and h in Figure 1. In the synthesis phase we only have to superimpose these individual signals at new distances corresponding to the desired new pitch.</Paragraph>
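    <Paragraph> The synthesis step, superimposing individual signals at new distances, can be sketched in Python as a plain overlap-add. The function names are hypothetical, and the sketch assumes that distances are given in whole samples and that the individual signals are already available from the analysis phase.
```python
def onsets_from_periods(periods):
    """Onset sample of each unit, given the distances between units."""
    onsets = [0]
    for p in periods:
        onsets.append(onsets[-1] + p)
    return onsets


def overlap_add(units, onsets):
    """Sum each individual signal into the output at its onset."""
    length = max(o + len(u) for u, o in zip(units, onsets))
    out = [0.0] * length
    for unit, onset in zip(units, onsets):
        for i, sample in enumerate(unit):
            # Overlapping tails simply add up (superposition).
            out[onset + i] += sample
    return out
```
Shorter distances between onsets raise the pitch and longer distances lower it, while the shape of each individual signal is left unchanged.</Paragraph>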
    <Paragraph position="5">  individual pitch-synchronous signal is used to generate a longer output signal with modified pitch. The signal starts with a lower fundamental frequency (one octave lower), which increases to the initial value of the pitch (at the middle of the signal), continuing to increase to higher values (one octave higher).</Paragraph>
    <Paragraph position="6"> The main problem is the decomposition of the initial signal into individual, pitch-synchronous signals. This involves two tasks. The first is to determine the evolution of the damped regime for each individual signal. As presented before, this damped signal is due to the energy accumulated in the filter and is determined by the resonance characteristics of the filter. We used LPC analysis, one of the most widely used methods for determining the filter characteristics of the vocal tract. If the parameters of the filter are determined and the filter is placed in the state it has at the beginning of the closed phase of the glottis, it automatically generates the desired damped signal, which can last over 2-3 further pitch periods. The second task is to eliminate the effect of this newly determined damped signal on the next pitch-synchronous individual signal (or signals). This can be done by simply subtracting the current determined individual signal from the initial one.</Paragraph>
    <Paragraph position="7"> These two operations will be performed consecutively for the whole signal, and each intermediate individual signal is saved in a database (sound inventory). Because the sound inventory contains diphones, the above procedure must be applied for each diphone.</Paragraph>
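    <Paragraph> The two operations, generating the damped signal from the filter state and subtracting the resulting individual signal, could be sketched in Python as below. This is an illustration under stated assumptions, not the ROMVOX procedure: the LPC coefficients and the opening and closure instants of the glottis are taken as already known, every closure index is at least as large as the filter order, and the function names and the tail_len parameter are hypothetical.
```python
def ringing(state, coeffs, n):
    """Zero-input response: y[m] = sum_k coeffs[k] * y[m-1-k].

    state holds the most recent samples, newest last, and must be at
    least as long as coeffs.
    """
    history = list(state)
    out = []
    for _ in range(n):
        y = sum(a * history[-1 - k] for k, a in enumerate(coeffs))
        history.append(y)
        out.append(y)
    return out


def decompose(signal, coeffs, openings, closures, tail_len):
    """Split signal into one individual signal per pitch period."""
    rest = list(signal)
    order = len(coeffs)
    units = []
    for t, c in zip(openings, closures):
        driven = rest[t:c]  # samples while the glottis is open
        # Damped signal generated from the state at the closure instant.
        tail = ringing(rest[c - order:c], coeffs, tail_len)
        unit = (driven + tail)[:len(rest) - t]  # clip at the signal end
        for i, v in enumerate(unit):
            rest[t + i] -= v  # remove its effect on the later periods
        units.append(unit)
    return units
```
Each returned unit is what would be stored in the sound inventory; choosing tail_len so that it covers 2-3 pitch periods follows the duration mentioned above.</Paragraph>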
  </Section>
</Paper>