<?xml version="1.0" standalone="yes"?>
<Paper uid="W98-0804">
  <Title>Grapheme-to-phoneme transcription rules for Spanish, with application to automatic speech recognition and synthesis</Title>
  <Section position="2" start_page="0" end_page="36" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> Large phonetic corpora including both standard and variant transcriptions are available for many languages. However, applications requiring the use of dynamic vocabularies make it necessary to transcribe words not present in the dictionary. Also, alternative pronunciations added to the standard forms have been shown to improve recognition accuracy. Therefore, new techniques to automatically generate pronunciation variants have been investigated and proven to be very effective. However, rule-based systems still remain useful to generate standard transcriptions not previously available or to build new corpora, oriented chiefly to synthesis applications.</Paragraph>
    <Paragraph position="1"> The present paper describes a letter-to-phone conversion system for Spanish designed to supply transcriptions to the flexible-vocabulary speech recogniser and to the synthesiser, both developed at CSELT (Centro Studi e Laboratori Telecomunicazioni), Turin, Italy. Different sets of rules are designed for the two applications. The symbol inventories also differ, although the IPA alphabet is the reference system for both. The rules have been written in ANSI C, implemented on DOS and Windows 95, and can be selectively applied. Two speech corpora have been transcribed by means of these grapheme-to-phoneme conversion rules: a) the SpeechDat Spanish corpus, which includes 4444 words extracted from the phonetically balanced sentences of the database; b) a corpus designed to train an automatic aligner to segment units for synthesis, composed of 303 sentences (3240 words) and 338 isolated words; the rule-based transcriptions of this corpus were manually corrected.</Paragraph>
    <Paragraph position="2"> The phonetic forms obtained by the rules satisfactorily matched the reference transcriptions: most mistakes on the first corpus were caused by the presence of secondary stresses in the SpeechDat transcriptions, which are not assigned by the rules, whereas errors on the synthesis corpus appeared mostly on hiatuses and on words of foreign origin.</Paragraph>
    <Paragraph position="3"> Further developments oriented to recognition may involve the addition of rules to account for Latin American pronunciations (especially Mexican, Argentinian and Paraguayan); for synthesis, on the other hand, rules representing coarticulatory phenomena at word boundaries can be implemented, in order to transcribe whole sentences.</Paragraph>
    <Paragraph position="4"> Introduction
Grapheme-to-phoneme conversion is an important prerequisite for many applications involving speech synthesis and recognition [1].</Paragraph>
    <Paragraph position="5"> Large corpora used for these applications (e.g. WSJ, CMU, Oxford Pronunciation Dictionary, ONOMASTICA, SpeechDat) include phonetic transcriptions both for standard pronunciations and for variants, which can represent either differences in dialectal or individual realisation of single words (intra-word variants) [2, 3] or variations in the standard form produced by coarticulation between words (inter-word variants) [4].</Paragraph>
    <Paragraph position="7"> These alternative pronunciations have been shown to improve recognition accuracy [5], and they need to be present in large phonetic databases: the variants can be produced either manually, on the basis of expert phonetic knowledge, or by a rule-based system. However, maintenance of such systems is complex, because the insertion of new rules often changes the overall performance of the module.</Paragraph>
    <Paragraph position="8"> Therefore, new techniques to automatically derive rules for grapheme-to-phoneme conversion from training data have been investigated. Generally, rules are obtained through forced recognition, according to the following procedure: 1) the canonical pronunciation is aligned to the alternative ones by means of a dynamic programming algorithm, in order to generate an aligned database; 2) this database is used to train a statistical model or a binary decision tree to generate variants of words or proper names [1] [3] [6] [5], or to model context-dependent variations at word boundaries [4]; neural networks can also be used to generate variants in the pronunciation of words [2] or of surnames [7], on the basis of pre-aligned or non-aligned training data [8]. Finally, a mixed approach combining knowledge obtained from training data with a priori phonetic expertise has also been experimented with, to derive possible non-native pronunciations of English and Italian words [9].</Paragraph>
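    As a rough illustration of step 1) above, the following ANSI C sketch aligns a canonical phone string with a variant by a simple edit-distance dynamic programming recursion and prints the resulting phone-to-phone correspondences. The phone symbols, cost scheme and example word are illustrative assumptions, not data or code from the cited systems.

/* Minimal sketch of step 1): dynamic-programming alignment of a canonical
 * pronunciation with a variant, yielding phone-to-phone correspondences.
 * Phone symbols, costs and the example word are illustrative assumptions,
 * not data or code from the cited systems. */
#include <stdio.h>
#include <string.h>

#define MAXP 32

static int cost(const char *a, const char *b)
{
    return strcmp(a, b) == 0 ? 0 : 1;      /* 0 for a match, 1 otherwise */
}

static int min3(int a, int b, int c)
{
    int m = a < b ? a : b;
    return m < c ? m : c;
}

/* Align canonical phones c[0..nc-1] with variant phones v[0..nv-1] and
 * print the aligned pairs from the end of the word to the beginning
 * ("-" marks an insertion or deletion). */
static void align(const char *c[], int nc, const char *v[], int nv)
{
    int d[MAXP + 1][MAXP + 1];
    int i, j;

    if (nc > MAXP || nv > MAXP)
        return;
    for (i = 0; i <= nc; i++) d[i][0] = i;
    for (j = 0; j <= nv; j++) d[0][j] = j;
    for (i = 1; i <= nc; i++)
        for (j = 1; j <= nv; j++)
            d[i][j] = min3(d[i - 1][j - 1] + cost(c[i - 1], v[j - 1]),
                           d[i - 1][j] + 1,     /* deletion  */
                           d[i][j - 1] + 1);    /* insertion */

    i = nc; j = nv;                             /* backtracking */
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            d[i][j] == d[i - 1][j - 1] + cost(c[i - 1], v[j - 1])) {
            printf("%s -> %s\n", c[i - 1], v[j - 1]); i--; j--;
        } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
            printf("%s -> -\n", c[i - 1]); i--;
        } else {
            printf("- -> %s\n", v[j - 1]); j--;
        }
    }
}

int main(void)
{
    /* hypothetical canonical vs. variant transcription of one word */
    const char *canon[]   = { "rr", "e", "i", "r" };
    const char *variant[] = { "rr", "'e", "j", "r" };

    align(canon, 4, variant, 4);
    return 0;
}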
    <Paragraph position="9"> All these techniques have proven to be very effective in generating plausible alternatives to canonical pronunciations. However, rule-based approaches can still represent an effective tool to automatically obtain standard transcriptions of large corpora built ad hoc for special applications, in particular those oriented to synthesis: a letter-to-phone rule component is very suitable to represent allophonic and allomorphic variations [10] [11] [12], which are essential to allow segmentation and diphone extraction from an acoustic database [13].</Paragraph>
    <Paragraph position="10"> The rule system described in the present paper was developed on the basis of phonetic knowledge [14] [15] and has two different application domains, which imply different transcription requirements: the recogniser for Spanish uses sub-word units [16] [17] linked to the phonetic representation of isolated words; the units have been trained on the corpus of words extracted from the phonetically balanced sentences included in the SpeechDat database.</Paragraph>
    <Paragraph position="11"> Therefore, the SpeechDat corpus has been considered as the reference set of words that the conversion rules minimally had to transcribe correctly. Only isolated words were used, with the same phoneme inventory employed in the original SpeechDat transcriptions, which includes no allophones.</Paragraph>
    <Paragraph position="12"> On the other hand, the corpus for synthesis was selected to collect speech material to train the automatic phonetic aligner, in order to extract diphones for a concatenative synthesis system [18], and had to meet different requirements: a) units were to be pronounced both in isolated words and in sentences; b) the phoneme inventory had to include the maximum number of allophones, so as to allow building a representative acoustic dictionary containing occurrences of all units and sequences in every appropriate segmental and prosodic context (stressed and unstressed syllables; initial and final position in the syllable; initial, internal and final position in the sentence; short and long sentences).</Paragraph>
    <Paragraph position="13"> Therefore, two partially different sets of rules have been designed for synthesis and recognition, which can be alternatively activated: the latter are a subset of the former. Both systems provide only  one variant in output, i.e. the standard Castilian (Madrid) Spanish pronunciation.</Paragraph>
    <Paragraph position="14"> 1. Orthographic and phonetic symbols
The orthographic string in input is preprocessed to avoid problems related to the configuration of the operating system: at this stage, letters with diacritics (corresponding to non-standard ASCII characters) can either be represented by means of extended ASCII (e.g. 'ñ' = ext. ASCII code 241) or be converted into a sequence of standard ASCII symbols ('n~' = st. ASCII 110+126). This preprocessing is common to both the recognition and the synthesis systems.</Paragraph>
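    A minimal ANSI C sketch of this preprocessing step is shown below: it rewrites extended-ASCII letters as standard-ASCII sequences. The mapping table and the function names are illustrative assumptions, not the actual CSELT preprocessor.

/* Sketch of the input preprocessing: letters with diacritics arriving as
 * extended-ASCII codes are rewritten as standard-ASCII sequences
 * (e.g. code 241 -> "n~").  The mapping table below is an illustrative
 * assumption, not the actual CSELT conversion table. */
#include <stdio.h>

struct diacritic { unsigned char ext; const char *std; };

static const struct diacritic map[] = {
    { 0xF1, "n~" },   /* 241: n with tilde     */
    { 0xE1, "a'" },   /* 225: a with acute     */
    { 0xE9, "e'" },   /* 233: e with acute     */
    { 0xED, "i'" },   /* 237: i with acute     */
    { 0xF3, "o'" },   /* 243: o with acute     */
    { 0xFA, "u'" },   /* 250: u with acute     */
    { 0xFC, "u\"" },  /* 252: u with diaeresis */
};

/* Copy 'in' to 'out', expanding extended-ASCII letters. */
static void preprocess(const char *in, char *out)
{
    size_t i, k;
    for (; *in; in++) {
        for (i = 0; i < sizeof map / sizeof map[0]; i++)
            if ((unsigned char)*in == map[i].ext)
                break;
        if (i < sizeof map / sizeof map[0]) {
            for (k = 0; map[i].std[k]; k++)
                *out++ = map[i].std[k];
        } else {
            *out++ = *in;
        }
    }
    *out = '\0';
}

int main(void)
{
    char out[64];
    const char in[] = { 'E', 's', 'p', 'a', (char)0xF1, 'a', 0 }; /* "España" */
    preprocess(in, out);
    printf("%s\n", out);   /* prints "Espan~a" */
    return 0;
}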
    <Paragraph position="15"> The phonetic symbols used for recognition represent the 30 basic Spanish phones, which are also common to synthesis. For this latter system, however, 11 extra symbols have also been added to represent a set of allophones of standard Spanish whose acoustic structure is clearly differentiated from that of the phones already included. These symbols represent stressed vowels; semivowels (i.e. [i] and [u] allophones in second position of falling diphthongs, distinguished from semi-consonants, or glides, i.e. [i] and [u] allophones in first position of rising diphthongs); the labiodental nasal allophone [ɱ]; the palatal voiced stop [ɟ], which accounts for the distribution of the palatal voiced approximant [j] in initial position or after 'l' or 'n' (e.g. 'cónyuge' = [k 'o Gn J u x e]); and the interdental voiced fricative [ð], which accounts for the distribution of the unvoiced interdental fricative [θ] at the end of a syllable before a voiced consonant (e.g. 'llovizna' [L o . B 'i Zh . n a]). Finally, two phones typical of the most frequent foreign loan words were added, i.e. the unvoiced dental affricate [ts] (e.g. 'pizza') and the unvoiced palatal fricative [ʃ] (e.g. 'flash'). This gives a final set of 43 synthesis symbols.</Paragraph>
    <Paragraph position="19"> [Table: inventory of phonetic symbols used for recognition and synthesis (bold: synthesis allophones; underlined: phones from loan-words)]
2. Rule component
The rule module is composed of a) table look-ups containing pre-stressed roots with hiatuses and words that do not take stress within a sentence; b) stress assignment rules; c) transcription rules.</Paragraph>
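    The following schematic ANSI C sketch shows how the three components a)-c) could be chained; the function names, the single look-up entry and the placeholder bodies are hypothetical and only illustrate the control flow, not the actual module.

/* Schematic sketch of the three-stage rule module described above:
 * (a) table look-up of pre-stressed roots / unstressed function words,
 * (b) stress assignment, (c) letter-to-phone transcription rules.
 * All function names and the look-up entry are hypothetical. */
#include <stdio.h>
#include <string.h>

/* (a) exception table consulted before any rule applies */
static const char *prestressed[][2] = {
    { "maria", "m a . r 'i . a" },   /* illustrative entry only */
};

static const char *lookup(const char *word)
{
    size_t i;
    for (i = 0; i < sizeof prestressed / sizeof prestressed[0]; i++)
        if (strcmp(word, prestressed[i][0]) == 0)
            return prestressed[i][1];
    return NULL;
}

/* (b) stress assignment: placeholder, see the procedure in section 2.1 */
static void assign_stress(char *word) { (void)word; }

/* (c) transcription rules: placeholder for the vowel and consonant rules */
static void transcribe(const char *word, char *phones)
{
    strcpy(phones, word);            /* identity stand-in */
}

int main(void)
{
    char word[] = "ovillo", phones[128];
    const char *hit = lookup(word);

    if (hit) {                       /* (a) exception found         */
        printf("%s\n", hit);
    } else {                         /* (b) + (c) regular rule path */
        assign_stress(word);
        transcribe(word, phones);
        printf("%s\n", phones);
    }
    return 0;
}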
    <Paragraph position="20"> Vowels are transcribed before consonants. The main complexity in vowel conversion consists in disambiguation of diphthongs and hiatuses: stress position is crucial for correct transcription of these vowel sequences. However, in the rule component, they undergo a different treatment for recognition and synthesis, which is illustrated in the following, before a description of the consonant rules.</Paragraph>
    <Paragraph position="21"> 2.1. Diphthong and hiatus rules
2.1.1. Recognition
Rules for recognition do not transcribe diphthongs and hiatuses according to the stress position. In fact, the SpeechDat transcriptions, which the conversion rules have to reproduce, always stress the first element of a vowel sequence and transcribe all closed vowels as glides (e.g. 'rehúye' [rr 'e w . jj e], 'reír' [rr 'e j r], 'oír' ['o j r]). This target can be attained by deterministic rules that account for three realisations of the letter 'u': a) deletion, b) full vowel [u] and c) semivowel [w].</Paragraph>
    <Paragraph position="22"> In particular, realisation (a) applies when the letter 'u' (henceforth letters are enclosed in single quotes) appears within the sequences 'gu', 'qu' before front vowels (e.g. 'burguesía' [b u r . G e . s 'i . a]); transcription (b) occurs when 'u' either precedes a rising diphthong or follows a consonant different from 'g', 'q' and is the stressed first element of a hiatus (e.g. 'cuyo' [k u . jj o], 'muy' [m 'u j]).</Paragraph>
    <Paragraph position="23"> In all other positions, both as the first element of a rising diphthong ('abuela' [a . B w 'e . l a]) and as the second element of a falling diphthong ('acaudalados' [a . k a w . D a . l 'a . D o s]), 'u' is transcribed as the glide [w].</Paragraph>
    <Paragraph position="24"> On the other hand, the letter 'i' can be transcribed either as the voiced palatal approximant [jj] when it occurs after 'h' before a vowel (e.g. 'hiedra' [jj 'e . D r a]), or as the glide [j] when it is the second element of a falling diphthong ('afeitar' [a . f e j . t 'a r], 'prohibido' [p r o j . B 'i . D o]) or in first position of a rising diphthong ('sociedad' [s o . T j e . D 'a D]). Otherwise, when stressed, it is realised as [i] ('pingüino' [p i N . g w 'i . n o]). Most of these transcriptions are incorrect from a linguistic point of view, but they are functional to the recogniser they are designed for, which distinguishes neither semi-vowels from full vowels nor unstressed vowels from stressed ones.</Paragraph>
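    A much simplified ANSI C sketch of the three recognition-time realisations of 'u' described above is given below. The context tests are deliberately reduced (a real rule also needs stress and hiatus information), and the function name and the output strings are assumptions.

/* Very simplified sketch of the recognition-time realisations of letter
 * 'u': (a) deletion inside 'gu'/'qu' before a front vowel, (b) full
 * vowel [u], (c) glide [w].  The test for case (b) is reduced to "the
 * next letter is not a vowel", so this is only an illustration. */
#include <stdio.h>

static int is_front_vowel(char c) { return c == 'e' || c == 'i'; }
static int is_vowel(char c)
{
    return c == 'a' || c == 'e' || c == 'i' || c == 'o' || c == 'u';
}

/* prev = letter before 'u', next = letter after 'u' ('\0' if none) */
static const char *realise_u(char prev, char next)
{
    if ((prev == 'g' || prev == 'q') && is_front_vowel(next))
        return "";        /* (a) 'burguesia' -> b u r . G e . s 'i . a  */
    if (prev != 'g' && prev != 'q' && is_vowel(next) == 0)
        return "u";       /* (b) e.g. 'muy' -> m 'u j (simplified test) */
    return "w";           /* (c) 'abuela' -> a . B w 'e . l a           */
}

int main(void)
{
    printf("burguesia: 'u' -> \"%s\"\n", realise_u('g', 'e'));
    printf("muy:       'u' -> \"%s\"\n", realise_u('m', 'y'));
    printf("abuela:    'u' -> \"%s\"\n", realise_u('b', 'e'));
    return 0;
}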
    <Paragraph position="25"> 2.1.2. Synthesis
However, correct rendition of hiatuses and diphthongs is crucial for synthesis, in order to select the appropriate corresponding units. A different, more complex treatment of these groups is therefore required, which involves: a) initial retrieval of pre-stressed roots containing hiatuses from a table look-up; b) stress assignment to the other vowel sequences; c) transcription according to stress position.</Paragraph>
    <Paragraph position="26"> Only primary stress is assigned, by the following procedure, which checks whether:
a) the word ends either in a simple vowel (i.e. one preceded by a consonant, e.g. 'moc'illa', 'ov'illo') or in a rising diphthong (i.e. in a vowel preceded by 'i', 'u' or 'y', e.g. 'l'impio', ' 'agua', 'desm'ayo'); then stress is assigned to the vowel preceding the last vowel or diphthong, if it is 'a, e, o' (e.g. 'Can'aria'); 'i', 'u' and 'y' can also be stressed in that position, if they are preceded by 'qu', 'cu', 'gu' or by a consonant, or if they are initial (e.g. 'chiqu'illo', 'engu'anta', 'b'urgo', 'argent'ina', ' 'uma');
b) the word ends in a vowel preceded by a vowel different from 'i, u, y'; then the second-last vowel is stressed (e.g. 'Paragu'ay', 'can'oa', 'Bilb'ao', 'cefal'ea');
c) the word ends in 'n' or 's' preceded by a simple vowel; then stress falls on the second-last vowel if it is 'a, e, o' (e.g. ' 'orden', 'c'asas'); if the second-last vowel is 'i', 'u' or 'y' and it is either initial, or preceded by a consonant, or preceded by 'qu', 'cu', 'gu', then even 'i, u, y' can be stressed (e.g. ' 'umas', 'b'urgos', 'chiqu'illos');
d) the word ends in a consonant different from 'n, s' preceded by a single vowel; then that vowel is stressed (e.g. 'pap'el', 'muj'er'). A simplified sketch of this default stress assignment is given below.</Paragraph>
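    The ANSI C sketch below implements only the coarse default (words ending in a vowel, 'n' or 's' stress the second-last vowel, all others the last one) and ignores the finer 'i/u/y', diphthong and look-up conditions of cases a)-c); it is an illustration under those assumptions, not the actual rule set.

/* Reduced sketch of the default stress-assignment procedure above:
 * words ending in a vowel, 'n' or 's' stress the second-last vowel,
 * all other words stress the last vowel.  Diphthongs, the 'i/u/y'
 * conditions and the exception look-up are deliberately ignored. */
#include <stdio.h>
#include <string.h>

static int is_vowel(char c)
{
    return c=='a' || c=='e' || c=='i' || c=='o' || c=='u' || c=='y';
}

/* Return the index of the vowel that receives primary stress,
 * or -1 if the word contains no vowel. */
static int default_stress(const char *w)
{
    int n = (int)strlen(w), i, count = 0, want;
    char last;

    if (n == 0)
        return -1;
    last = w[n - 1];
    /* ends in vowel, 'n' or 's' -> penultimate vowel, else last vowel */
    want = (is_vowel(last) || last == 'n' || last == 's') ? 2 : 1;

    for (i = n - 1; i >= 0; i--)
        if (is_vowel(w[i]) && ++count == want)
            return i;
    return -1;
}

int main(void)
{
    const char *tests[] = { "ovillo", "casas", "orden", "papel", "mujer" };
    size_t t;
    for (t = 0; t < sizeof tests / sizeof tests[0]; t++) {
        int pos = default_stress(tests[t]);
        printf("%-8s stress on '%c' (index %d)\n",
               tests[t], pos >= 0 ? tests[t][pos] : '?', pos);
    }
    return 0;
}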
    <Paragraph position="33"> Stressed vowels in the sequence are then transcribed as full vowels and unstressed ones either as semi-vowels when in second position of falling diphthongs ('afeitar' [a . f e i~ . t 'a r]), or as semi-consonants if in first position of a rising diphthong ('propia' [p r 'o . p j a]).</Paragraph>
    <Section position="1" start_page="35" end_page="36" type="sub_section">
      <SectionTitle>
2.2. Consonant rules
</SectionTitle>
      <Paragraph position="0"> Consonants also undergo a different treatment for synthesis and recognition: 'b, d, g' are transcribed as the voiced stops [b d g] when initial (e.g. 'bueno'), when preceded by a homorganic nasal (e.g. 'hombre', 'conde', 'mingo') or, for the dental stop, when preceded by [l] (e.g. 'toldo'). Otherwise, when they are internal and preceded by a consonant different from a nasal, they are realised as the corresponding voiced bilabial or velar fricative [β, ɣ] or dental approximant [ð] (e.g. 'amaba', 'an'uga', 'crudo').</Paragraph>
      <Paragraph position="3"> For synthesis, voiced stops are devoiced when they precede an unvoiced phone.</Paragraph>
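      A minimal ANSI C sketch of the 'b, d, g' rule is given below; the symbols "B", "D", "G" stand in for the bilabial, dental and velar approximant allophones and are an assumed notation, and the synthesis-time devoicing step is not modelled.

/* Sketch of the 'b, d, g' rule above: a stop is kept word-initially,
 * after a nasal, or (for 'd') after 'l'; elsewhere the voiced
 * approximant allophone is produced.  "B", "D", "G" are stand-in
 * labels, not the paper's exact symbol set. */
#include <stdio.h>

/* prev is the preceding letter, or '\0' at the start of the word */
static const char *realise_bdg(char letter, char prev)
{
    int nasal = (prev == 'm' || prev == 'n');

    if (prev == '\0' || nasal || (letter == 'd' && prev == 'l'))
        switch (letter) {               /* stop allophone */
        case 'b': return "b";
        case 'd': return "d";
        default:  return "g";
        }
    switch (letter) {                   /* approximant allophone */
    case 'b': return "B";
    case 'd': return "D";
    default:  return "G";
    }
}

int main(void)
{
    printf("bueno : b -> %s\n", realise_bdg('b', '\0'));  /* initial  */
    printf("hombre: b -> %s\n", realise_bdg('b', 'm'));   /* after m  */
    printf("toldo : d -> %s\n", realise_bdg('d', 'l'));   /* after l  */
    printf("amaba : b -> %s\n", realise_bdg('b', 'a'));   /* approx.  */
    printf("crudo : d -> %s\n", realise_bdg('d', 'u'));   /* approx.  */
    return 0;
}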
      <Paragraph position="4"> 'p, t, k, c' are transcribed in the following way: 'p' is deleted before 's' (e.g. 'psicólogo'), otherwise it is realised as the corresponding bilabial stop [p] (e.g. 'papel').</Paragraph>
      <Paragraph position="5"> 'c' is realised as the unvoiced (inter)dental fricative [θ] before a front vowel (e.g. 'excepción', 'ceso') or as a velar voiced fricative [ɣ] before 'd, n' (e.g. 'anécdotas', 'técnica'). For synthesis, [p t k] are converted into the corresponding voiced approximant allophones before a voiced consonant (e.g. 'atmósfera' [a Dh m 'o s f e r a]).</Paragraph>
      <Paragraph position="7"> Nasals assimilate the place of articulation of the following consonant and are transcribed with the corresponding allophones (e.g. 'amplio' [a m p l j o], 'chanfla' [TS 'a M f l a], 'berrinche' [b e r: 'i Gn TS e], 'ángulo' ['a N g u l o]).</Paragraph>
      <Paragraph position="8"> 'r' is realised as the geminate (trilled) [r:] when initial, in the 'rr' digraph, or after 'n', 'l', 's' (e.g. 'burrito', 'redondo', 'honra', 'alrededores'); otherwise it is transcribed as the alveolar flap [r].</Paragraph>
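      The 'r' rule can be sketched in ANSI C as follows. Each letter position is classified independently here, so a real implementation would additionally consume the 'rr' digraph as a single unit; the function name and context handling are assumptions.

/* Sketch of the 'r' rule above: the trilled (geminate) [r:] appears
 * word-initially, in the 'rr' digraph, and after 'n', 'l' or 's';
 * elsewhere the single flap [r] is produced.  The "r:" / "r" labels
 * follow the notation of the paper's example transcriptions. */
#include <stdio.h>

/* prev is the letter before this 'r', or '\0' at the start of the word;
 * next is the following letter, or '\0' at the end of the word. */
static const char *realise_r(char prev, char next)
{
    if (prev == '\0' || prev == 'r' || next == 'r' ||
        prev == 'n' || prev == 'l' || prev == 's')
        return "r:";
    return "r";
}

int main(void)
{
    printf("redondo    : %s\n", realise_r('\0', 'e'));  /* initial      */
    printf("burrito    : %s\n", realise_r('u', 'r'));   /* 'rr' digraph */
    printf("honra      : %s\n", realise_r('n', 'a'));   /* after 'n'    */
    printf("alrededores: %s\n", realise_r('l', 'e'));   /* after 'l'    */
    printf("pero       : %s\n", realise_r('e', 'o'));   /* flap         */
    return 0;
}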
      <Paragraph position="9"> 'z' is realised as the (inter)dental voiced approximant [ð] before a voiced consonant (e.g. 'juzgar', 'hallazgo', 'gozne'), otherwise as the unvoiced (inter)dental fricative [θ] (e.g. 'azteca', 'razón', 'zapata').</Paragraph>
      <Paragraph position="10"> 'v' is transcribed as [b] when initial or preceded by [n] (e.g. 'verdad', 'conviene'), otherwise as the bilabial voiced approximant (e.g. 'ovillo'). 'x' is transcribed as [ks] when initial (in Catalan words present in SpeechDat, e.g. 'Xavier') or as [ɣ s] when followed by a vowel or 'h' (e.g. 'examen', 'exhortación'); for synthesis, 'x' in these contexts is realised as [k s].</Paragraph>
      <Paragraph position="12"> 'y' is always transcribed as the palatal voiced approximant [j] in every context for recognition, whereas two allophones are distinguished for synthesis: when initial, or internal after [l] or [n], 'y' is realised as the palatal voiced stop [ɟ] (e.g. 'yelmo', 'inyectar'), otherwise as the palatal approximant [j] (e.g. 'cayado').</Paragraph>
    </Section>
  </Section>
</Paper>