<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-1611">
  <Title>A Transcription Scheme for Languages Employing the Arabic Script Motivated by Speech Processing Application</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Phonetic Labels (USCPron)
</SectionTitle>
    <Paragraph position="0"> One of the requirements of an ASR system is a phonetic transcription scheme to represent the pronunciation patterns for the acoustic models.</Paragraph>
    <Paragraph position="1"> Persian has a total of 29 sounds in its inventory: six vowels (Section 2.1) and 23 consonants (Section 2.2). The system that we created to capture these sounds is a modified version of the International Phonetic Alphabet (IPA), called USCPron(unciation). In USCPron, as in the IPA, there is a one-to-one correspondence between the sounds and the symbols representing them.</Paragraph>
    <Paragraph position="2"> However, unlike the IPA, this system does not require special fonts and uses only ASCII characters.</Paragraph>
    <Paragraph position="3"> The advantage of our system over systems that use two characters to represent a single sound is that, following the IPA, it represents each sound with a single symbol and thereby avoids all ambiguities.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 Vowels
</SectionTitle>
      <Paragraph position="0"> Persian has a six-vowel system, with vowels distinguished by height (high to low) and by backness (front and back). These vowels are [i, e, a, u, o, A], as exemplified by the italicized vowels in the following English examples: beat, bet, bat, pull, poll and pot. The high and mid vowels are represented by the IPA symbols. The low front vowel is represented as [a], while the low back vowel is represented as [A]. There are no diphthongs in Persian, nor is there a tense/lax distinction among the vowels (Windfuhr, Gernot</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Consonants
</SectionTitle>
      <Paragraph position="0"> In addition to the six vowels, there are 23 distinct consonantal sounds in Persian. Voicing is phonemic in Persian, giving rise to a quite symmetric system. These consonants are represented in Table 3 based on the place of articulation (bilabial (BL), labio-dental (LD), dental (DE), alveopalatal (AP), velar (VL), uvular (UV) and glottal (GT)), the manner of articulation (stops (ST), fricatives (FR), affricates (AF), liquids (LQ), nasals (NS) and glides (GL)) and their voicing ([-v(oice)] and [+v(oice)]).</Paragraph>
      <Paragraph position="2"> Many of these sounds are similar to English sounds. For instance, the stops [p, b, t, d, k, g] are similar to the italicized letters in the following English words: potato, ball, tree, doll, key and dog, respectively. The glottal stop [?] can be found in some pronunciations of button, and as the sound between the two syllables of uh oh. The uvular stop [q] has no counterpart in English, nor does the velar fricative [x]. The rest of the fricatives, [f, v, s, z, S, Z, h], do have corresponding sounds in English, as demonstrated by the following examples: fine, value, sand, zero, shore, pleasure and hello. The affricates [C] and [J] are like their English counterparts in church and judge. The same is true of the nasals [m, n], as in make and no; the liquids [r, l], as in rain and long; and the glide [y], as in yesterday. (The only distinction between Persian and English is that in Persian [t, d, s, z, l, r, n] are dental sounds, while in English they are alveolar.) As is evident, whenever possible, the symbols used are those of the International Phonetic Alphabet (IPA).</Paragraph>
      <Paragraph position="3"> However, as mentioned before, because the IPA requires special fonts, which are not readily available, for a few of the sounds we have used an ASCII symbol that resembles the relevant IPA symbol. The only differences between our symbols and those of the IPA are in the voiceless and voiced alveopalatal fricatives [S] and [Z], the voiceless and voiced affricates [C] and [J], and the palatal glide [y]. In the case of the latter, we avoided the lowercase j in order to reduce confusion.</Paragraph>
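The practical consequence of the one-symbol-per-sound design can be sketched in a few lines of Python. This is a minimal illustration, not part of the original system: the symbol sets are copied from the sounds listed in Sections 2.1 and 2.2, while the names VOWELS, CONSONANTS, INVENTORY and split_phones are assumptions introduced here.

```python
# USCPron inventory as listed in Sections 2.1 and 2.2.
# Every sound is written with exactly one ASCII character.
VOWELS = set("ieauoA")                        # six vowels
CONSONANTS = set("pbtdkg?qfvszSZxhCJmnrly")   # 23 consonants
INVENTORY = VOWELS | CONSONANTS               # 29 sounds in total


def split_phones(transcription):
    """Split a USCPron string into phones.

    Because each phone is a single character, segmentation is just
    character-by-character iteration; no lookahead is ever needed.
    """
    phones = []
    for symbol in transcription:
        if symbol not in INVENTORY:
            raise ValueError(f"not a USCPron symbol: {symbol!r}")
        phones.append(symbol)
    return phones


print(split_phones("koloft"))
```

With a scheme that writes some sounds as two characters, the same function would need to decide whether a pair of characters is one sound or two; here that question never arises.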
    </Section>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Orthographic Labels (USCPers)
</SectionTitle>
    <Paragraph position="0"> In this section we present an alternative orthographic system for Persian, as a first step in the creation of the USCPers+ system that will be presented later. The Persian writing system is a consonantal system with 32 letters in its alphabet (Windfuhr, 1987). All but four of these letters are direct borrowings from the Arabic writing system. It is important to note that this borrowing was not total, i.e., many letters were borrowed without their corresponding sound. This has resulted in many letters having the same sound (homophones). However, before discussing these cases, let us consider the cases in which there is no homophony, i.e., the cases in which a single sound is represented by a single letter of the alphabet.</Paragraph>
    <Paragraph position="1"> To assign a symbol to each letter of the alphabet, we chose the Latin letter that represents the sound of that letter. So, for instance, for the letter پ, which is represented as [p] in USCPron, the letter p was used in USCPers(ian).</Paragraph>
    <Paragraph position="2"> These letters are:</Paragraph>
    <Paragraph position="4"/>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Non-Homophonic Consonants
</SectionTitle>
      <Paragraph position="0"> As mentioned above, this partial borrowing of the Arabic writing system has given rise to many homophonic letters. In fact, thirteen letters of the alphabet represent only five sounds.</Paragraph>
      <Paragraph position="1"> These sounds and the corresponding letters are presented below: In these cases, several strategies were used. If there were two letters with the same sound, the lowercase and the uppercase letters were used, as in Table 5. In all these cases, the lowercase letter is assigned to the more widely used letter and the uppercase letter to the other.</Paragraph>
      <Paragraph position="2"> [t]: a30 t, a31 T; [q]: a32 q, a33 Q; [h]: a34 h, a35 H. Table 5 USCPers(ian) Symbols:</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Homophonic Consonants 1
</SectionTitle>
      <Paragraph position="0"> In the case of the letters represented as [s] and [z] in USCPron, because the corresponding uppercase letters were already assigned, other symbols were chosen. For the letters sounding [s], s, $ and &amp; and for the letters sounding [z], z, 2,  These letters are not the only ambiguous letters in Persian. The letters ی and و can each be used as a consonant as well as a vowel: [y] and [i] in the case of the former, and [v], [o] and [u] in the case of the latter. However, in USCPers, the symbols y and v were assigned to them, leaving the pronunciation differences for USCPron to capture. For instance, the word for you is written as tv in USCPers but pronounced as [to], and the word for but is written as vly and pronounced as [vali].</Paragraph>
      <Paragraph position="1"> As is characteristic of languages employing the Arabic script, vowels are for the most part not represented, and Persian is no exception. The only letter in the alphabet that represents a vowel is the letter alef. This letter has different appearances depending on where it appears in a word. In word-initial position it appears as آ; elsewhere it is represented as ا. Because the dominant sound that this letter represents is [A], the letter A was assigned to represent ا, which has a wider distribution; V was assigned to the more restricted version آ. In Persian, as in Arabic, diacritics mark the vowels, although they are not used in writing except to avoid ambiguity. Therefore, in our system, we ignored the diacritics.</Paragraph>
      <Paragraph position="2"> Finally, in creating the one-to-one mapping between the Persian alphabet and USCPers, we need to deal with the issue of pure Arabic letters that appear in a handful of words. We see the same situation in words borrowed into English: for instance, the italicized letters in cañon or naïve are not among the letters of the English alphabet, but they appear in some words used in English. In order to ensure a one-to-one representation between the orthography and USCPers, these letters were each assigned a symbol, as presented in Table 7.</Paragraph>
      <Paragraph position="3"> USCPers, therefore, provides us with a way to capture each letter of the alphabet with one and only one ASCII symbol, creating a system for the orthography that is comparable to USCPron.</Paragraph>
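A one-to-one letter mapping of this kind can be sketched as a pair of lookup tables. The excerpt below is an assumption-laden illustration rather than the actual USCPers table: it covers only the handful of letters needed for the examples tv and vly given above, the assignment of code points to symbols is inferred from the sounds described in the text, and the names LETTER_TO_USCPERS, to_uscpers and from_uscpers are hypothetical.

```python
# Hypothetical excerpt of the letter-to-symbol table; the full table is
# not reproduced here. Letters are written as Unicode escapes so the
# example does not depend on font support.
LETTER_TO_USCPERS = {
    "\u067e": "p",  # pe
    "\u062a": "t",  # te, the more common letter for [t]
    "\u0637": "T",  # ta, the less common letter for [t]
    "\u0644": "l",  # lam (inferred from the example vly)
    "\u06cc": "y",  # ye, read as [y] or [i]
    "\u0648": "v",  # vav, read as [v], [o] or [u]
}

# Because the mapping is one-to-one, it can be inverted without loss.
USCPERS_TO_LETTER = {sym: letter for letter, sym in LETTER_TO_USCPERS.items()}


def to_uscpers(word):
    """Transliterate a Persian-script word letter by letter."""
    return "".join(LETTER_TO_USCPERS[letter] for letter in word)


def from_uscpers(text):
    """Recover the Persian-script spelling from USCPers."""
    return "".join(USCPERS_TO_LETTER[sym] for sym in text)


you = "\u062a\u0648"        # written tv, pronounced [to]
but = "\u0648\u0644\u06cc"  # written vly, pronounced [vali]
print(to_uscpers(you), to_uscpers(but))
```

The round trip from_uscpers(to_uscpers(word)) returns the original spelling, which is the property that distinguishes USCPers from a pronunciation-based romanization.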
    </Section>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 USCPers/USCPron: Two Way Ambiguity
</SectionTitle>
    <Paragraph position="0"> As was noted in the previous section, vowels are not usually represented in the orthography and there are many homophonic letters. These two properties give rise to two sources of ambiguity in Persian, both of which can pose a problem for speech-to-speech machine translation: (i) homophones, in which two distinct words have the same pronunciation, like pair and pear in English or the Persian words sd and $d, which are both pronounced [sad]; and (ii) homographs, in which one orthographic representation can have more than one pronunciation, similar to the two English words convict (n) and convict (v), which are both spelled c-o-n-v-i-c-t but are pronounced differently because of different stress assignments. It is important to note that English has only a handful of such homographic pairs, while in Persian homographs are very common and contribute much ambiguity. In this section, we discuss the transcription system we have adopted in order to eliminate these ambiguities.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.1 Homophones
</SectionTitle>
      <Paragraph position="0"> The examples in Table 8 illustrate the case in (i) (the letters with the same sounds are underlined).</Paragraph>
      <Paragraph position="1"> As is evident from the last column in Table 8, in each case the two words have the same pronunciation but different spellings.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Spellings
</SectionTitle>
      <Paragraph position="0"> The word for life ends in t, while the word for backyard ends in T. In the other examples, because there is no difference in the pronunciation of h / H and s / $, we get ambiguity between Eve / air and hundred / dam. This type of ambiguity therefore appears only in speech.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.2 Homographs
</SectionTitle>
      <Paragraph position="0"> The second case of ambiguity is illustrated by the examples in the following table: Here, we see in the middle column that two words with the same orthographic representation correspond to different pronunciations (Column 3), marking different meanings, as indicated by the gloss. This type of ambiguity arises only in writing and not in speech.</Paragraph>
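The two kinds of ambiguity are mirror images of each other, which a short Python sketch can make concrete. This is only an illustration under stated assumptions: the pairs sd / $d with [sad] and klft with [koloft] come from the text, whereas the second reading of klft ("kolfat") is an assumed value added for illustration, and the names LEXICON and ambiguous_groups are hypothetical.

```python
from collections import defaultdict

# (USCPers spelling, USCPron pronunciation) pairs.
LEXICON = [
    ("sd", "sad"),
    ("$d", "sad"),
    ("klft", "koloft"),
    ("klft", "kolfat"),  # assumed second reading, for illustration only
]


def ambiguous_groups(pairs, key_index):
    """Group pairs by one field and keep the keys with several partners.

    key_index=1 groups by pronunciation (homophones);
    key_index=0 groups by spelling (homographs).
    """
    groups = defaultdict(set)
    for pair in pairs:
        groups[pair[key_index]].add(pair[1 - key_index])
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


homophones = ambiguous_groups(LEXICON, key_index=1)
homographs = ambiguous_groups(LEXICON, key_index=0)
print("same pronunciation, different spellings:", homophones)
print("same spelling, different pronunciations:", homographs)
```

Neither transcription alone separates all four entries; only the pair of spelling and pronunciation is unique, which is the motivation for combining the two kinds of information.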
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
4.3 Solution: USCPers+
</SectionTitle>
      <Paragraph position="0"> Because of the ambiguity presented by the lack of vowels, the data transcribed in USCPers cannot be used either by MT or for language modeling in ASRs without significant loss of information. In order to circumvent this problem, we adopted a modified version of USCPers. In this new version, we have added the missing vowels, which helps to disambiguate. (Because this new version is USCPers + vowels, it is called USCPers+.) In other words, USCPers+ provides both the orthographic information and some phonological information, giving rise to unique words. Let us reconsider the examples we saw above using this new transcription system. A modified version of Table 8 is presented in Table</Paragraph>
    </Section>
    <Section position="5" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
Same Spelling &amp; Different Pronunciations
</SectionTitle>
      <Paragraph position="0"> The data in Column 4 of Table 10 and Column 2 of Table 11 show that USCPron and USCPers, respectively, can give rise to ambiguity, whereas no ambiguity exists in USCPers+ (Column 3).</Paragraph>
      <Paragraph position="1"> The following sentence also illustrates this point, using the words thick and maid from Table 11. Assume that the ASR receives the audio input in (1), represented in USCPron:
 (1) USCPron: [in koloft ast]
     Gloss: this thick is
     Translation: This is thick
 If the ASR outputs USCPers, as in (2),
 (2) USCPers: Ayn klft Ast
 the MT output in English can be either of the following possible translations:
 (3) a. This is thick
     b. This is a maid
 However, using USCPers+ instead of USCPers avoids this ambiguity:
 (4) USCPers+: Ayn koloft Ast (cf. (2))
 As is evident, using USCPers+ offers a significant benefit.</Paragraph>
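The relation between examples (2) and (4) suggests that USCPers can be recovered from USCPers+ by deleting the inserted vowels. The sketch below illustrates this under an explicit assumption that is not stated in the text: that the symbols a, e and o occur in USCPers+ only as added short vowels and never as letter symbols of USCPers itself. The names INSERTED_VOWELS and uscpers_plus_to_uscpers are hypothetical.

```python
# Assumption: a, e and o are used only for the vowels added by USCPers+,
# never for a letter of the Persian alphabet in USCPers.
INSERTED_VOWELS = set("aeo")


def uscpers_plus_to_uscpers(text):
    """Drop the added vowels, leaving the purely orthographic string."""
    return "".join(ch for ch in text if ch not in INSERTED_VOWELS)


print(uscpers_plus_to_uscpers("Ayn koloft Ast"))
```

The reverse direction is the hard one: going from klft back to koloft requires choosing among several vowel patterns, which is the ambiguity that USCPers+ is designed to remove.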
      <Paragraph position="2"> The discussion of the conventions that have been adopted in the use of USCPers+ and USCPron, e.g., omitting punctuation or spelling out numbers, is beyond the scope of this paper. However, it is important to note that by adopting a reasonable number of conventions in our transcription of USCPers+ and USCPron, we have been able to provide a complete transcription convention for the acoustic models and language models of the ASRs, TTSs and MTs in our English to Persian translation system.</Paragraph>
    </Section>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Further Issue: Dealing with the Lack of Data
</SectionTitle>
    <Paragraph position="0"> Despite the significant advantages of employing the USCPers+ transcription scheme, a drawback is the lack of data in this format. To address this shortcoming, semi-automated techniques of data conversion have been developed that take into consideration the statistical structure of the language. Fig. 2 depicts a network that can be inferred from a relatively small amount of manually transliterated data. By employing statistical decoding techniques through such a model, the most likely USCPers+ sequence can be generated using minimal human intervention.</Paragraph>
    <Paragraph position="1"> Consider, for example, the sentence SS mn drd myknd and the network structure shown in Fig. 2. It is likely that the combinations man dard and dard mykonad have been seen in the manually generated data, and thus the decoder is likely to choose the path man dard mykonad as the correct transliteration.</Paragraph>
    <Paragraph position="2"> A manual decision can be made in cases where the system reaches a statistical ambiguity (usually in cases such as Ayn klft Ast) or where insufficient training data exist for the specific region of decoding.</Paragraph>
    <Paragraph position="3"> Fig. 2. The possible transitions between words are probabilistically denoted in a language model, which can be employed for decoding the most likely path, given several possibilities. Shown above are the possibilities for the decoding of the utterance SS mn drd myknd.</Paragraph>
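The decoding idea can be sketched as a Viterbi search over a word lattice scored by a bigram model. The following is a minimal sketch under stated assumptions, not the technique of Georgiou et al. (2004): the candidate vowelizations other than man, dard and mykonad, all probability values, and the names CANDIDATES, BIGRAM, FLOOR and decode are hypothetical and chosen only for illustration.

```python
import math

# Hypothetical USCPers+ candidates for each USCPers word.
CANDIDATES = {
    "mn": ["man"],
    "drd": ["dard", "dord"],
    "myknd": ["mykonad", "mykanad"],
}

# Hypothetical bigram probabilities estimated from manually converted data.
BIGRAM = {
    ("man", "dard"): 0.6,
    ("man", "dord"): 0.1,
    ("dard", "mykonad"): 0.7,
    ("dard", "mykanad"): 0.1,
    ("dord", "mykonad"): 0.2,
    ("dord", "mykanad"): 0.2,
}
FLOOR = 1e-6  # probability given to word pairs never seen in training


def decode(words):
    """Return the most likely USCPers+ sequence for a USCPers sentence."""
    # best[c] = (log probability of the best path ending in c, that path)
    best = {cand: (0.0, [cand]) for cand in CANDIDATES[words[0]]}
    for word in words[1:]:
        step = {}
        for cand in CANDIDATES[word]:
            score, path = max(
                (prev_score + math.log(BIGRAM.get((prev, cand), FLOOR)), prev_path)
                for prev, (prev_score, prev_path) in best.items()
            )
            step[cand] = (score, path + [cand])
        best = step
    return max(best.values())[1]


print(" ".join(decode("mn drd myknd".split())))
```

Because the pairs man dard and dard mykonad carry the highest probabilities in this toy model, the search selects the path man dard mykonad, mirroring the behaviour described for the example sentence.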
    <Paragraph position="4"> The first case, statistical ambiguity, is rare and usually involves short segments of text. Thus, as the models improve and we move to higher orders of decoding, the statistical ambiguity becomes less significant. Similarly, the number of unknown words keeps decreasing as newly converted data feed back into the training corpus.</Paragraph>
    <Paragraph position="5"> In our experiments, as the amount of training data grew from about 16k to 22k words, the precision of transliteration increased from 98.85% to 99.2%, while at the same time the amount of manual intervention was reduced from 39.6% to 22%. It should be noted that by changing the decision thresholds the intervention can fall significantly lower, to 9.4% with a training corpus of 22k words, but at the cost of a lower precision, on the order of 98.8%.</Paragraph>
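The trade-off between precision and manual intervention can be illustrated with a simple margin test. This is a sketch of one plausible decision rule, not the criterion actually used in the experiments: the idea of comparing the log-probability gap between the two best paths against a threshold, and the name needs_manual_review, are assumptions made here.

```python
def needs_manual_review(path_scores, threshold):
    """Decide whether a decoded segment should go to a human.

    path_scores: log probabilities of the competing paths for a segment.
    threshold:   minimum gap required between the best and second-best path.
    """
    if not path_scores:
        return True   # nothing could be decoded, e.g. an unknown word
    ranked = sorted(path_scores, reverse=True)
    if len(ranked) == 1:
        return False  # only one candidate, so there is nothing to disambiguate
    margin = ranked[0] - ranked[1]
    return threshold > margin


# A close call is sent for review; a clear winner is accepted automatically.
print(needs_manual_review([-0.9, -1.0], threshold=0.5))
print(needs_manual_review([-0.9, -3.0], threshold=0.5))
```

Under this rule, lowering the threshold accepts more segments automatically, which reduces intervention but lets more wrong choices through, in line with the direction of the figures reported above.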
    <Paragraph position="6"> An in-depth discussion of the techniques employed for the transliteration process is presented in Georgiou et al. (2004).</Paragraph>
  </Section>
</Paper>