<?xml version="1.0" standalone="yes"?>
<Paper uid="P98-2220">
  <Title>Automatic English-Chinese name transliteration for development of multilingual resources</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 English-Chinese Transliteration
</SectionTitle>
    <Paragraph position="0"> We use the term transliteration to refer generally to the problem of identifying a specific textual form in an output language (in our case, Chinese characters) that corresponds to a specific textual form in an input language (an English word or phrase). For words with semantic content, this process is essentially equivalent to the translation of individual words.</Paragraph>
    <Paragraph position="1"> For example, the English word &quot;black&quot; is associated with a concept which is expressed as &quot;黑&quot; ([hēi]) in Chinese. In this case, a dictionary search establishes the input-output correspondence.</Paragraph>
    <Paragraph position="2"> For words with little or no semantic content, such as personal and place names, dictionary lookup may suffice where standard translations exist, but in general it cannot be assumed that names will be included in the bilingual dictionary. In multilingual systems designed only for languages sharing the roman alphabet, such names pose no problem as they can simply be included unaltered in output texts in any of the languages. They cannot, however, be included in a Chinese text, as the roman characters cannot standardly be realized in the Han character set.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="1352" type="metho">
    <SectionTitle>
3 Name Transliteration
</SectionTitle>
    <Paragraph position="0"> English-Chinese name transliteration occurs on the basis of pronunciation. That is, the written English word is mapped to the written Chinese character(s) via the spoken form associated with the word. The idealized process consists of:
      1. mapping an English word (grapheme) to a phonemic representation
      2. mapping each phoneme composing the word to a corresponding Chinese character
    In practice, this process is not entirely straightforward. We outline below several issues complicating its automation. The written form of English is less than normalized. A particular English grapheme (letter or letter group) does not always correspond to a single phoneme (e.g. ea is pronounced differently in eat, threat, heart, etc.), and many English multi-letter combinations are realised as a single phoneme in pronunciation (so f, ff, ph, and gh can all map to /f/) (van den Bosch 1997). An important step in grapheme-phoneme conversion is the segmentation of words into syllables.</Paragraph>
    <Paragraph position="1"> However, this process is dependent on factors such as morphology. The syllabification of &quot;hothead&quot; divides the letter combination th, while the same combination corresponds to a single phoneme in &quot;bother&quot;. Automatic identification of the phonemes in a word is therefore a difficult problem.</Paragraph>
    <Paragraph position="2"> Many approaches to the grapheme-phoneme conversion problem exist in the literature. Divay and Vitale (1997) review several of these, and introduce a rule-based approach (with 1,500 rules for English) which achieved 94.9% accuracy on one corpus and 64.37% on another. Van den Bosch (1997) evaluates instance-based learning algorithms and a decision tree algorithm, finding that the best of these algorithms can achieve 96.9% accuracy.</Paragraph>
    <Paragraph position="3"> Even when a reliable grapheme-to-phoneme conversion module can be constructed, the English-Chinese transliteration process still faces the task of mapping phonemes in the source language to counterparts in the target language, which is difficult due to phonemic divergence between the two languages. English permits initial and final consonant clusters in syllables. Mandarin Chinese, in contrast, primarily has a consonant-vowel or consonant-vowel-[nasal consonant (/n/ or /ŋ/)] syllable structure. English consonant clusters, when pronounced within the Chinese phonemic system, must either be reduced to a single phoneme or converted to a consonant-vowel-consonant-vowel structure by inserting a vowel between the consonants in the cluster. In addition to these phonotactic constraints, the range of Chinese phonemes is not fully compatible with that of English. For instance, Mandarin does not use the phoneme /v/, and so that phoneme in English words is realized as either /w/ or /f/ in the Chinese counterpart.</Paragraph>
    <Paragraph position="4"> We focus on the specific problem of country name transliteration from English into Chinese.</Paragraph>
    <Paragraph position="5"> The algorithm does not aim to specify general grapheme-phoneme conversion for English, but only for the subset of English words relevant to place name transliteration. This limited domain rarely exhibits complex morphology and thus a robust morphological module is not included. In addition, foreign language morphemes are treated superficially. Thus, the algorithm transliterates the &quot;-istan&quot; (a morpheme having meaning in Persian) of &quot;Afghanistan&quot; in spite of a standard transliteration which omits this morpheme.</Paragraph>
    <Paragraph position="6"> The transliteration process is intended to be based purely on phonetic equivalency. On occasion, country names will have some additional meaning in English apart from the referential function, as in &quot;The United States&quot;. Such names are often translated semantically rather than phonetically in Chinese. However, this is not uniformly true: for example, &quot;Virgin&quot; in &quot;British Virgin Islands&quot; is transliterated. We therefore introduce a dictionary lookup step prior to commencing transliteration, to identify cases which have a standard translation.</Paragraph>
    <Paragraph position="7"> The transliteration algorithm results in a string of Han characters, the ideographic script used for Chinese. While the dialects of Chinese share the same orthography, they do not share the same pronunciation. This algorithm is based on the Mandarin dialect.</Paragraph>
    <Paragraph position="8"> Because automation of this algorithm is our primary goal, the transliteration starts with a written source and it is assumed that the orthography represents an assimilated pronunciation, even though English has borrowed many country names. This is permitted only because the mapping from English phonemes to Chinese phonemes loses a large degree of variance: English vowel monophthongs are flattened into a smaller number of Chinese monophthongs. However, Chinese has a larger set of diphthongs and triphthongs. This results in approximating a prototypical vowel by the closest match within the set of Chinese vowels.</Paragraph>
  </Section>
  <Section position="5" start_page="1352" end_page="1355" type="metho">
    <SectionTitle>
4 An Algorithm for Auto Transliteration
</SectionTitle>
    <Paragraph position="0"> The algorithm begins with a proper noun phrase (PNP) and returns a transliteration in Chinese characters. The process involves five main stages: Semantic Abstraction, Syllabification, Sub-syllable Divisions, Mapping to Pinyin, and Mapping to Han Characters.</Paragraph>
    <Section position="1" start_page="1352" end_page="1353" type="sub_section">
      <SectionTitle>
4.1 Semantic Abstraction
</SectionTitle>
      <Paragraph position="0"> The PNP may consist of one or more words. If it is longer than a single word, some part of it is likely to have an existing semantic translation. &quot;The&quot; and &quot;of&quot; are omitted by convention. To ensure that such words as &quot;United&quot; are translated and not transliterated (footnote 1), we pass the entire PNP into a dictionary in search of a standard translation. If a match is not immediately successful, we break the PNP into words and pass each word into the dictionary to check for a semantic translation (footnote 2). This portion of the algorithm controls which words in the PNP are translated and which are transliterated.</Paragraph>
      <Paragraph position="1">
Search for PNP in dictionary
If exact match exists
    then return corresponding characters
else
    remove article 'The' and preposition 'of'
    For each (remaining) word in PNP
        search for word in dictionary
        If exact match exists
            add matching characters to output string (footnote 3)
        else if the word is not already a Chinese word
            transliterate the word and add to output string
      </Paragraph>
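The control flow of this stage can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation used in this work: the one-entry DICTIONARY, the STOP_WORDS set, the is_chinese check and the placeholder transliterate function are all hypothetical names introduced for the example, and the reordering of known nouns described in footnote 3 is omitted.

```python
# Hypothetical one-entry dictionary; the bilingual dictionary used in the
# paper is not reproduced here.
DICTIONARY = {"islands": "群岛"}

# Words omitted by convention.
STOP_WORDS = {"the", "of"}


def is_chinese(word):
    """True if every character lies in the basic CJK Unified Ideographs block."""
    return bool(word) and all(
        ord(ch) >= 0x4E00 and 0x9FFF >= ord(ch) for ch in word
    )


def transliterate(word):
    """Placeholder (assumption) standing in for the stages of Sections 4.2 to 4.5."""
    return "[" + word.lower() + "]"


def semantic_abstraction(pnp, dictionary=DICTIONARY):
    """Choose, word by word, between dictionary translation and transliteration."""
    key = pnp.strip().lower()
    if key in dictionary:
        # The whole phrase has a standard translation.
        return dictionary[key]
    output = []
    for word in pnp.split():
        lowered = word.lower()
        if lowered in STOP_WORDS:
            continue  # article and preposition are dropped
        if lowered in dictionary:
            output.append(dictionary[lowered])
        elif is_chinese(word):
            output.append(word)  # already Chinese: keep unchanged
        else:
            output.append(transliterate(word))
    return "".join(output)


print(semantic_abstraction("Faeroe Islands"))
```

In this sketch the bracketed lower-case string marks a word that would be handed on to the transliteration stages, so the split between translated and transliterated words is visible in the output.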
    </Section>
    <Section position="2" start_page="1353" end_page="1353" type="sub_section">
      <SectionTitle>
4.2 Transliteration 1: Syllabification
</SectionTitle>
      <Paragraph position="0"> Because Chinese characters are monosyllabic, each word to be transliterated must first be divided into syllables. The outcome is a list of syllables, each with at least one vowel part.</Paragraph>
      <Paragraph position="1"> We distinguish between a consonant group and a consonant cluster, where a group is an arbitrary collection of consonant phonemes and a cluster is a known collection of consonants. Like Divay and Vitale (1997), we identify syllable boundaries on the basis of consonant clusters and vowels (ignoring morphological considerations).</Paragraph>
      <Paragraph position="2"> Any consonant group is divided into two parts, by identifying the final consonant cluster or lone consonant in that group and grouping that consonant (cluster) with the following vowel.</Paragraph>
      <Paragraph position="3"> The sub-syllabification algorithm then further divides each identified syllable. While this procedure may not always strictly divide a word into standard syllables, it produces syllables of the form consonant-vowel, the common pronunciation of most Chinese characters.</Paragraph>
      <Paragraph position="4"> Prior to the syllabification process, the input string must be normalized, so that consonant clusters are reduced to a single phoneme represented by a single ASCII character (e.g. ff and ph are both reduced to f). Instances of 'y' as a vowel are also replaced by the vowel 'i'.
        Footnote 1: The historical interactions of some European and Asian nations have led to names that include some special meaning. Interaction with the dialects of the South may have produced transliterations based on regional pronunciations which are accepted as standard.
        Footnote 2: There is some discrepancy among speakers about the balance between translation and transliteration. For instance, the word 'New' is translated by some and transliterated by others.
        Footnote 3: Identification of syntactic constraints is work in progress. Known nouns such as 'island' are moved to the end of the phrase while modifiers (remaining words) maintain their relative order.</Paragraph>
      <Paragraph position="5">
For each pair of identical consonants in the input string
    Reduce the pair to a single instance of the consonant
For each substring in the input string listed in Appendix A
    Replace substring with the corresponding phoneme (App. A)
For all instances where 'y' is not followed by a vowel or 'y' follows a consonant
    Replace this instance of 'y' with the vowel 'i'
When 'e' is followed by a consonant and an 'ia#'   ;; (where # is the end of string marker)
    Replace the preceding 'e' with 'i'

If string begins with a consonant
    Then read/store consonants until next vowel and call this substring initial_consonant_group (or icg)
Read/store vowels until next consonant and call this substring vowels (or v)
If more characters, read/store consonants until next vowel and call this final_consonant_cluster (or fcc)
If length of fcc = 1 and fcc followed by substring 'e#'
    final_vowel (or fv) = 'e'
    syllable = icg + v + fcc + fv
else
    if the last two letters of fcc form a substring in Appendix B
        then this string has a double consonant cluster
        next_syllable (or ns) = the last two letters of fcc
        reset fcc to be fcc with ns removed
    else
        next_syllable (or ns) = the last letter of fcc
        reset fcc to be fcc with ns removed
    syllable = icg + v + fcc
Store syllable in a list
Call syllabification procedure on substring [ns .. #]
      </Paragraph>
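The normalization and syllabification steps can be sketched in Python as below. This is an illustrative reading of the pseudocode, not the original program. Appendices A and B are not reproduced in this section, so APPENDIX_A and APPENDIX_B hold small hypothetical stand-in entries, and the handling of a word that ends in consonants (they are kept with the last syllable) is an assumption, since the pseudocode leaves that case open.

```python
import re

VOWELS = "aeiou"
# Hypothetical stand-ins for the appendix tables.
APPENDIX_A = {"ph": "f"}          # substring to single-character phoneme
APPENDIX_B = {"st", "tr", "br"}   # two-letter clusters that open a syllable


def normalize(word):
    """Apply the four normalization rules to a single word."""
    s = word.lower()
    collapsed = []
    for ch in s:
        # Collapse runs of an identical consonant to one instance.
        if collapsed and ch == collapsed[-1] and ch not in VOWELS:
            continue
        collapsed.append(ch)
    s = "".join(collapsed)
    for substring, phoneme in APPENDIX_A.items():
        s = s.replace(substring, phoneme)
    chars = list(s)
    for i, ch in enumerate(chars):
        if ch != "y":
            continue
        next_is_vowel = len(chars) > i + 1 and chars[i + 1] in VOWELS
        follows_consonant = i > 0 and chars[i - 1] not in VOWELS
        if not next_is_vowel or follows_consonant:
            chars[i] = "i"  # vocalic 'y' becomes 'i'
    s = "".join(chars)
    # 'e' + consonant + 'ia' at the end of the string: 'e' becomes 'i'.
    return re.sub(r"e([^aeiou])ia$", r"i\1ia", s)


def syllabify(s):
    """Recursively split a normalized word into syllables."""
    if not s:
        return []
    i = 0
    while len(s) > i and s[i] not in VOWELS:
        i += 1
    j = i
    while len(s) > j and s[j] in VOWELS:
        j += 1
    k = j
    while len(s) > k and s[k] not in VOWELS:
        k += 1
    icg, v, fcc, rest = s[:i], s[i:j], s[j:k], s[k:]
    if not rest:
        return [icg + v + fcc]            # assumption: trailing consonants stay
    if len(fcc) == 1 and rest == "e":
        return [icg + v + fcc + "e"]      # single consonant before final 'e'
    if len(fcc) >= 2 and fcc[-2:] in APPENDIX_B:
        ns = fcc[-2:]                     # double consonant cluster
    else:
        ns = fcc[-1:]
    fcc = fcc[: len(fcc) - len(ns)]
    return [icg + v + fcc] + syllabify(ns + rest)


print(syllabify(normalize("Faeroe")))
```

With these stand-in tables, the syllable break in "Faeroe" falls before the "r", in line with the worked example in Section 5.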
    </Section>
    <Section position="3" start_page="1353" end_page="1354" type="sub_section">
      <SectionTitle>
4.3 Transliteration 2: Sub-syllable Divisions
</SectionTitle>
      <Paragraph position="0"> The algorithm then proceeds to find patterns within each syllable of the list. The pattern matching consists of splitting those consonant clusters that cannot be pronounced within the Chinese phonemic set. These separated consonants are generally pronounced by inserting a context-dependent vowel. The Pinyin romanization consists of elements that can be described as consonants (including three consonant clusters &quot;zh&quot;, &quot;ch&quot; and &quot;sh&quot;) and vowels which consist of monophthongs, diphthongs and vowels followed by a nasal /n/ or /ŋ/.</Paragraph>
      <Paragraph position="1"> Consonants that follow a set of vowels are examined to determine if they &quot;modify&quot; the vowel. Such consonants include the alveolar approximant /r/, the pharyngeal fricative /h/ or the above mentioned nasal consonants. These are then joined to the vowel to form the &quot;vowel part&quot;. The &quot;vowel part&quot; may be divided so as to map onto a Pinyin syllable. Any remaining consonants are then split by inserting a vowel.</Paragraph>
      <Paragraph position="2">
For each syllable s identified above
    Initialize subsyllable_list (or sl) to the empty string
    Identify initial_consonant_group s_icg
    While s_icg is non-null
        If the first two letters of s_icg appear in Appendix C
            then consonant_pair (or cp) = those two letters
            append cp to sl
            reset s_icg to be the remainder of s_icg
        else
            add the first letter of s_icg to sl
            reset s_icg to be the remainder of s_icg
    Identify vowels (v) in s
    append v to last element of sl
    identify final_consonant_cluster (fcc) of s
    if s_fcc is non-null
        if s_fcc is equal to 'n', 'm', 'ng', 'h' or 'r'
            identify final vowels of s (s_fv)
            If s_fv exists and s_fcc = 'n' or 'm'
                append s_fcc to last element of sl
            else if s_fv exists and s_fcc not = 'n' or 'm'
                append s_fcc + s_fv to last element of sl
            else if s_fv exists and s_fcc = 'h' or 'r'
                discard s_fcc + s_fv
        else
            while s_fcc is non-null
                If the first two letters of s_fcc appear in Appendix C
                    then cp = those two letters
                    append cp to sl
                    reset s_fcc to be the remainder of s_fcc
                else
                    add the first letter of s_fcc to sl
                    reset s_fcc to be the remainder of s_fcc
    For each element of sl
        If element does not include a vowel
            Insert context dependent vowel

This procedure will subdivide the syllable into pronounceable sections for mapping to the Chinese phoneme set. Thus each subsection should be of the form &lt;cv&gt;, &lt;v&gt; or &lt;vc_n&gt;, where &quot;c&quot; is a single consonant, &quot;v&quot; is a monophthong or diphthong and &quot;c_n&quot; is a nasal consonant.</Paragraph>
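A simplified Python sketch of the sub-syllable division is given below. It is an approximation under stated assumptions rather than a faithful transcription of the procedure: Appendix C is not reproduced here, so APPENDIX_C is assumed to contain only the three Pinyin consonant clusters named in the text; a final nasal is always attached to the vowel part; a final 'h' or 'r' is simply dropped; and FILLER_VOWEL is a hypothetical fixed stand-in for the context-dependent vowel, whose selection rules are not specified in this section.

```python
VOWELS = set("aeiou")
APPENDIX_C = {"zh", "ch", "sh"}   # assumption: pairs kept together
NASALS = {"n", "m", "ng"}
FILLER_VOWEL = "e"                # hypothetical stand-in vowel


def split_consonants(group):
    """Split a consonant group into Appendix C pairs and single letters."""
    parts = []
    while group:
        size = 2 if group[:2] in APPENDIX_C else 1
        parts.append(group[:size])
        group = group[size:]
    return parts


def subdivide(syllable):
    """Divide one syllable into pronounceable subsyllables."""
    start = 0
    while len(syllable) > start and syllable[start] not in VOWELS:
        start += 1
    end = start
    while len(syllable) > end and syllable[end] in VOWELS:
        end += 1
    icg, vowels, tail = syllable[:start], syllable[start:end], syllable[end:]
    # The tail is the final consonant cluster plus any final vowel,
    # which this simplified sketch discards.
    fcc = tail.rstrip("aeiou")
    parts = split_consonants(icg)
    if parts:
        parts[-1] += vowels           # vowels join the last initial consonant
    elif vowels:
        parts.append(vowels)
    if fcc in NASALS:
        parts[-1] += fcc              # nasal joins the vowel part
    elif fcc not in ("h", "r"):
        parts.extend(split_consonants(fcc))
    # Any element without a vowel receives the stand-in vowel.
    return [p if VOWELS.intersection(p) else p + FILLER_VOWEL for p in parts]


print(subdivide("stan"))
```

Splitting "stan" this way yields one vowel-filled element for the "s" and one element carrying the vowel and the nasal, which is the shape the mapping stage expects.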
    </Section>
    <Section position="4" start_page="1354" end_page="1354" type="sub_section">
      <SectionTitle>
4.4 Transliteration 3: Mapping to Pinyin
</SectionTitle>
      <Paragraph position="0"> The subsyllables are then mapped to the Pinyin romanization standard equivalents by means of a table (Appendix D). The columns of this table are indexed by the consonants of the subsyllable, and the rows by the vowel part of the subsyllable. When an exact match cannot be found we prioritize aspects of the subsyllable.</Paragraph>
      <Paragraph position="1"> Often the highest priority is the initial consonant. Of next priority are nasal consonants. This may demand an alternate vowel choice if no such combination of phonemes exists in the table.</Paragraph>
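The table lookup with a prioritized fallback might look as follows in Python. The sketch is illustrative only: Appendix D is not reproduced in this section, so PINYIN_TABLE, CONSONANT_SUBSTITUTES and VOWEL_ALTERNATIVES are hypothetical fragments, populated from the substitutions mentioned in the text (/r/ to /l/, /ae/ to /ei/, /oe/ to /uo/). The fallback keeps the initial consonant fixed and tries alternate vowel parts, which is one simple way to give the consonant the highest priority.

```python
# Hypothetical fragment of the (consonant, vowel part) table.
PINYIN_TABLE = {
    ("f", "ei"): "fei",
    ("l", "uo"): "luo",
    ("s", "i"): "si",
    ("t", "an"): "tan",
}

# English consonants with no Mandarin counterpart (assumed substitutions).
CONSONANT_SUBSTITUTES = {"r": "l"}

# Alternate vowel parts to try, in order, when no exact match exists.
VOWEL_ALTERNATIVES = {"ae": ["ei"], "oe": ["uo"]}


def to_pinyin(consonant, vowel_part):
    """Return the Pinyin syllable for a subsyllable, or None if none is found."""
    consonant = CONSONANT_SUBSTITUTES.get(consonant, consonant)
    candidates = [vowel_part] + VOWEL_ALTERNATIVES.get(vowel_part, [])
    for vowel in candidates:
        # The consonant is held fixed; only the vowel choice is relaxed.
        if (consonant, vowel) in PINYIN_TABLE:
            return PINYIN_TABLE[(consonant, vowel)]
    return None


print(to_pinyin("f", "ae"), to_pinyin("r", "oe"))
```

Returning None for an unmatched subsyllable leaves the choice of a further fallback to the caller, since the text does not say how such cases are resolved.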
    </Section>
    <Section position="5" start_page="1354" end_page="1355" type="sub_section">
      <SectionTitle>
4.5 Transliteration 4: Mapping to Han
</SectionTitle>
      <Paragraph position="0"> Once the Pinyin of a word is established, the Han characters are simply extracted from a table specifying the Pinyin &lt;cv&gt; to Han character correspondence (Appendix E). In some cases, multiple characters might be possible, but the table includes only the most common.</Paragraph>
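This final lookup is a direct table access, sketched below in Python. Appendix E is not reproduced in this section, so HAN_TABLE is a hypothetical fragment with one character per Pinyin syllable; the characters chosen here are common ones for these syllables and are not claimed to be the entries of the original table.

```python
# Hypothetical fragment of the Pinyin-to-Han table: one character per
# syllable, since the table described in the text keeps only one choice.
HAN_TABLE = {
    "si": "斯",
    "tan": "坦",
    "luo": "罗",
    "qun": "群",
    "dao": "岛",
}


def to_han(pinyin_syllables, table=HAN_TABLE):
    """Map a list of Pinyin syllables to a string of Han characters."""
    missing = [p for p in pinyin_syllables if p not in table]
    if missing:
        # Fail loudly rather than emit a partial transliteration.
        raise KeyError("no Han character listed for: " + ", ".join(missing))
    return "".join(table[p] for p in pinyin_syllables)


print(to_han(["si", "tan"]))
```

Raising an error for an unlisted syllable is a design choice of this sketch; a production table would presumably cover every Pinyin syllable the previous stage can produce.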
    </Section>
  </Section>
  <Section position="6" start_page="1355" end_page="1355" type="metho">
    <SectionTitle>
5 An Example
</SectionTitle>
    <Paragraph position="0"> The transliteration of the place name &quot;Faeroe Islands&quot; according to the algorithm proceeds as follows:
      1. No match for &quot;Faeroe&quot; is found in the dictionary, so it must be transliterated.
      2. Divide Faeroe into two syllables by recognizing that the syllabic break falls before the &quot;r&quot; in the middle consonant group.
      3. Map /fae/ and /roe/ onto their Chinese equivalents. Since no vowel form /ae/ exists in Chinese, this is mapped to /ei/. The /r/ of the second syllable is mapped to /l/ and /oe/ is correspondingly mapped to /uo/.</Paragraph>
    <Paragraph position="1"> 4. Since each syllable is of the form &lt;cv&gt;, no subsyllabic processing is required.</Paragraph>
    <Paragraph position="2"> 5. The transliterated phrase &quot;fei luo&quot; is then mapped to the Han characters: &quot;[…]&quot;
      6. &quot;Islands&quot; is searched for and found in the dictionary: &quot;群岛&quot; (qún dǎo)
      7. The characters of the translated &quot;Islands&quot; are placed after the transliteration of &quot;Faeroe&quot;: &quot;[…] 群岛&quot; (fei luo qún dǎo)</Paragraph>
  </Section>
</Paper>