<?xml version="1.0" standalone="yes"?>
<Paper uid="C04-1103">
  <Title>Direct Orthographical Mapping for Machine Transliteration</Title>
  <Section position="3" start_page="1" end_page="1" type="metho">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> An English name and its Chinese transliteration, for example &amp;quot;Smith&amp;quot; and &amp;quot;Shi-Mi-Si&amp;quot; in Pinyin (the standard Romanization of Chinese), form a pair of transliteration and back-transliteration. In many natural language processing tasks, such as multilingual named entity and term processing, machine translation, corpus alignment, cross-lingual information retrieval and automatic bilingual dictionary compilation, automatic name transliteration has become an indispensable component.</Paragraph>
    <Paragraph position="1"> Recent efforts are reported for several language pairs, such as English/Chinese (Meng et al., 2001; Virga et al., 2003; Lee et al., 2003; Gao et al., 2004; Guo et al., 2004), English/Japanese (Knight et al., 1998; Brill et al., 2001; Bilac et al., 2004),  Pinyin is the standard Romanization of Chinese.</Paragraph>
    <Paragraph position="2"> English/Korean (Oh et al., 2002; Sung et al., 2000), and English/Arabic (Yaser et al., 2002).</Paragraph>
    <Paragraph position="3"> Most of the reported works utilize a phonetic clue to resolve the transliteration through a multiple step phonemic mapping where algorithms, such as dictionary lookup, rule-based and machine learning-based approaches, have been well explored.</Paragraph>
    <Paragraph position="4"> In this paper, we will discuss the limitation of the previous works and present a novel framework for machine transliteration. The new framework carries out the transliteration by direct orthographical mapping (DOM) without any intermediate phonemic mapping. Under this framework, we further propose a joint source-channel transliteration mode (n-gram TM) as an alternative machine learning-based approach to model the source-target word orthographic association. Without the loss of generality, we evaluate the performance of the proposed method for English/Chinese and English/Japanese pairs.</Paragraph>
    <Paragraph position="5"> An experiment that compares the proposed method with several state-of-art approaches is also presented. The results reveal that our method outperforms other previous methods significantly. The reminder of the paper is organized as follows. Section 2 reviews the previous work. In section 3, the DOM framework and n-gram TM model are formulated. Section 4 describes the evaluation results and compares our method with other reported work. Finally, we conclude the study with some discussions.</Paragraph>
  </Section>
  <Section position="4" start_page="1" end_page="1" type="metho">
    <SectionTitle>
2 Previous Work
</SectionTitle>
    <Paragraph position="0"> The topic of machine transliteration has been studied extensively for several different language pairs, and many techniques have been proposed.</Paragraph>
    <Paragraph position="1"> To better understand the nature of the problem, we review the previous work from two different viewpoints: the transliteration framework and the transliteration model. The transliteration model is built to capture the knowledge of bilingual phonetic association and subsequently is applied to the transliteration process.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.1 Transliteration Framework
</SectionTitle>
      <Paragraph position="0"> The phoneme-based approach has received remarkable attention in the previous works (Meng et al., 2001; Virga et al., 2003; Knight et al., 1998; Oh et al., 2002; Sung et al., 2000; Yaser et al., 2002; Lee et al., 2003). In general, this approach includes the following three intermediate phonemic/orthographical mapping steps:  1) Conversion of a source language word into its phonemic representation (grapheme-tophoneme conversion, or G2P); 2) Transformation of the source language phonemic representation to the target language phonemic representation; 3) Generation of target language orthography  from its phonemic representation (phonemeto-grapheme conversion, or P2G).</Paragraph>
      <Paragraph position="1"> To achieve phonetic equivalent transliteration, phoneme-based approach has become the most popular approach. However, the success of phoneme-based approach is limited by the following constraints:</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="1" end_page="1" type="metho">
    <SectionTitle>
1) Grapheme-to-phoneme conversion,
</SectionTitle>
    <Paragraph position="0"> originated from text-to-speech (TTS) research, is far from perfect (The Onomastica Consortium, 1995), especially for the name of different language origins. 2) Cross-lingual phonemic mapping presents a great challenge due to phonemic divergence between some language pairs, such as Chinese/English, Japanese/English (Wan and Verspoor, 1998; Meng et al., 2001).</Paragraph>
  </Section>
  <Section position="6" start_page="1" end_page="1" type="metho">
    <SectionTitle>
3) The conversion of phoneme-to-grapheme
</SectionTitle>
    <Paragraph position="0"> introduces yet another level of imprecision, esp. for the ideographic language, such as Chinese. Virga and Khudanpur (2003) reported 8.3% absolute accuracy drops when converting from Pinyin to Chinese character.</Paragraph>
    <Paragraph position="1"> The three error-prone steps as stated above lead to an inferior overall system performance. The complication of multiple steps and introduction of intermediate phonemes also incur high cost in system development when moving from one language pair to another, because we have to work on language specific ad-hoc phonic rules.</Paragraph>
    <Section position="1" start_page="1" end_page="1" type="sub_section">
      <SectionTitle>
2.2 Transliteration Model
</SectionTitle>
      <Paragraph position="0"> Transliteration model is a knowledge base to support the execution of transliteration strategy. To build the knowledge base, machine learning or rule-based algorithms are adopted in phoneme-based approach. For instance, noisy-channel model (NCM) (Virga et al., 2003; Lee et al., 2003), HMM (Sung et al., 2000), decision tree (Kang et al., 2000), transformation-based learning (Meng et al., 2001), statistical machine transliteration model (Lee et al., 2003), finite state transducers (Knight et al., 1998) and rule-based approach (Wan et al., 1998; Oh et al., 2002). It is observed that the reported transliteration models share a common strategy, that is:  1) To model the transformation rules; 2) To model the target language; 3) To model the above both;  However, the modeling of different knowledge is always done independently. For example, NCM and HMM (Virga et al., 2003; Lee et al., 2003; Sung et al., 2000) model the transformation mapping rules and the target language separately; decision tree (Kang et al., 2000), transformation-based learning (Meng et al., 2001), finite state transducers (Knight et al., 1998) and statistical machine transliteration model (Lee et al., 2003) only model the transformation rules.</Paragraph>
    </Section>
  </Section>
  <Section position="7" start_page="1" end_page="111" type="metho">
    <SectionTitle>
3 Direct Orthographical Mapping
</SectionTitle>
    <Paragraph position="0"> To overcome the limitation of phoneme-based approach, we propose a unified framework for machine transliteration, direct orthographical mapping (DOM). The DOM framework tries to model phonetic equivalent association by fully exploring the orthographical contextual information and the orthographical mapping.</Paragraph>
    <Paragraph position="1"> Under the DOM framework, we propose a joint source-channel transliteration model (n-gram TM) to capture the source-target word orthographical mapping relation and the contextual information.</Paragraph>
    <Paragraph position="2"> Unlike the noisy-channel model, the joint source-channel model does not try to capture how the source names can be mapped to the target names, but rather how both source and target names can be generated simultaneously.</Paragraph>
    <Paragraph position="3"> The proposed framework is applicable to all language pairs. For simplicity, in this section, we take English/Chinese pair as example in the formulation, where E2C refers to English to Chinese transliteration and C2E refers to Chinese to English back-transliteration.</Paragraph>
    <Section position="1" start_page="1" end_page="111" type="sub_section">
      <SectionTitle>
3.1 Transliteration Pair and Alignment
</SectionTitle>
      <Paragraph position="0"> Suppose that we have an English name</Paragraph>
      <Paragraph position="2"> y are Chinese characters. The English name a and its Chinese Transliteration b can be segmented into a series of substrings:</Paragraph>
      <Paragraph position="4"> We call the substring as transliteration unit and each English transliteration unit</Paragraph>
      <Paragraph position="6"> form a transliteration pair. An alignment between a and b is defined as g with</Paragraph>
      <Paragraph position="8"> a monograph, a digraph or a trigraph and so on for English. For example, &amp;quot;A |a Bu |b Lu |ru Zuo |zzo&amp;quot; is one alignment of Chinese-English word pair &amp;quot;A Bu Lu Zuo &amp;quot; and &amp;quot;abruzzo&amp;quot;.</Paragraph>
    </Section>
    <Section position="2" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.2 DOM Transliteration Framework
</SectionTitle>
      <Paragraph position="0"> By the definition of a , b and g , the E2C transliteration can be formulated as  To reduce the computational complexity, in eqn. (1), common practice is to replace the summation with maximization.</Paragraph>
      <Paragraph position="1"> The eqn. (1) and (2) formulate the DOM transliteration framework. ),,( gbaP is the joint probability of a , b and g , whose definition depends on the transliteration model which will be discussed in the next two subsections. Unlike the phoneme-based approach, DOM does not need to explicitly model any phonetic information of either source or target language. Assuming sufficient training corpus, DOM transliteration framework is to capture the phonetic equivalents through  orthographic mapping or transliteration pair i ce &gt;&lt; , . By eliminating the potential imprecision introduced through a multiple-step  phonetic mapping in the phoneme-based approach, DOM is expected to outperform. In contrast to phoneme-based approach, DOM is purely datadriven, therefore can be extended across different language pairs easily.</Paragraph>
    </Section>
    <Section position="3" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.3 n-gram TM under DOM
</SectionTitle>
      <Paragraph position="0"> Given a and b , the joint probability of ),,( gbaP is the probability of alignment g , which can be formulated as follows:  In eqn. (3), the transliteration pair is used as the token to derive n-gram statistics, so we call the model as n-gram TM transliteration model.</Paragraph>
      <Paragraph position="1">  The above block diagram illustrates typical system structure of DOM. The training of n-gram TM model is discussed in section 3.5. Given a language pair, the bidirectional transliterations can be achieved with the same n-gram TM and using the same decoder.</Paragraph>
    </Section>
    <Section position="4" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.4 DOM: n-gram TM vs. NCM
</SectionTitle>
      <Paragraph position="0"> Noisy-channel model (NCM) has been well studied in the phoneme-based approach. Let's take E2C as an example to look into a bigram case to see what n-gram TM and NCM present to us under  where eqn. (4) and (5) are the bigram version of NCM and n-gram TM under DOM, respectively.</Paragraph>
      <Paragraph position="1"> The formulation of eqn. (4) could be interpreted as a HMM that has Chinese units as its hidden states and English transliteration units as the observations (Rabiner, 1989). Indeed, NCM consists of two models; one is the channel model or transliteration model,</Paragraph>
      <Paragraph position="3"> )|( , which tries to estimate the mapping probability between the two units;  )|( , which tries to estimate the generative probability of the Chinese name, given the sequence of Chinese transliteration units. Unlike NCM, n-gram TM model does not try to capture how source names can be mapped into target names, but rather how source and target names can be generated simultaneously.</Paragraph>
      <Paragraph position="4"> We can also study the two models from the contextual information usage viewpoint. One finds that eqn. (4) can be approximated by eqn. (5).  e are absent in the channel model and source model of NCM, respectively. In this way, one could argue that n-gram TM model captures more context information than traditional NCM model. With adequate and sufficient training data, n-gram TM is expected to outperform NCM in the decoding.</Paragraph>
    </Section>
    <Section position="5" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.5 Transliteration Alignment Training
</SectionTitle>
      <Paragraph position="0"> For the n-gram TM model training, the bilingual name corpus needs to be aligned firstly at the transliteration unit level. The maximum likelihood approach, through EM algorithm (Dempster et al., 1977) is employed to infer such an alignment.</Paragraph>
      <Paragraph position="1"> The aligning process is different from that of transliteration given in eqn. (1) or (2), here we have a fixed bilingual entries, a and b . The aligning process is just to find the alignment segmentation g between the two strings that maximizes the joint probability:</Paragraph>
      <Paragraph position="3"> Kneser-Ney smoothing algorithm (Chen et al., 1998) is applied to smooth the probability distribution. NCM model training is carried out in the similar way to n-gram TM. The difference between the two models lies in eqn (4) and (5).</Paragraph>
    </Section>
    <Section position="6" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
3.6 Decoding Issue
</SectionTitle>
      <Paragraph position="0"> The decoder searches for the most probabilistic path of transliteration pairs, given the word in source language, by resolving different combinations of alignments. Rather than Viterbi algorithm, we use stack decoder (Schwartz et al., 1990) to get N-best results for further processing or as output for other applications.</Paragraph>
    </Section>
  </Section>
  <Section position="8" start_page="111" end_page="111" type="metho">
    <SectionTitle>
4 The Experiments
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
4.1 Testing Environments
</SectionTitle>
      <Paragraph position="0"> We evaluate our method through several experiments for two language pairs: English/Chinese and English/Japanese.</Paragraph>
      <Paragraph position="1"> For English/Chinese language pair, we use a database from the bilingual dictionary &amp;quot;Chinese Transliteration of Foreign Personal Names&amp;quot; (Xinhua, 1992). The database includes a collection of 37,694 unique English entries and their official Chinese transliteration. The listing includes personal names of English, French, and many other origins. The following results for this language pair are estimated by 13-fold cross validation for more accurate. We report two types of error rates: word error rate and character error rate. In word error rate, a word is considered correct only if an exact match happens between transliteration and the reference. The character error rate is the sum of deletion, insertion and substitution errors. Only the top choice in N-best results is used for character error rate reporting.</Paragraph>
      <Paragraph position="2"> For English/Japanese language pair, we use the same database as that in the literature (Bilac et al., 2004)  . The database includes 7,021 Japanese words in katakana together with their English translation extracted from the EDICT dictionary  .</Paragraph>
      <Paragraph position="3"> 714 tokens of these entries are withheld for evaluation. Only word error rate is reported for this language pair.</Paragraph>
    </Section>
    <Section position="2" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
4.2 Modeling
</SectionTitle>
      <Paragraph position="0"> The alignment is done fully automatically along with the n-gram TM training process.</Paragraph>
      <Paragraph position="1">  We thank Mr. Slaven Bilac for letting us use his testing setup as a reference.</Paragraph>
      <Paragraph position="2">  ftp://ftp.cc.monash.edu.au/pub/nihongo/. # close set bilingual entries (full data) 7,021 # training entries for open test 6,307 # test entries for open test 714</Paragraph>
    </Section>
    <Section position="3" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
4.3 E2C Transliteration
</SectionTitle>
      <Paragraph position="0"> In this experiment, we conduct both open and closed tests for n-gram TM and NCM models under DOM paradigm. Results are reported in Table 3 and Table 4.</Paragraph>
      <Paragraph position="1">  Not surprisingly, the result shows that n-gram TM, which benefits from the joint source-channel model coupling both source and target contextual information into the model, is superior to NCM in all the test cases.</Paragraph>
    </Section>
    <Section position="4" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
4.4 C2E Back-Transliteration
</SectionTitle>
      <Paragraph position="0"> The C2E back-transliteration is more challenging than E2C transliteration. Experiment results are reported in Table 5. As expected, C2E error rate is much higher than that of E2C.</Paragraph>
      <Paragraph position="1">  both E2C and C2E which implies the potential of error reduction by using secondary knowledge source, such as table looking-up. The N-best error rates are also reduced greatly at 10-best level.</Paragraph>
    </Section>
    <Section position="5" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
4.5 Discussions of DOM
</SectionTitle>
      <Paragraph position="0"> Due to lack of standard data sets, the DOM framework is unable to make a straightforward comparison with other approaches. Nevertheless, we list some reported studies on other databases of E2C tasks in Table 7 and those of C2E tasks in Table 8 for reference purpose. In Table 7, the reference data are extracted from Table 1 and 3 of (Virga et al., 2003), where only character and Pinyin error rates are reported. The first 4 setups by Virga et al. all adopted the phoneme-based approach. In table 8, the reference data are extracted from Table 2 and Figure 4 of (Guo et al., 2004), where word error rates are reported.</Paragraph>
      <Paragraph position="1">  Since we have obtained results in character already and the character to Pinyin mapping is one-to-one in the 374 legitimate Chinese characters for transliteration in our implementation, we expect less Pinyin error than character error in Table 7.  For E2C, Table 7 shows that even with an 8 times larger database than ours, Huge MT (Big MT) test case who reports the best performance still generates 3 times Pinyin error rate than ours. For C2E, Table 8 shows that even with only 9 percent training set, our approach can still make 20 percent absolute word error rate reduction. Thus, although the experiment are done in different environments, to some extend, Table 7 and Table 8 reveal that the n-gram TM/DOM outperforms other techniques for the case of English/Chinese transliteration/back-transliteration significantly.</Paragraph>
    </Section>
    <Section position="6" start_page="111" end_page="111" type="sub_section">
      <SectionTitle>
4.6 English/Japanese Transliteration
</SectionTitle>
      <Paragraph position="0"> In this experiment, we conduct both open and closed tests for n-gram TM on English/Japanese transliteration and back-transliteration. We use the same training and testing setups as those in (Bilac et al., 2004).</Paragraph>
      <Paragraph position="1"> Table 9 reports the results from three different transliteration mechanisms. Case 1 is the 3-gram TM under DOM; Case 2 is Case 1 integrated with a dictionary lookup validation process during decoding; Case 3 is extracted from (Bilac et al., 2004). Similar to English/Chinese transliteration, one can find that J2E back-transliteration is more challenging than E2J transliteration in both open and closed cases. It is also found that word error rates are reduced greatly at 10-best level.</Paragraph>
      <Paragraph position="2"> (Bilac et al., 2004) proposed a hybrid-method of grapheme-based and phoneme-based for J2E backtransliteration, where the whole EDICT dictionary, including the test set, is used to train a LM. A LM unit is a word itself. In this way, the dictionary is used as a lookup table in the decoding process to help identify a valid choice among candidates. To establish comparison, we also integrate the dictionary lookup processing with the decoder, which is referred as Case 2 in Table 9. It is found that Case 2 presents a error reduction of 43.8%=(14.6-8.2)/14.6% for word over to those reported in (Bilac et al., 2004). Furthermore, the n-gram TM/DOM approach is rather straightforward in implementation where direct orthographical mapping could potentially handle Japanese transliteration of names of different language origins, while the issues with non-English terms are reported in (Bilac et al., 2004).</Paragraph>
      <Paragraph position="3"> The DOM framework shows us a great improvement in performance with n-gram TM being the most successful implementation.</Paragraph>
      <Paragraph position="4"> Nevertheless, NCM presents another successful implementation of DOM framework. The n-gram TM and NCM under direct orthographic mapping (DOM) paradigm simplify the process and reduce the chances of conversion errors. The experiments also show that even with much less training data, DOM are still much more superior performance than the state of art solutions.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>