Acquisition of English-Chinese Transliterated Word Pairs from Parallel-Aligned Texts using a Statistical Machine Transliteration Model

2 Statistical Machine Transliteration Model

2.1 Overview of the Noisy Channel Model

Machine transliteration can be regarded as a noisy channel, as illustrated in Figure 1. Briefly, the language model generates a source word E, and the transliteration model converts the word E into a target transliteration C. The channel decoder is then used to find the word \hat{E} that is most likely to be the word E that gave rise to the observed transliteration C.

[Figure 1: machine transliteration viewed as a noisy channel.]

Under the noisy channel model, the back-transliteration problem is to find the most probable word E given a transliteration C. Letting P(E) be the probability of a word E, the back-transliteration probability of a word E given a transliteration C can be written as P(E | C). By Bayes' rule, the transliteration problem can be written as follows:

  \hat{E} = \arg\max_E P(E \mid C)                                                      (1)
          = \arg\max_E \frac{P(E)\, P(C \mid E)}{P(C)} = \arg\max_E P(E)\, P(C \mid E)  (2)

The first term in Eq. (2), P(E), is the language model, the probability of E. The second term, P(C | E), is the transliteration model, the probability of the transliteration C conditioned on E.

Below, we assume that E is written in English, while C is written in Chinese. Since Chinese and English do not belong to the same language family, there is no simple or direct way of mapping and comparison. One feasible solution is to adopt a Chinese romanization system to represent the pronunciation of each Chinese character. Among the many romanization systems for Chinese, Wade-Giles and Pinyin are the most widely used. The Wade-Giles system is commonly used in Taiwan today and has traditionally been popular among Western scholars; for this reason, we use the Wade-Giles system to romanize Chinese characters. (Reference sites: http://www.romanization.com/index.html and http://www.edepot.com/taoroman.html.) However, the proposed approach is equally applicable to other romanization systems.

The language model gives the prior probability P(E), which can be modeled using maximum likelihood estimation. As for the transliteration model P(C | E), we approximate it using transliteration units (TUs), which form a decomposition of E and C. A TU is defined as a sequence of characters transliterated as a base unit. For English, a TU can be a monograph, a digraph, or a trigraph (Wells, 2001). For Chinese, a TU can be a syllable initial, a syllable final, or a syllable (Chao, 1968), represented by the corresponding romanized characters. To illustrate how this approach works, take the English name "Smith", which can be segmented into four TUs and aligned with its romanized transliteration. Assuming the word is segmented as "S-m-i-th", a possible alignment with the Chinese transliteration "Shi Mi Si (Shihmissu)" is depicted in Figure 2.

[Figure 2: TU alignment between English and Chinese romanized character sequences.]
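To make the channel decoder concrete, the following minimal sketch scores back-transliteration candidates with the decision rule of Eq. (2). The probability tables and the candidate list are hypothetical illustrations, not the paper's actual model parameters:

```python
import math

# Hypothetical language-model and transliteration-model scores. In the paper
# these are estimated from a bilingual proper-name list; the numbers below
# are made up purely for illustration.
P_E = {"Smith": 0.004, "Smyth": 0.001}            # language model P(E)
P_C_given_E = {("Shihmissu", "Smith"): 0.020,     # transliteration model P(C|E)
               ("Shihmissu", "Smyth"): 0.015}

def back_transliterate(C, candidates):
    """Return argmax_E P(E) * P(C|E), i.e. the channel decoder of Eq. (2)."""
    return max(candidates,
               key=lambda E: math.log(P_E[E]) + math.log(P_C_given_E[(C, E)]))

print(back_transliterate("Shihmissu", ["Smith", "Smyth"]))  # -> Smith
```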
2.2 Formal Description: Statistical Transliteration Model (STM)

A word E with l characters and a romanized word C with n characters are denoted by E = e_1^l and C = c_1^n, respectively. Assume that the number of aligned TUs for (E, C) is N, and let

  M = m_1^N                                                                   (3)

be an alignment candidate, where m_j is the match type of the j-th TU pair. The match type is defined as a pair of TU lengths for the two languages. For instance, in the case of (Smith, Shihmissu), N is 4 and M is {1-4, 1-1, 1-1, 2-3}. We write E and C as E = u_1^N and C = v_1^N, where u_i and v_j are the i-th TU of E and the j-th TU of C, respectively. Then the probability of C given E, P(C | E), is formulated as follows:

  P(C \mid E) = \sum_M \prod_{j=1}^{N} P(v_j \mid u_j)\, P(m_j)               (4)

To reduce computational complexity, one alternative approach is to change the summation in Eq. (4) into a maximization. Therefore, we can approximate P(C | E) as

  P(C \mid E) \approx \max_M P(C, M \mid E)                                   (5)
              = \max_M \prod_{j=1}^{N} P(v_j \mid u_j)\, P(m_j)               (6)
  \log P(C \mid E) \approx \max_M \sum_{j=1}^{N} [\log P(v_j \mid u_j) + \log P(m_j)]   (7)

Let S(i, j) be the maximum accumulated log probability between the first i characters of E and the first j characters of C. Then log P(C | E) = S(l, n), the maximum accumulated log probability among all possible alignment paths of E with length l and C with length n, can be computed using a dynamic programming (DP) strategy:

  S(0, 0) = 0                                                                 (8)
  S(i, j) = \max_{h,k} \{ S(i-h, j-k) + \log P(c_{j-k+1} \cdots c_j \mid e_{i-h+1} \cdots e_i) + \log P(h, k) \}   (9)
  \log P(C \mid E) = S(l, n)                                                  (10)

where P(h, k) is defined as the probability of the match type "h-k".

2.3 Estimation of Model Parameters

To describe the iterative procedure for re-estimating the probabilities P(v_j | u_i) and P(h, k), let count(u_i, v_j) be the number of occurrences of a TU u_i of length h aligned with a TU v_j of length k in the training set. The translation probability P(v_j | u_i) can then be estimated as

  P(v_j \mid u_i) = \frac{count(u_i, v_j)}{\sum_{v} count(u_i, v)}            (11)

and the probability of the match type, P(h, k), as

  P(h, k) = \frac{count(h, k)}{\sum_{h', k'} count(h', k')}                   (12)

In the beginning, a reasonable initial estimate of the parameters of the translation model is obtained by constraining the TU alignments of a word pair (E, C) to lie within a position distance d (Lee and Choi, 1997). That is, a character e_i may be aligned with a character c_j only if

  \left| i \times \frac{n}{l} - j \right| \le d,                              (13)

where l and n are the lengths of the source word E and the target word C, respectively.

To accelerate the convergence of EM training and to reduce noisy TU alignment pairs (u_i, v_j), we restrict the combinations of TU pairs to limited patterns. Only consonant TU pairs with the same or similar phonemes are allowed to match. An English consonant is also allowed to match a Chinese syllable beginning with the same or a similar phoneme. An English semivowel TU can be matched either with a Chinese consonant or vowel having the same or a similar phoneme, or with a Chinese syllable beginning with the same or a similar phoneme.
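A minimal dynamic-programming sketch of Eqs. (8)-(10) follows. The tables tu_prob (standing in for P(v | u)) and match_prob (for P(h, k)) are assumed inputs, the TU length bounds are illustrative, and omitted (zero-length) TUs are not modeled here for simplicity:

```python
import math

NEG_INF = float("-inf")

def viterbi_score(E, C, tu_prob, match_prob, max_h=3, max_k=4):
    """S(i, j): maximum accumulated log probability; returns S(l, n)."""
    l, n = len(E), len(C)
    S = [[NEG_INF] * (n + 1) for _ in range(l + 1)]
    S[0][0] = 0.0                                  # Eq. (8)
    for i in range(l + 1):
        for j in range(n + 1):
            if S[i][j] == NEG_INF:
                continue
            for h in range(1, max_h + 1):          # English TU length
                for k in range(1, max_k + 1):      # romanized Chinese TU length
                    if i + h > l or j + k > n:
                        continue
                    p = tu_prob.get((E[i:i+h], C[j:j+k]), 0.0)
                    q = match_prob.get((h, k), 0.0)
                    if p > 0.0 and q > 0.0:        # relax Eq. (9) forward
                        s = S[i][j] + math.log(p) + math.log(q)
                        S[i+h][j+k] = max(S[i+h][j+k], s)
    return S[l][n]                                 # log P(C|E), Eq. (10)
```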
As for the probability of a match type, P(h, k), it is set to a uniform distribution in the initialization phase:

  P(h, k) = \frac{1}{T},                                                      (14)

where T is the total number of match types allowed.

Based on the Expectation Maximization (EM) algorithm (Dempster et al., 1977) with Viterbi decoding (Forney, 1973), the iterative parameter estimation procedure is as follows:

Step 1 (Initialization): Use Eq. (13) to generate likely TU alignment pairs and calculate the initial model parameters.
Step 2 (Expectation): Based on the current model parameters, find the best Viterbi path for each (E, C) word pair in the training set.
Step 3 (Maximization): Based on all the TU alignment pairs obtained in Step 2, calculate the new model parameters using Eqs. (11) and (12), and replace the old parameters with them. If a stopping criterion or a predefined number of iterations is reached, stop the training procedure; otherwise, go back to Step 2.

3 Extraction of Transliterated Word Pairs

The task of machine transliteration is useful for many NLP applications, and one interesting related problem is how to find the corresponding transliteration for a given source word in a parallel corpus. We describe how to apply the proposed model to this task. For that purpose, a sentence alignment procedure is applied first to align the parallel texts at the sentence level. Then we use a tagger to identify proper nouns in the source text. After that, the model is applied to isolate the transliteration in the target text. In general, the proposed transliteration model can be further augmented with linguistic processing, described in more detail in the next subsection. The overall process is summarized in Figure 3.

[Figure 3: the overall process of extracting transliterations from parallel-aligned texts.]

In one excerpt from the corpus, three English proper nouns, "Jaenisch", "Whitehead", and "Massachusetts", are identified by the tagger. Utilizing Eq. (7) and the DP approach formulated by Eqs. (8)-(10), we find the target word "huaihaite (Huai Hai De)" most likely to correspond to "Whitehead". In order to retrieve the transliteration for a given proper noun, we need to keep track of the optimal TU decoding sequence associated with the given Chinese term for each word pair. The aligned TUs can easily be obtained by backtracking the best Viterbi path (Manning and Schutze, 1999). For the example mentioned above, the alignments of the TU matching pairs via the Viterbi backtracking path are illustrated in Figure 4.

[Figure 4: TU alignments obtained via the Viterbi backtracking path.]
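The sketch below repeats the DP from the previous sketch but records backpointers, so the best TU alignment can be recovered by backtracking. The toy probability tables in the usage example are made up for illustration:

```python
import math

NEG_INF = float("-inf")

def viterbi_align(E, C, tu_prob, match_prob, max_h=3, max_k=4):
    """Same DP as viterbi_score, plus backpointers for alignment recovery."""
    l, n = len(E), len(C)
    S = [[NEG_INF] * (n + 1) for _ in range(l + 1)]
    back = [[None] * (n + 1) for _ in range(l + 1)]
    S[0][0] = 0.0
    for i in range(l + 1):
        for j in range(n + 1):
            if S[i][j] == NEG_INF:
                continue
            for h in range(1, max_h + 1):
                for k in range(1, max_k + 1):
                    if i + h > l or j + k > n:
                        continue
                    p = tu_prob.get((E[i:i+h], C[j:j+k]), 0.0)
                    q = match_prob.get((h, k), 0.0)
                    if p > 0.0 and q > 0.0:
                        s = S[i][j] + math.log(p) + math.log(q)
                        if s > S[i+h][j+k]:
                            S[i+h][j+k], back[i+h][j+k] = s, (i, j)
    # Backtrack from (l, n) to (0, 0), collecting aligned TU pairs.
    pairs, i, j = [], l, n
    while (i, j) != (0, 0) and back[i][j] is not None:
        pi, pj = back[i][j]
        pairs.append((E[pi:i], C[pj:j]))
        i, j = pi, pj
    return S[l][n], list(reversed(pairs))

# Toy run on the "Smith" example; probabilities are made up.
tu = {("S", "shih"): 0.6, ("m", "m"): 0.9, ("i", "i"): 0.8, ("th", "ssu"): 0.5}
mt = {(1, 4): 0.25, (1, 1): 0.5, (2, 3): 0.25}
score, pairs = viterbi_align("Smith", "shihmissu", tu, mt)
print(pairs)  # -> [('S', 'shih'), ('m', 'm'), ('i', 'i'), ('th', 'ssu')]
```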
3.1 Linguistic Processing

Language-dependent knowledge can be integrated to further improve performance, especially when we focus on a specific language pair.

Linguistic Processing Rule 1 (R1): Some source words have both a transliteration and a translation, which are equally acceptable and can be used interchangeably. For example, the source word "England" is translated into "Ying Guo (Yingkuo)" and transliterated into "Ying Ge Lan (Yingkolan)", as shown in Figure 5. Since the proposed model is designed specifically for transliteration, such cases may cause problems. One way to overcome this limitation is to handle those cases using a list of commonly used proper names and their translations.

From error analysis of the aligned results on the training set, the proposed approach suffers from fluid TUs, such as "t", "d", "tt", "dd", "te", and "de". Sometimes they are omitted in the transliteration, and sometimes they are transliterated as a Chinese character. For instance, "d" is usually transliterated into "Te" or "De" (distinct Chinese characters with similar pronunciations), corresponding to the Chinese TU "te". The English TU "d" is transliterated as "De" in (Clifford, Ke Li Fu De), but left out in (Radford, Lei De Fu).

[Figure 6: an example of transliterated word extraction for "David".]

Linguistic Processing Rule 2 (R2): The problem caused by fluid TUs can be partly overcome by adding more linguistic constraints in a post-processing phase. We calculate the distribution of Chinese characters used in proper nouns from a bilingual proper-name list; a small set of Chinese characters accounts for most transliterations. It is therefore possible to improve performance by pruning, from the transliteration candidates, extra trailing characters that do not belong to the transliterated character set. For instance, the probability of "De", "Qu", "Shuo", "Shi", or "You" being used in a transliteration is very low, so the correct transliteration "Da Wei" for the source word "David" can be extracted by removing the trailing character "De". A sketch of this pruning step follows.
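The sketch below illustrates the R2 post-processing step under stated assumptions: TRANSLIT_CHARS stands in for the paper's set of 735 transliteration characters, and the tiny set of romanized syllables here is a hypothetical placeholder (distinct Chinese characters sharing a romanization are conflated for simplicity):

```python
# Hypothetical stand-in for the transliterated character set.
TRANSLIT_CHARS = {"Da", "Wei", "Huai", "Hai"}

def prune_tail(candidate_syllables):
    """Remove trailing characters that fall outside the transliteration set."""
    pruned = list(candidate_syllables)
    while pruned and pruned[-1] not in TRANSLIT_CHARS:
        pruned.pop()
    return pruned

# "Da Wei De" -> ["Da", "Wei"]: the unlikely tail "De" is removed,
# recovering the correct transliteration for "David".
print(prune_tail(["Da", "Wei", "De"]))
```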
3.2 Working Flow by Integrating Linguistic and Statistical Information

Combining the linguistic processing rules and the transliteration model, the algorithm for transliteration extraction is as follows:

Step 1: Look up the translation list described in R1. If the translation of a source word appears both in an entry of the translation list and in the aligned target sentence (or paragraph), then pick the translation as the target word. Otherwise, go to Step 2.
Step 2: Pass the source word and its aligned target sentence (or paragraph) through the proposed model to extract the target word.
Step 3: Apply linguistic processing rule R2 to remove superfluous trailing characters from the target word.

After the above processing, the performance of source-target word extraction is significantly improved, as the experiments below show.

4 Experiments

In this section, we focus on the experimental setup and the performance evaluation of the proposed model.

4.1 Experimental Setup

The corpus T0 used for training consists of 2,430 pairs of English names together with their Chinese transliterations. Two experiments were conducted. In the first, we analyze the convergence characteristics of model training within a similarity-based framework (Chen et al., 1998; Lin and Chen, 2002). A validation set T1, consisting of 150 unseen person-name pairs, was collected from Sinorama Magazine (Sinorama, 2002). For each transliterated word in T1, a set of 1,557 proper names is used as the pool of potential answers. In the second experiment, a parallel corpus T2 was prepared to evaluate the performance of the proposed methods. T2 consists of 500 bilingual examples from the English-Chinese version of the Longman Dictionary of Contemporary English (LDOCE) (Proctor, 1988).

4.2 Evaluation Metric

In the first experiment, a set of source words is compared with a given target word and ranked by similarity score; the source word with the highest similarity score is chosen as the answer to the back-transliteration problem. Performance is evaluated by the Average Rank (AR) and the Average Reciprocal Rank (ARR), following Voorhees and Tice (2000):

  AR = \frac{1}{N} \sum_{i=1}^{N} R(i),  \quad  ARR = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{R(i)},

where N is the number of test items and R(i) is the rank of the i-th test item. Higher values of ARR indicate better performance.

In Figure 7, we show the AR and ARR rates for the validation set T1 as the number of EM training iterations varies from 1 to 6. We note that the rates saturate at the second iteration, which indicates the efficiency of the proposed training approach.

[Figure 7: AR and ARR rates versus number of EM iterations for the validation set T1.]

As for the second experiment, performance on the extraction of transliterations is evaluated by precision and recall at the word and character levels. Since exactly one proper name in the source language and one transliteration in the target language are considered at a time, the word recall rate is the same as the word precision rate.
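A minimal sketch of the two metrics follows; the list of ranks in the usage example is made up, not the paper's data:

```python
def average_rank_metrics(ranks):
    """AR and ARR over a list of ranks R(i), per Voorhees and Tice (2000)."""
    N = len(ranks)
    AR = sum(ranks) / N
    ARR = sum(1.0 / r for r in ranks) / N
    return AR, ARR

# Toy illustration with hypothetical ranks:
print(average_rank_metrics([1, 1, 2, 1, 5]))  # -> (2.0, 0.74)
```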
For ease of evaluation, T2 was designed to contain exactly one proper name in the source language and one transliteration in the target language for each bilingual example. Therefore, if more than one proper name occurs in a bilingual example, we separate it into several testing examples. We also separate a compound proper name in one example into individual names to form multiple examples. For example, in the first case, two proper names, "Tchaikovsky" and "Stravinsky", were found in the testing sample "Tchaikovsky and Stravinsky each wrote several famous ballets". In the second case, a compound proper name, "Cyril Tourneur", was found in "No one knows who wrote that play, but it is usually ascribed to Cyril Tourneur". In a third case, however, "New York" is transliterated as a single Chinese word, "Niu Yue", so it cannot be separated into two words. The testing data for such examples were therefore constructed semi-automatically. For simplicity, we considered each proper name in the source sentence in turn and determined its corresponding transliteration independently. Table 1 shows some examples of the testing set T2.

In the experiment on transliterated-word extraction, the proposed method achieves on average an 86.0% word accuracy rate, a 94.4% character precision rate, and a 96.3% character recall rate, as shown in row 1 of Table 2. The performance can be further improved with simple statistical and linguistic processing, as shown in row 2.

[Table 2: the performance of transliterated word extraction for T2.]

In the baseline model, we find that some errors are caused by target words that are translations rather than strict transliterations, and that some source words are rendered in the target language by a mixture of transliteration and translation. R1 can therefore be viewed as a pre-processing step for extracting transliterated words. Further errors are eliminated by R2, which considers the usage of transliterated characters in the target language. In this experiment, we use a transliterated character set of 735 Chinese characters.

5 Conclusion

In this paper, we describe a framework for acquiring English-Chinese transliterated word pairs from parallel-aligned texts, together with an unsupervised learning approach to the proposed machine transliteration model. The approach automatically learns the parameters of the model from a bilingual proper-name list and is not restricted by the availability of a pronunciation dictionary in the source language. The experimental results indicate that our methods achieve excellent performance. Given the statistical nature of the proposed model, we plan to extend the experiments to bidirectional transliteration and to other corpora.