File Information

File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/04/p04-1021_intro.xml

Size: 2,920 bytes

Last Modified: 2025-10-06 14:02:23

<?xml version="1.0" standalone="yes"?>
<Paper uid="P04-1021">
  <Title>A Joint Source-Channel Model for Machine Transliteration</Title>
  <Section position="2" start_page="0" end_page="0" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In applications such as cross-lingual information retrieval (CLIR) and machine translation, there is an increasing need to translate out-of-vocabulary words from one language to another, especially from alphabetic languages into Chinese, Japanese or Korean. Proper names of English, French, German, Russian, Spanish and Arabic origin constitute a large portion of out-of-vocabulary words. They are translated through transliteration, the method of rendering a word in another language while preserving how it sounds in its original language. For writing foreign names in Chinese, transliteration always follows the original romanization. Therefore, any foreign name will have only one Pinyin (romanization of Chinese) rendering and thus only one rendering in Chinese characters.</Paragraph>
    <Paragraph position="1"> In this paper, we focus on automatic Chinese transliteration of foreign names written in alphabetic scripts. Because some alphabetic writing systems use various diacritical marks, we find it more practical to write names containing such diacritics as they are rendered in English. Therefore, we refer to all foreign-Chinese transliteration as English-Chinese transliteration, or E2C.</Paragraph>
    <Paragraph position="2"> Transliterating English names into Chinese is not straightforward. However, recovering the original name from its Chinese transliteration is even more challenging, as the E2C transliteration may have lost some of the original phonemic evidence. The Chinese-English backward transliteration process is also called back-transliteration, or C2E (Knight &amp; Graehl, 1998).</Paragraph>
    <Paragraph position="3"> In machine transliteration, the noisy channel model (NCM), based on a phoneme-based approach, has recently received considerable attention (Meng et al., 2001; Jung et al., 2000; Virga &amp; Khudanpur, 2003; Knight &amp; Graehl, 1998). In this paper we discuss the limitations of such an approach and address its problems, first by proposing a paradigm that allows direct orthographic mapping (DOM), and second by proposing a joint source-channel model as a realization of DOM. Two other machine learning techniques, NCM and the ID3 (Quinlan, 1993) decision tree, are also implemented under DOM as references for comparison with the proposed n-gram transliteration model (TM).</Paragraph>
    <Paragraph position="4"> This paper is organized as follows: In section 2, we present the transliteration problems. In section 3, a joint source-channel model is formulated. In section 4, several experiments are carried out to study different aspects of the proposed algorithm. In section 5, we relate our algorithms to other reported work. Finally, we conclude the study with some discussion.</Paragraph>
  </Section>
</Paper>