<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1142">
  <Title>Learning Transliteration Lexicons from the Web</Title>
  <Section position="2" start_page="0" end_page="1129" type="intro">
    <SectionTitle>
1 Introduction
</SectionTitle>
    <Paragraph position="0"> In applications such as cross-lingual information retrieval (CLIR) and machine translation (MT), there is an increasing need to translate out-of-vocabulary (OOV) words, for example from an alphabetic language into Chinese. Foreign proper names constitute a good portion of OOV words and are rendered in Chinese through transliteration. Transliteration is the process of translating a foreign word into a native language while preserving its pronunciation in the original language, also known as translation-by-sound. MT and CLIR systems rely heavily on bilingual lexicons, which are typically compiled manually. However, in view of the current information explosion, it is labor-intensive, if not impossible, to compile a complete lexicon of proper nouns. The Web is growing at a fast pace and provides a live information source that is rich in transliterations. This paper presents a novel solution for automatically constructing an English-Chinese transliteration lexicon from the Web.</Paragraph>
    <Paragraph position="1"> Research on automatic transliteration has reported promising results for regular transliteration (Wan and Verspoor, 1998; Li et al., 2004), where transliterations follow rigid guidelines. However, in Web publishing, translators in different countries and regions may not observe common guidelines. They often skew the transliterations in different ways to give the sound equivalents special meanings, resulting in casual transliterations. In this case, the common generative models (Li et al., 2004) fail to predict the transliteration most of the time. For example, &quot;Coca Cola&quot; is transliterated into &quot;Ke Kou Ke Le /Ke-Kou-Ke-Le/&quot; as a sound equivalent in Chinese, which literally means &quot;happiness in the mouth&quot;. In this paper, we are interested in constructing lexicons that cover both regular and casual transliterations.</Paragraph>
    <Paragraph position="2"> When a new English word is first introduced, many transliterations are invented. Most of them are casual transliterations because a regular transliteration typically does not have many variations. After a while, the transliterations converge into one or two popular ones. For example, &quot;Taxi&quot; becomes &quot;De Shi /Di-Shi/&quot; in China and &quot;De Shi /De-Shi/&quot; in Singapore. Therefore, the adequacy of a transliteration entry could be judged by its popularity and its conformity with the translation-by-sound principle. In any case, the phonetic similarity should serve as the primary basis of judgment.</Paragraph>
    <Paragraph position="3"> This paper is organized as follows. In Section 2, we briefly review prior work on machine transliteration. In Section 3, we propose a phonetic similarity model (PSM) for confidence scoring of transliterations. In Section 4, we propose an adaptive learning process for PSM modeling and lexicon construction. In Section 5, we conduct experiments to evaluate different adaptive learning strategies. Finally, we conclude in Section 6.</Paragraph>
  </Section>
</Paper>