<?xml version="1.0" standalone="yes"?>
<Paper uid="P06-1142">
  <Title>Learning Transliteration Lexicons from the Web</Title>
  <Section position="3" start_page="1129" end_page="1129" type="relat">
    <SectionTitle>
2 Related Work
</SectionTitle>
    <Paragraph position="0"> In general, studies of transliteration fall into two categories: transliteration modeling (TM) and extraction of transliteration pairs (EX) from corpora.</Paragraph>
    <Paragraph position="1"> The TM approach models phoneme-based or grapheme-based mapping rules using a generative model that is trained from a large bilingual lexicon, with the objective of translating unknown words on the fly. The efforts are centered on establishing the phonetic relationship between transliteration pairs. Most of these works are devoted to phoneme1-based transliteration modeling (Wan and Verspoor 1998, Knight and Graehl, 1998). Suppose that EW is an English word and CW is its prospective Chinese transliteration. The phoneme-based approach first converts EW into an intermediate phonemic representation P, and then converts the phonemic representation into its Chinese counterpart CW. In this way, EW and CW form an E-C transliteration pair.</Paragraph>
    <Paragraph position="2"> In this approach, we model the transliteration using two conditional probabilities, P(CW|P) and P(P|EW), in a generative model P(CW|EW) = P(CW|P)P(P|EW). Meng (2001) proposed a rule-based mapping approach. Virga and Khudanpur (2003) and Kuo et al (2005) adopted the noisy-channel modeling framework. Li et al (2004) took a different approach by introducing a joint source-channel model for direct orthography mapping (DOM), which treats transliteration as a statistical machine translation problem under monotonic constraints. The DOM approach, which is a grapheme-based approach, significantly outperforms the phoneme-based approaches in regular transliterations. It is noted that the state-of-the-art accuracy reported by Li et al (2004) for regular transliterations of the Xinhua database is about 70.1%, which leaves much room for improvement if one expects to use a generative model to construct a lexicon for casual transliterations.</Paragraph>
    <Paragraph position="3"> EX research is motivated by information retrieval techniques, where people attempt to extract transliteration pairs from corpora. The EX approach aims to construct a large and up-to-date transliteration lexicon from live corpora. Towards this objective, some have proposed extracting translation pairs from parallel or comparable bitext using co-occurrence analysis 1 Both phoneme and syllable based approaches are referred to as phoneme-based here.</Paragraph>
    <Paragraph position="4"> or a context-vector approach (Fung and Yee, 1998; Nie et al, 1999). These methods compare the semantic similarities between words without taking their phonetic similarities into accounts. Lee and Chang (2003) proposed using a probabilistic model to identify E-C pairs from aligned sentences using phonetic clues. Lam et al (2004) proposed using semantic and phonetic clues to extract E-C pairs from comparable corpora. However, these approaches are subject to the availability of parallel or comparable bitext. A method that explores non-aligned text was proposed by harvesting katakana-English pairs from query logs (Brill et al, 2001). It was discovered that the unsupervised learning of such a transliteration model could be overwhelmed by noisy data, resulting in a decrease in model accuracy.</Paragraph>
    <Paragraph position="5"> Many efforts have been made in using Web-based resources for harvesting transliteration/ translation pairs. These include exploring query logs (Brill et al, 2001), unrelated corpus (Rapp, 1999), and parallel or comparable corpus (Fung and Yee, 1998; Nie et al, 1999; Huang et al 2005). To establish correspondence, these algorithms usually rely on one or more statistical clues, such as the correlation between word frequencies, cognates of similar spelling or pronunciations. They include two aspects. First, a robust mechanism that establishes statistical relationships between bilingual words, such as a phonetic similarity model which is motivated by the TM research; and second, an effective learning framework that is able to adaptively discover new events from the Web. In the prior work, most of the phonetic similarity models were trained on a static lexicon. In this paper, we address the EX problem by exploiting a novel Web-based resource. We also propose a phonetic similarity model that generates confidence scores for the validation of E-C pairs.</Paragraph>
    <Paragraph position="6"> In Chinese webpages, translated or transliterated terms are frequently accompanied by their original Latin words. The latter serve as the appositives of the former. A sample search result for the query submission &amp;quot;Kuro&amp;quot; is the bilingual snippet2 &amp;quot;...Jing Ying KuroKu Luo P2PYin Le Jiao Huan Ruan Ti De Fei Xing Wang ,3 Ri Fa Biao P2P Yu Ban Quan Zheng Yi De Jie Jue Fang An -- C2C (Content to Community)...&amp;quot;. The co-occurrence statistics in such a snippet was shown to be useful in constructing a transitive translation model (Lu et al, 2002). In the</Paragraph>
  </Section>
class="xml-element"></Paper>