<?xml version="1.0" standalone="yes"?>
<Paper uid="C00-1061">
  <Title>English-to-Korean Transliteration using Multiple Unbounded Overlapping Phoneme Chunks</Title>
  <Section position="3" start_page="418" end_page="419" type="metho">
    <SectionTitle>
2 English-to-Korean transliteration
</SectionTitle>
    <Paragraph position="0"> E-K transliteration models are classified into two methods: the pivot method and the direct method. In the pivot method, transliteration is done in two steps: converting English words into pronunciation symbols and then converting these symbols into Korean words by using the Korean standard conversion rule. In the direct method, English words are directly converted to Korean words without intermediate steps. An experiment shows that the direct method is better than the pivot method in finding variations of a transliteration (Lee and Choi, 1998). Statistical information, neural networks, and decision trees have been used to implement the direct method.</Paragraph>
    <Section position="1" start_page="418" end_page="418" type="sub_section">
      <SectionTitle>
2.1 Statistical Transliteration method
</SectionTitle>
      <Paragraph position="0"> An English word is divided into a phoneme sequence or an alphabet sequence as e_1, e_2, ..., e_n.</Paragraph>
      <Paragraph position="1"> Then a corresponding Korean word is represented as k_1, k_2, ..., k_n. If a corresponding Korean character (k_i) does not exist, we fill the blank with '-'. For example, the English word "dressing" and the Korean word "드레싱 (tuleysing)" are represented as in Fig. 1. The upper one in Fig. 1 is divided into English phoneme units and the lower one is divided into alphabet units.</Paragraph>
      <Paragraph position="2"> [Fig. 1: alignment of "dressing" with its Korean transliteration, in phoneme units (upper) and alphabet units (lower)]</Paragraph>
      <Paragraph position="4"> The problem in the statistical transliteration method is to find the most probable transliteration for a given word. Let P(K) be the probability of a Korean word K; then, for a given English word E, the transliteration probability of a word K can be written as P(K|E). By using Bayes' theorem, we can rewrite the transliteration problem as follows:</Paragraph>
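The displayed equation is missing from this copy. A standard Bayes decomposition consistent with the definitions above, offered as a reconstruction rather than the authors' exact formula, would read as follows. The symbol \hat{K} for the selected transliteration is an assumption introduced here.

```latex
\hat{K} = \arg\max_{K} P(K \mid E)
        = \arg\max_{K} \frac{P(K)\,P(E \mid K)}{P(E)}
        = \arg\max_{K} P(K)\,P(E \mid K)
```

The last step drops P(E) because it is constant for a given English word and so does not affect the maximisation.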
      <Paragraph position="6"> As we do not know the pronunciation of a given word, we consider all possible phoneme sequences. For example, 'data' has the following possible phoneme sequences: 'd-a-t-a, d-at-a, da-ta, ...'.</Paragraph>
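The enumeration of candidate phoneme sequences can be sketched as follows. This is a minimal illustration, assuming that every contiguous split of the letters counts as a candidate; the text does not say which splits are admissible phonemes, and the function name `segmentations` is hypothetical.

```python
from itertools import combinations

def segmentations(word):
    """Yield every way to split `word` into contiguous, non-empty pieces."""
    n = len(word)
    # Choose k cut points among the n - 1 gaps between letters.
    for k in range(n):
        for cuts in combinations(range(1, n), k):
            bounds = (0,) + cuts + (n,)
            yield [word[a:b] for a, b in zip(bounds, bounds[1:])]

# Candidate sequences for the example word from the text.
candidates = ["-".join(pieces) for pieces in segmentations("data")]
```

A word of n letters has 2^(n-1) such splits, which is why the search space grows quickly with word length.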
      <Paragraph position="7"> As the history length is lengthened, we can get more discrimination. But long history information causes a data sparseness problem. In order to solve the sparseness problem, the Maximum Entropy Model, Back-off, and Linear interpolation methods are used. They combine different statistical estimators. (Tae-il Kim, 2000) uses up to five phonemes in feature functions (Berger et al., 1996). Nine feature functions are combined with the Maximum Entropy Method.</Paragraph>
    </Section>
    <Section position="2" start_page="418" end_page="419" type="sub_section">
      <SectionTitle>
2.2 Neural Network and Decision Tree
</SectionTitle>
      <Paragraph position="0"> Methods based on neural networks and decision trees deterministically decide a Korean character for a given English input. These methods take two or three alphabets or phonemes as an input and generate a Korean alphabet or phoneme as an output. (Jung-Jae Kim, 1999) proposed a neural network method that uses two surrounding phonemes as an input.</Paragraph>
      <Paragraph position="1"> (Kang, 1999) proposed a decision tree method that uses six surrounding alphabets. If an input does not cover the phenomena of proper transliterations, we cannot get a correct answer. Even though we use combining methods to solve the data sparseness problem, an increase in the information length would double the complexity and the time cost of the problem. It is not easy to increase the information length. To avoid these difficulties, previous studies do not use previous outputs (k_{i-1}). But this loses good information about the target language.</Paragraph>
      <Paragraph position="2"> Our proposed method is based on the direct method to extract the transliteration and its variations. Unlike other methods that determine a certain input unit's output with history information, we increase the reliability of a certain transliteration with known E-K transliteration phenomena (phoneme chunks).</Paragraph>
    </Section>
  </Section>
  <Section position="4" start_page="419" end_page="420" type="metho">
    <SectionTitle>
3 Transliteration using Multiple unbounded overlapping phoneme chunks
</SectionTitle>
    <Paragraph position="0"> For unknown data, we can estimate a Korean transliteration from hand-written rules. We can also predict a Korean transliteration with experimental information. With known English and Korean transliteration pairs, we can assume possible transliterations without linguistic knowledge. For example, 'scalar' has common parts with 'scale: 스케일 (sukheyil)', 'casino: 카지노 (khacino)', 'koala: 코알라 (khoalla)', and 'car: 카 (kha)' (Fig. 2). We can assume possible transliterations from these words and their transliterations. From 'scale' and its transliteration 스케일 (sukheyil), the 'sc' in 'scalar' can be transliterated as '스ㅋ (sukh)'. From the 'casino' example, the 'c' gains more evidence that it can be transliterated as 'ㅋ (kh)'. We assume that we can get a correct Korean transliteration if we have useful experimental information and proper weights that represent its reliability.</Paragraph>
    <Section position="1" start_page="419" end_page="419" type="sub_section">
      <SectionTitle>
3.1 The alignment of an English word with a Korean word
</SectionTitle>
      <Paragraph position="0"> We can align an English word with its transliteration in alphabet units or in phoneme units. Korean vowels are usually aligned with English vowels and Korean consonants are aligned with English consonants. For example, the Korean consonant 'ㅂ (p)' can be aligned with the English consonants 'b', 'p', and 'v'. With this heuristic we can align an English word with its transliteration in alphabet units and in phoneme units with an accuracy of 99.4% (Kang, 1999).</Paragraph>
      <Paragraph position="1"> [Fig. 2: 'scalar' aligned with known words that share parts with it, such as 'scale']</Paragraph>
    </Section>
    <Section position="2" start_page="419" end_page="420" type="sub_section">
      <SectionTitle>
3.2 Extraction of Phoneme Chunks
</SectionTitle>
      <Paragraph position="0"> From aligned training data, we extract phoneme chunks. We enumerate all possible subsets of the given English-Korean aligned pair. While enumerating subsets, we add start and end position information. From the aligned data "dressing" and "드레싱 (tuleysing)", we can get subsets as in Table 1 (see footnote 2).</Paragraph>
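The chunk extraction step can be sketched as follows. This is a minimal illustration under two assumptions: "subsets" is read as contiguous spans of the aligned pair, and the marker '@' (from the footnote) is attached at both word boundaries. The function name `extract_chunks` and the particular unit-by-unit alignment of "dressing" are hypothetical and chosen only so that the romanised outputs concatenate to "tuleysing".

```python
def extract_chunks(aligned):
    """Return (context, output) chunks for every contiguous span of an aligned pair."""
    # '@' marks the start and end of the word on both sides.
    padded = [("@", "@")] + list(aligned) + [("@", "@")]
    chunks = []
    for start in range(len(padded)):
        for end in range(start + 1, len(padded) + 1):
            span = padded[start:end]
            if all(e == "@" for e, _ in span):
                continue  # a lone boundary marker carries no information
            context = "".join(e for e, _ in span)
            # '-' is the filler for "no Korean character", so it is dropped.
            output = "".join(k for _, k in span if k != "-")
            chunks.append((context, output))
    return chunks

# Illustrative alignment only (romanised Korean side), not taken from the paper.
pair = [("d", "tu"), ("r", "l"), ("e", "ey"), ("ss", "s"), ("i", "i"), ("ng", "ng")]
chunks = extract_chunks(pair)
```

Because every span is kept, short and long chunks for the same word coexist, which is what later lets overlapping chunks reinforce each other.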
      <Paragraph position="1">  The context stands for the given English alphabets, and the output stands for its transliteration. We assign a proper weight to each phoneme chunk with Equation 4.</Paragraph>
      <Paragraph position="3"> C(x) means the frequency of x in the training data.</Paragraph>
      <Paragraph position="4"> Equation 4 shows that an ambiguous phenomenon gets less evidence. The chunk weight is transmitted to each phoneme symbol.</Paragraph>
      <Paragraph position="5"> To compensate for the length of the phoneme, we multiply the weight of the phoneme chunk by the length of the phoneme (Fig. 3).</Paragraph>
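Equation 4 itself is missing from this copy. The sketch below therefore rests on a loudly stated assumption: the weight of a chunk is taken to be the relative frequency C(context:output) / C(context), which matches the remark that ambiguous contexts receive less evidence but may not be the authors' exact formula. The length compensation is likewise assumed to mean multiplying by the number of characters in the context. The function name `chunk_weights` is hypothetical.

```python
from collections import Counter

def chunk_weights(chunks):
    """Map each (context, output) chunk to an assumed length-compensated weight."""
    pair_count = Counter(chunks)                                # C(context:output)
    context_count = Counter(context for context, _ in chunks)  # C(context)
    weights = {}
    for (context, output), n in pair_count.items():
        relative = n / context_count[context]   # assumed form of Equation 4
        weights[(context, output)] = relative * len(context)  # assumed length factor
    return weights

# Toy counts, invented for illustration only.
observed = [("c", "kh"), ("c", "kh"), ("c", "s"), ("sc", "sukh")]
weights = chunk_weights(observed)
```

In this toy data the ambiguous context 'c' splits its evidence between two outputs, while the unambiguous and longer context 'sc' receives a larger weight.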
      <Paragraph position="6"> 2 @ means the start and end position of a word</Paragraph>
      <Paragraph position="8"> This chunk weight does not mean the reliability of a given transliteration phenomenon.</Paragraph>
      <Paragraph position="9"> We know the real reliability only after all overlapping phoneme chunks are applied. A chunk that has some part in common with other chunks gives context information to them. Therefore a chunk is not only an input unit but also a means to calculate the reliability of other chunks.</Paragraph>
      <Paragraph position="10"> We also extract the connection information. From aligned training data, we obtain all possible combinations of Korean characters and English characters. With this connection information, we exclude impossible connections of Korean characters and English phoneme sequences. We can get the following connection information from the "dressing" example (Table 2).</Paragraph>
    </Section>
    <Section position="3" start_page="420" end_page="420" type="sub_section">
      <SectionTitle>
3.3 A Transliteration Network
</SectionTitle>
      <Paragraph position="0"> For a given word, we get all possible phonemes and make a Korean transliteration network.</Paragraph>
      <Paragraph position="1"> Each node in the network has an English phoneme and a corresponding Korean character. Nodes are connected in sequence order. For example, 'scalar' has the Korean transliteration network shown in Fig. 4. In this network, we disconnect some nodes using the extracted connection information. After drawing the Korean transliteration network, we apply all possible phoneme chunks to the network. Each node increases its own weight by the weight of the phoneme symbol in a phoneme chunk (Fig. 5). By overlapping the weights, nodes in the longer chunks get more evidence. Then we get the best path that has the</Paragraph>
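The path search over the network can be sketched as follows. The sentence describing the selection criterion is cut off in this copy, so the sketch assumes that the best path is the one with the highest summed node weight. The node layout (start, end, korean, weight), the `allowed` set standing in for the connection information, and the function name `best_path` are all hypothetical, and the weights are invented.

```python
def best_path(n, nodes, allowed=None):
    """Return (score, korean_units) for the highest-weight path covering 0..n.

    nodes: list of (start, end, korean, weight) with end greater than start.
    allowed: optional set of permitted (previous_korean, korean) connections.
    """
    # best[pos] maps the last Korean unit to the best (score, path) ending at pos.
    best = {0: {None: (0.0, [])}}
    for pos in range(n):
        for start, end, korean, weight in nodes:
            if start != pos or pos not in best:
                continue
            for prev, (score, path) in best[pos].items():
                if allowed is not None and prev is not None and (prev, korean) not in allowed:
                    continue  # connection excluded by the connection information
                cand = (score + weight, path + [korean])
                slot = best.setdefault(end, {})
                if korean not in slot or cand[0] > slot[korean][0]:
                    slot[korean] = cand
    finals = best.get(n, {})
    if not finals:
        return None
    return max(finals.values(), key=lambda item: item[0])

# Toy network for the two-letter input 'ca'; weights are invented.
toy_nodes = [(0, 1, "kh", 2.0), (0, 1, "s", 0.5), (1, 2, "a", 1.0), (0, 2, "kha", 2.5)]
unrestricted = best_path(2, toy_nodes)
restricted = best_path(2, toy_nodes, allowed={("s", "a")})
```

Keeping one best entry per last Korean unit at each position is what allows the connection check to be applied edge by edge without enumerating every path.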
    </Section>
  </Section>
  <Section position="5" start_page="420" end_page="421" type="metho">
    <SectionTitle>
4 E-K back-transliteration
</SectionTitle>
    <Paragraph position="0"> E-K back-transliteration is a more difficult problem than E-K transliteration. During E-K transliteration, different alphabets are treated equivalently. For example, 'f, p' and 'v, b' [...] respectively, and the long sound and the short sound are also treated equivalently. Therefore, the number of possible English phonemes per Korean character is bigger than the number of Korean characters per English phoneme.</Paragraph>
    <Paragraph position="1"> The ambiguity is increased. In E-K back-transliteration, Korean phonemes and English phonemes simply switch their roles. A Korean word is aligned with an English word in phoneme units or character units (Fig. 6).</Paragraph>
    <Paragraph position="3"/>
  </Section>
</Paper>