<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1111"> <Title>A Statistical Model for Hangeul-Hanja Conversion in Terminology Domain</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Related Works </SectionTitle> <Paragraph position="0"> Related work falls into several areas, grouped by task and approach: earlier input methods for Korean Hanja, Japanese Kanji (Chinese characters used in Japanese), and Chinese Pinyin; and English-Korean transliteration.</Paragraph> <Paragraph position="1"> Korean IMEs (Haansoft, 2002; Microsoft, 2002) support word-based Hangeul-to-Hanja conversion. They list all possible Hanja correspondences for every Hanja-related Hangeul word in the user-selected range, without candidate ranking or sino-Korean word recognition. The user has to identify the sino-Korean words and pick the correct Hanja correspondence for each. Word tokenization uses a left-first longest-match method; neither context nor statistical information is considered when offering correspondences, except for the last-used-first approach in one Korean IME (Microsoft, 2002).</Paragraph> <Paragraph position="2"> A Hangeul-Hanja conversion method based on multiple knowledge sources was also proposed (Lee, 1996). This knowledge-based approach used case frames, noun-noun collocations, co-occurrence patterns between two nouns, last-used-first, and frequency information to distinguish the senses of sino-Korean words and select the correct Hanja correspondence for the given Hangeul writing. Lee (1996) reported that practical use requires a sufficiently large knowledge base to be developed, including a case-frame dictionary, a collocation base, and co-occurrence patterns.</Paragraph> <Paragraph position="3"> Several methods have been proposed for Japanese Kana-Kanji conversion, including last-used-first, most-used-first, nearby-character, collocation, and case-frame based approaches. 
The word co-occurrence pattern approach (Yamashita, 1988) and the case-frame based approach (Abe, 1986) were reported to achieve quite high precision. Their disadvantages are that a sufficiently large knowledge base must be developed beforehand and that the case-frame based approach requires a syntactic analyzer.</Paragraph> <Paragraph position="4"> Chinese Pinyin conversion is a task similar to Hangeul-Hanja conversion, except that all Pinyin syllables are converted to Chinese characters. To convert Pinyin P to Chinese characters H, Chen and Lee (2000) used Bayes' law to maximize Pr(H|P), which factors into a language model (LM) Pr(H) and a typing model Pr(P|H). The typing model reflects online typing errors and also measures whether the input is an English or a Chinese word. According to their report, the statistical Pinyin conversion method gave better results than the rule-based and heuristic-based Pinyin conversion methods.</Paragraph> <Paragraph position="5"> Hangeul-Hanja conversion normally does not need to convert online input. We therefore assume that the user input is perfect and employ a transfer model instead of the typing model of Chen and Lee (2000).</Paragraph> <Paragraph position="6"> The last related area is transliteration. In statistical English-Korean transliteration, to convert an English word E to a Korean word K, a model can use a Korean LM Pr(K) and a TM Pr(E|K) (Lee, 1999; Kim et al., 1999) to maximize Pr(K|E), or use an English LM Pr(E) and a TM Pr(K|E) to maximize Pr(E,K) (Jung et al., 2000).</Paragraph> </Section> </Paper>