<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1111"> <Title>A Statistical Model for Hangeul-Hanja Conversion in Terminology Domain</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 The Model </SectionTitle> <Paragraph position="0"> Unlike previous Hangeul-Hanja conversion methods in Korean IMEs, our system uses statistical information both for sino-Korean word recognition and for selecting the best Hanja correspondence. The model consists of two sub-models: a Hangeul-Hanja transfer model (TM) and a Hanja language model (LM). Together they provide a unified approach to the whole conversion process, covering compound-word tokenization, sino-Korean word recognition, and selection of the correct Hanja correspondence.</Paragraph> <Paragraph position="1"> Let S be a Hangeul string (block) no longer than a sentence. Among all hypothesized Hanja conversions T, the task is to find the most likely T*, i.e. the most likely sequence of Hanja and/or Hangeul characters/words, so as to maximize the probability Pr(S, T): T* = argmax_T Pr(S, T).</Paragraph> <Paragraph position="2"> Pr(S, T) could be the transfer probability Pr(T|S) itself. Like the Pinyin IME model (Chen and Lee, 2000), we also use a Hanja LM Pr(T) to measure the probabilities of hypothesized Hanja and/or Hangeul sequences. The model is sentence-based: it chooses the most probable Hanja/Hangeul word according to the context.</Paragraph> <Paragraph position="3"> The model therefore has two parts, the TM Pr(T|S) and the LM Pr(T). We have:</Paragraph> <Paragraph position="4"> T^* = \arg\max_T \Pr(T|S)\,\Pr(T) \approx \arg\max_T \prod_{i=1}^{n} \Pr(t_i|s_i)\,\Pr(t_i) \quad (1) </Paragraph> <Paragraph position="5"> T is a word sequence composed of t_1, t_2, ..., t_n, where each t_i can be either a Hanja or a Hangeul word/character. Note that the model in equation (1) does not follow Bayes' law; it is simply a combination of the TM and the LM, in which the TM reflects the transfer probability and the LM reflects context information. Using a linearly interpolated bigram as the LM, the model in equation (1) can be rewritten as equation (2).</Paragraph> <Paragraph position="6"> T^* = \arg\max_T \prod_{i=1}^{n} \Pr(t_i|s_i)\,\bigl(\lambda \Pr(t_i|t_{i-1}) + (1-\lambda)\Pr(t_i)\bigr) \quad (2) </Paragraph> <Paragraph position="7"> Word tokenization is also a hidden process in model (2), so both T = t_1, t_2, ..., t_n and T' = t'_1, t'_2, ..., t'_m can be correspondences of a given source sentence S. In practice, a Viterbi algorithm is used to search for the best sequence T*.</Paragraph> <Paragraph position="8"> We do not use the noisy channel model T* = argmax_T Pr(S|T)Pr(T), because most Hanja characters have only one Hangeul writing, so most of the Pr(S|T) values tend to be 1. If we used the noisy channel model for Hangeul-Hanja conversion, it would therefore reduce to the Hanja LM Pr(T) in most cases.</Paragraph> </Section>
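To make the decoding in equations (1)-(2) concrete, the following is a minimal sketch (not the authors' implementation) of the combination model with a Viterbi search: each source unit receives candidate conversions scored by a transfer probability and an interpolated bigram LM, and the best-scoring sequence T* is traced back. The probability tables, the example word "hak-kyo" (school, 學校), and the candidate lookup are toy placeholders.

```python
import math

# Toy probability tables for illustration only; in the paper these statistics
# are estimated from the dictionary corpus, a user corpus, and a Chinese corpus.
P_TRANSFER = {("hak-kyo", "學校"): 0.9, ("hak-kyo", "hak-kyo"): 0.1}  # Pr(t|s)
P_UNIGRAM = {"學校": 0.001, "hak-kyo": 0.0005}                        # Pr(t)
P_BIGRAM = {}                                                          # Pr(t|t_prev)

LAMBDA = 0.7   # interpolation weight of the bigram LM
EPS = 1e-12    # floor to avoid log(0)


def lm(t, t_prev):
    """Linearly interpolated bigram LM: lambda*Pr(t|t_prev) + (1-lambda)*Pr(t)."""
    return LAMBDA * P_BIGRAM.get((t_prev, t), 0.0) + (1 - LAMBDA) * P_UNIGRAM.get(t, 0.0)


def candidates(s):
    """Hanja candidates of a Hangeul unit s; the unit itself is kept as a
    candidate so that non-sino-Korean units can stay in Hangeul."""
    cands = [t for (src, t) in P_TRANSFER if src == s]
    return cands or [s]


def convert(source_units):
    """Viterbi search for T* = argmax_T prod_i Pr(t_i|s_i) * LM(t_i|t_{i-1})  (eq. 2)."""
    layers = [{"<s>": (0.0, None)}]   # per position: token -> (log score, back-pointer)
    for s in source_units:
        prev, cur = layers[-1], {}
        for t in candidates(s):
            tm = math.log(max(P_TRANSFER.get((s, t), 0.0), EPS))
            cur[t] = max(
                ((score + tm + math.log(max(lm(t, tp), EPS)), tp)
                 for tp, (score, _) in prev.items()),
                key=lambda x: x[0],
            )
        layers.append(cur)
    # trace back the best-scoring path
    last = max(layers[-1], key=lambda k: layers[-1][k][0])
    best_score, path, t = layers[-1][last][0], [], last
    for layer in reversed(layers[1:]):
        path.append(t)
        t = layer[t][1]
    return list(reversed(path)), best_score


# e.g. convert(["hak-kyo"]) -> (['學校'], log-probability score)
```

The character-level implementation described in Section 4.1 would additionally enumerate tokenizations of each word, so the lattice positions would be character spans rather than whole words.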
<Section position="5" start_page="0" end_page="0" type="metho"> <SectionTitle> 4 Implementation </SectionTitle> <Paragraph position="0"> Several factors should be considered in the model implementation. For example, we can adapt the model to the character level or the word level; we can adopt a TM weight as an interpolation coefficient and find the weight that gives the best result; and we can consider utilizing a Chinese corpus to overcome the sparseness problem of Hanja data. We can also limit the sino-Korean candidates to noun words only, or expand the candidates to nouns, verbs, modifiers, affixes and so on, to see which POS-tag restriction is better for Hangeul-Hanja conversion.</Paragraph> <Paragraph position="1"> We adopt the previous dictionary-based approach as our baseline system. To get higher precision in the baseline experiments, we also want to check whether a big or a small dictionary is better for Hangeul-Hanja conversion.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.1 Word Level or Character Level </SectionTitle> <Paragraph position="0"> The model can be implemented at two levels. In the word-level implementation, s_i in equation (2) is a Hangeul word. In the character-level implementation, s_i is a sequence of Hangeul characters.</Paragraph> <Paragraph position="1"> In the word-level implementation there is no word tokenization after POS tagging, so an unknown or compound word is treated as a single word without further tokenization. The advantage of the word-level implementation is that no noise is introduced by tokenization errors. Its disadvantage is that the system is weak at converting unknown and compound words.</Paragraph> <Paragraph position="2"> On the contrary, in the character-level implementation word tokenization is performed as a hidden process of the model. There are several reasons why word tokenization is required even after POS tagging. First, the morphological analysis dictionary is different from the Hangeul-Hanja word dictionary, so a compound word in the morphological dictionary may still be an unknown word in the Hangeul-Hanja dictionary. Second, some words remain unknown even after POS tagging, and this situation is quite serious in the terminology and technical domains. The character-level implementation tokenizes a given word into all possible character strings and finds the best tokenization by searching for the most likely T* via equation (2). Obviously, the character-level implementation is better than the word-level one for unknown and compound word conversion, but it also risks introducing too much noise through tokenization errors. We have to determine which one is better through experiments.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.2 Transfer Model Weight </SectionTitle> <Paragraph position="0"> Our model in equation (2) is not derived from Bayes' law. We simply use the conditional probability Pr(T|S) to reflect the Hangeul-Hanja conversion possibility, and assume the Hanja LM Pr(T) is helpful for smoothing the output. Since the model is only a combination model, we need an interpolation coefficient α, the TM weight, to find the best way of combining the two parts. Taking the log, equation (2) can be rewritten as equation (3):</Paragraph> <Paragraph position="1"> T^* = \arg\max_T \sum_{i=1}^{n} \bigl[\, \alpha \log \Pr(t_i|s_i) + (1-\alpha)\log\bigl(\lambda \Pr(t_i|t_{i-1}) + (1-\lambda)\Pr(t_i)\bigr) \,\bigr] \quad (3) where α ∈ [0,1] is the TM weight.</Paragraph> <Paragraph position="2"> When α takes a value between 0 and 1, the model is a combination model; when α = 1 it is a pure TM, and when α = 0 it is a pure LM.</Paragraph> <Paragraph position="3"> For the LM, we test both a unigram and a bigram in the word-level experiments. The interpolated bigram in equation (3) is used for the character-level implementation.</Paragraph> </Section>
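The following is a minimal sketch of the scoring rule in equation (3), assuming hypothetical probability functions passed in by the caller; it is meant only to show how the TM weight α trades off the transfer model against the interpolated bigram LM.

```python
import math

EPS = 1e-12  # floor to avoid log(0)


def path_score(pairs, p_transfer, p_bigram, p_unigram, alpha=0.5, lam=0.7):
    """Log score of one candidate conversion under equation (3):
    sum_i [ alpha*log Pr(t_i|s_i) + (1-alpha)*log(lam*Pr(t_i|t_{i-1}) + (1-lam)*Pr(t_i)) ].

    pairs      : list of (s_i, t_i) source/target units in sentence order
    p_transfer : function (s, t) -> Pr(t|s)
    p_bigram   : function (t_prev, t) -> Pr(t|t_prev)
    p_unigram  : function (t) -> Pr(t)
    alpha      : TM weight; alpha=1 gives a pure TM, alpha=0 a pure LM
    lam        : interpolation weight of the bigram LM
    """
    score, t_prev = 0.0, "<s>"
    for s, t in pairs:
        tm = math.log(max(p_transfer(s, t), EPS))
        lm = math.log(max(lam * p_bigram(t_prev, t) + (1 - lam) * p_unigram(t), EPS))
        score += alpha * tm + (1 - alpha) * lm
        t_prev = t
    return score
```

In practice α would be swept over [0,1] on held-out data to find the combination that converts best, which is exactly the experiment motivated above.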
<Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.3 Language Resource Utilization </SectionTitle> <Paragraph position="0"> Not much Hanja data is available for Hangeul-Hanja conversion, so we treat the Hangeul-Hanja word dictionary (5.3 Mbytes in our experiments) as a dictionary corpus from which we obtain the unigram, bigram, and transfer probabilities. The data extracted from the dictionary is called dictionary data D.</Paragraph> <Paragraph position="1"> Second, we extract user data U from a very small user corpus (0.28 Mbytes in our open test), which is in the same domain as the test data.</Paragraph> <Paragraph position="2"> Finally, we assume that a Chinese corpus is helpful for Hangeul-Hanja conversion because of the historical relation between the two languages, although the words may not be exactly the same in both. We convert the encoding of the Hanja words to the Chinese one (GB in our experiment) and extract Chinese data C (unigram and bigram) for those Hanja words from a Chinese corpus, a 270-Mbyte corpus in the news domain (TREC9, 2000).</Paragraph> <Paragraph position="3"> We want to know how much these different data sets D, U, and C can help Hangeul-Hanja conversion, and we verify this through experiments.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.4 POS Tag Constraint </SectionTitle> <Paragraph position="0"> We compare two cases to see the influence of the POS tag constraint on sino-Korean recognition.</Paragraph> <Paragraph position="1"> In the first case only nouns are treated as potential sino-Korean words; in the other case we extend this to other possible POS tags, including nouns, verbs, modifiers, suffixes, and affixes. Sign, foreign, and junction words are excluded from the potential sino-Korean candidates, because such words are never sino-Korean in practice. A POS tagger is employed for the pre-processing of our system.</Paragraph> <Paragraph position="2"> In fact, most of the sino-Korean words that need a Hanja writing are nouns, but in practice the POS tagger regularly makes tagging errors.</Paragraph> <Paragraph position="3"> Such tagging errors are much more serious in the terminology and technical domains. This is one reason why we want to expand from nouns to other possible POS tags. Another reason is that the more restrictive the POS tag constraint is, the lower the coverage, although higher precision can be expected. So we test whether the constraint should be more or less restrictive.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 4.5 Dictionary Size </SectionTitle> <Paragraph position="0"> We develop a dictionary-based conversion system as our baseline system. This dictionary-based system follows the approach used in previous Korean IMEs. The difference is that our system uses a POS tagger and outputs the best candidate for every sino-Korean word, whereas previous IMEs only present all possible candidates without ranking and let the user select the correct one (a small candidate-filtering and ranking sketch is given after this section).</Paragraph> <Paragraph position="1"> Intuitively, the bigger the dictionary, the better the conversion result should be. But a word in a bigger dictionary generally has more candidates, so it is still possible that a bigger dictionary lowers the conversion performance. We therefore want to determine which is better for Hangeul-Hanja conversion in practical use.</Paragraph> <Paragraph position="2"> We use two dictionaries in the experiments: one contains 400k Hangeul-Hanja word entries and the other contains 60k entries.</Paragraph> </Section> </Section> </Paper>
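As a companion to Sections 4.4 and 4.5, the following is a minimal sketch, under assumed POS tag names and toy dictionary entries, of the dictionary-based baseline: candidates are restricted by POS tag, looked up in the Hangeul-Hanja dictionary, and the single best candidate is returned by a simple frequency ranking. The tag set, the example word "su-hak" (수학: 數學 or 修學), and the counts are illustrative placeholders, not the authors' resources.

```python
# Hypothetical POS tag names and dictionary entries for illustration only.
ALLOWED_POS = {"noun", "verb", "modifier", "suffix", "affix"}   # the looser constraint
EXCLUDED_POS = {"sign", "foreign", "junction"}                  # never sino-Korean

# toy Hangeul-Hanja dictionary: word -> Hanja candidates
HANJA_DICT = {"su-hak": ["數學", "修學"]}
# toy unigram counts used to rank the candidates of a word
UNIGRAM = {"數學": 120, "修學": 3}


def convert_baseline(tagged_tokens, noun_only=False):
    """Dictionary-based baseline: restrict sino-Korean candidates by POS tag,
    look them up in the Hangeul-Hanja dictionary, and output the single
    highest-ranked candidate instead of an unranked candidate list."""
    allowed = {"noun"} if noun_only else ALLOWED_POS
    output = []
    for word, pos in tagged_tokens:
        if pos in allowed and pos not in EXCLUDED_POS and word in HANJA_DICT:
            best = max(HANJA_DICT[word], key=lambda t: UNIGRAM.get(t, 0))
            output.append(best)
        else:
            output.append(word)   # keep the original Hangeul
    return output


# e.g. convert_baseline([("su-hak", "noun"), ("ha-da", "verb")])
# -> ['數學', 'ha-da'] under the toy dictionary above
```

Swapping between `noun_only=True` and `False`, or between a large and a small `HANJA_DICT`, mirrors the POS-constraint and dictionary-size comparisons described in Sections 4.4 and 4.5.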