<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1111"> <Title>A Statistical Model for Hangeul-Hanja Conversion in Terminology Domain</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 5 Experiment </SectionTitle> <Paragraph position="0"> This chapter shows the experiments on the model in equation 3 and some different implementations we have discussed above.</Paragraph> <Paragraph position="1"> There are two parts in the experiments, first one is mostly related to word level model implementation, in which the basic issues like language resource utilization and POS tag restriction, and some word level related issues like bigram or unigram for LM in word level are tested. The second part is mostly character level related. Several evaluation standards are employed in the experiments. The adopted standards and evaluation approaches are reported in the first section of the experiments.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.1 Evaluation Standard and Approach </SectionTitle> <Paragraph position="0"> We use several evaluation standards in the experiments. To reflect the readability from the user viewpoint, we adopt word and phrase (sentence) level accuracy, precision and recall; to compare the automatic conversion result with the standard result - from the developer viewpoint, Dice-coefficient based similarity calculation is employed also; to compare with previous Chinese Pinyin input method, a character based accuracy evaluation is also adopted.</Paragraph> <Paragraph position="1"> An automatic evaluation and analysis system is developed to support large scale experiments. The system compares the automatic result to the standard one, and performs detailed error analysis using a decision tree.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.2 Word Level Experiment </SectionTitle> <Paragraph position="0"> In this part, the basic issues like language resource utilization and POS tag restriction, and the word level related issues, like bigram or unigram for LM are performed.</Paragraph> <Paragraph position="1"> The objects of the first experiment are, firstly, compare a simple LM based statistical approach with the base line - dictionary based approach; secondly, see if large dictionary is better than small dictionary in dictionary based conversion; thirdly, see if Chinese corpus does help to the Hangeul-Hanja conversion.</Paragraph> <Paragraph position="2"> A small dictionary based conversion (Dic), large dictionary based conversion (BigDic), a unigram (Unigram) and a bigram based (Bigram) word level conversion, are performed to compared to the each other.</Paragraph> <Paragraph position="3"> The small dictionary Dic has 56,000 Hangeul-Hanja entries; while the large dictionary BigDic contains 280,000 Hangeul-Hanja entries. The unigram and bigram are extracted from Chinese data C. The test set is a small test set with 90 terms (180 content words) from terminology domain.</Paragraph> <Paragraph position="4"> Word level precision and recall with F1-measure are employed as evaluation standard.</Paragraph> <Paragraph position="5"> Statistical approach (unigram vs. 
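The paper does not spell out the metric definitions. As a hedged illustration only, the sketch below assumes the standard formulations of word-level precision/recall/F1 and of the Dice coefficient over character multisets; the function names and data layout are our own assumptions, not the authors' code.

    from collections import Counter

    def word_prf(sys_out, gold):
        # Assumed layout: sys_out / gold map word positions to the Hanja
        # string produced by the system / required by the gold standard.
        correct = sum(1 for pos, h in sys_out.items() if gold.get(pos) == h)
        p = correct / len(sys_out) if sys_out else 0.0
        r = correct / len(gold) if gold else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        return p, r, f1

    def dice_similarity(sys_text, gold_text):
        # Dice coefficient over character multisets:
        # 2 * |overlap| / (|sys| + |gold|)
        a, b = Counter(sys_text), Counter(gold_text)
        total = sum(a.values()) + sum(b.values())
        return 2 * sum((a & b).values()) / total if total else 0.0

For example, word_prf({0: "電算", 1: "工學"}, {0: "電算", 1: "工学"}) yields precision and recall of 0.5, while dice_similarity over the same strings credits the partially correct word character by character.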
<Paragraph position="5"> (Table 1: statistical approach, unigram vs. bigram.) From the results shown in Table 1, we can draw the following conclusions: 1) compared with the small dictionary, the large dictionary reaches a better F1-measure because of improved recall, although precision drops slightly since each Hangeul entry now has more Hanja candidates; 2) the statistical approach gives clearly better results than the dictionary-based approach, even though it uses only a very simple LM; 3) Chinese data does help Hangeul-Hanja conversion, although its impact still has to be evaluated against other Hanja data in further experiments; 4) the bigram gives results similar to the unigram in word-level conversion, which indicates that the data sparseness problem is still very serious.</Paragraph> <Paragraph position="6"> The objectives of the second experiment are the evaluation of different POS tag constraints and the comparison of different language resources.</Paragraph> <Paragraph position="7"> The first is the evaluation of different POS tag constraints. The system employs the unigram-based Hangeul-Hanja conversion approach with dictionary data D (word unigrams from the large dictionary are used here). We compare the case of considering only nouns as potential sino-Korean words (&quot;Dn&quot; in Table 2) with the case of extending the POS tags to verbs, modifiers, and affixes (&quot;De&quot; in Table 2). The second is the comparison of different language resources. As mentioned above, D is the data from the large dictionary (word unigrams are used here), U is the data from a very small user corpus, and C is the data from a Chinese corpus. We compare different combinations of these language resources; in this evaluation, the extended POS tag constraint is employed.</Paragraph> <Paragraph position="8"> The experiment uses a test set with 5,127 terms (12,786 content words; 4.67 Hanja candidates per sino-Korean word on average) from the computer science and electronic engineering domain. The user data U is from a user corpus that is identical to the test set here (so this is a closed test). For evaluation, the Dice-coefficient based similarity standard is employed.</Paragraph> <Paragraph position="9"> From Table 2, we can see that: 1) the extended POS tag constraint (&quot;De&quot; in Table 2) gives better results than the noun-only POS tag constraint (&quot;Dn&quot;); 2) the user data U gives better results than the dictionary data D (&quot;U&quot; vs. &quot;De&quot;, &quot;UC&quot; vs. &quot;DC&quot; in Table 2), and the dictionary data D gives better results than the Chinese data C (&quot;De&quot; vs. &quot;C&quot;), even though the Chinese corpus (from which C is extracted) is 270MB, much larger than the 5.3MB Hangeul-Hanja dictionary (from which D is extracted). This shows that the effect of the Chinese data is quite limited despite its usefulness.</Paragraph> <Paragraph position="10"> The objective of the third experiment is to find out which TM weight a is best for the word model.</Paragraph> <Paragraph position="11"> Setting a to 0, 0.5, and 1 makes the model in equation (3) the LM, the combined model, and the TM, respectively. Under the same conditions as the second experiment, we obtain the results in Table 3; word-level precision and recall with F1-measure are evaluated.</Paragraph>
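Equation (3) is not reproduced in this section. As a hedged illustration, the sketch below assumes a log-linear interpolation of TM and LM scores with weight a, which is consistent with the text's statement that a=0 yields the LM and a=1 yields the TM; the function names and the probability floor are our own assumptions, not the paper's.

    import math

    def combined_score(tm_prob, lm_prob, a, floor=1e-12):
        # Assumed form of equation (3): a=0 reduces to the LM score,
        # a=1 to the TM score.  The floor stands in for smoothing
        # details the paper does not specify.
        return (a * math.log(max(tm_prob, floor))
                + (1 - a) * math.log(max(lm_prob, floor)))

    def best_hanja(candidates, a):
        # candidates: list of (hanja, tm_prob, lm_prob) tuples for one
        # sino-Korean word; returns the highest-scoring Hanja string.
        return max(candidates, key=lambda c: combined_score(c[1], c[2], a))[0]

Sweeping a over {0, 0.5, 1} with such a scoring function corresponds to the three configurations compared in Table 3.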
<Paragraph position="12"> We can see that the TM with a=1 gives the best result.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.3 Character Level Experiment </SectionTitle> <Paragraph position="0"> In the character-level experiments we, first, compare the character-level model with the baseline dictionary-based approach; second, compare the character-level model with the word-level model; and third, find the best TM weight for the character-level model.</Paragraph> <Paragraph position="1"> This part of the experiments uses a new test set of 1,000 terms (2,727 content words; 3.9 Hanja candidates per sino-Korean word on average). The user data U contains 12,000 Hangeul-Hanja term pairs. U is from the same domain as the test set (computer science and electronic engineering), but has no overlap with the test set (so this is an open test).</Paragraph> <Paragraph position="2"> Several different evaluation standards are employed. In the first column of Table 4, &quot;CA&quot;, &quot;WA&quot;, and &quot;SA&quot; denote character, word, and sentence (term) accuracy, respectively; &quot;Sim&quot; is the similarity-based evaluation; and F1 is the word-level F1-measure derived from the word precision/recall evaluation.</Paragraph> <Paragraph position="3"> The first row of Table 4 identifies each Hangeul-Hanja conversion approach by the data employed and the TM weight a. &quot;Dic&quot; is the baseline dictionary-based approach; &quot;w&quot; denotes the word-level model; &quot;D&quot; denotes dictionary data (extracted from the large dictionary with 400,000 Hangeul-Hangeul and Hangeul-Hanja entries); U denotes the user data described above; and C denotes the Chinese data. A decimal value such as &quot;.5&quot; is the TM weight. For example, &quot;wDUC1&quot; means the word model with a=1 using all the data resources D, U, and C, while &quot;DU.2&quot; means the character model with a=0.2 using data D and U.</Paragraph> <Paragraph position="4"> From Table 4, we can draw the following conclusions: 1) all statistical-model-based approaches show obviously better performance than the baseline dictionary-based approach &quot;Dic&quot; (Dic vs. the others); 2) in most cases, the character models give better results than the word model (DUx vs. wDUC1), but when there is no user data, the word model is better than the character model (wD1 vs. D.5); 3) among the character models, the TM with a=1 gives the best result (&quot;DU1&quot; vs. &quot;DU.x&quot;); 4) user data has a positive impact on performance (&quot;wD1&quot; vs. &quot;wDUC1&quot;, &quot;D.5&quot; vs. &quot;DU.5&quot;), and it is especially important for the character model (&quot;D.5&quot; vs. &quot;DU.5&quot;), because without user data the character model suffers more noise from word tokenization errors.</Paragraph> <Paragraph position="5"> From Table 4, we can see that the best result is obtained by the character-based TM using the dictionary and user data D and U (&quot;DU1&quot;). The best character accuracy is 91.4%, with a word accuracy of 81.4%. This character accuracy is lower than that of the typing- and language-model based Chinese Pinyin IME, which was 95% in Chen & Lee (2000). But considering that our experiment has almost no Hanja data except the dictionary, and considering the extra difficulty of the terminology domain, this comparison result is quite understandable.</Paragraph>
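To make the character-level model concrete, here is a minimal sketch, under our own assumptions, of a pure-TM (a=1) character-level conversion as a Viterbi search over per-character Hanja candidates scored by a character bigram model; the paper's actual implementation and data structures may differ.

    def convert_chars(hangeul_chars, candidates, bigram_logp):
        # hangeul_chars: a sino-Korean word as a list of Hangeul chars.
        # candidates(ch): list of Hanja candidates for one Hangeul char.
        # bigram_logp(prev, cur): log P(cur | prev) from the character
        # TM, with prev=None at the word start.  Both callables are this
        # sketch's assumptions, not an API from the paper.
        best = {h: (bigram_logp(None, h), [h])
                for h in candidates(hangeul_chars[0])}
        for ch in hangeul_chars[1:]:
            nxt = {}
            for h in candidates(ch):
                # Extend the best-scoring predecessor path with h.
                prev, (score, path) = max(
                    best.items(),
                    key=lambda kv: kv[1][0] + bigram_logp(kv[0], h))
                nxt[h] = (score + bigram_logp(prev, h), path + [h])
            best = nxt
        return max(best.values(), key=lambda sp: sp[0])[1]

Because such a search operates on an already-tokenized word, any tokenization error corrupts the whole character lattice, which is consistent with the observation above that the character model depends more heavily on user data.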
<Paragraph position="6"> Our experiment also shows that, compared with using only an LM as in Chen & Lee (2000), the TM gives a significantly better character accuracy (from 81.0% to 91.4% in our experiment: &quot;DU0&quot; vs. &quot;DU1&quot;, Table 4). Our user evaluation also shows that, in the terminology domain, the automatic conversion result from the system is of even better quality than a draft produced by an untrained human translator.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 5.4 Different Evaluation Standards </SectionTitle> <Paragraph position="0"> Figure 1 shows the trends of the different evaluation standards on the same experiment as in Table 4.</Paragraph> <Paragraph position="1"> We can see that character accuracy &quot;CA&quot; shows a trend similar to the similarity-based standard &quot;Sim&quot;, while word accuracy &quot;WA&quot; and sentence (term) accuracy &quot;SA&quot; show trends similar to the F1-measure &quot;F1&quot;, which is based on word precision and recall.</Paragraph> <Paragraph position="2"> From the user viewpoint, word/sentence accuracy and the F1-measure reflect readability better than character accuracy, because a wrongly converted character in a word harms the readability of the whole word, not only of that character. However, character accuracy is more important for system evaluation, especially for the character-level model implementation, because it reflects system performance in fuller detail than the word- or sentence (term)-based standards.</Paragraph> </Section> </Section> </Paper>