<?xml version="1.0" standalone="yes"?> <Paper uid="W04-1809"> <Title>Term Extraction from Korean Corpora via Japanese</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Methodology </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Overview </SectionTitle> <Paragraph position="0"> Figure 1 exemplifies our extraction method, which produces a Japanese-Korean bilingual lexicon from a Korean corpus and a Japanese corpus and/or lexicon. The Japanese and Korean corpora do not have to be parallel or comparable. However, it is desirable that both corpora be associated with the same domain. As the Japanese resource, a corpus and a lexicon can be used either alternatively or together. Note that compiling a Japanese monolingual lexicon is less expensive than compiling a bilingual lexicon. In addition, new Katakana words can easily be extracted from a number of on-line resources, such as the World Wide Web. Thus, the use of Japanese lexicons does not decrease the utility of our method.</Paragraph> <Paragraph position="1"> First, we collect Katakana words from Japanese resources. This can be performed systematically by means of a Japanese character code, such as EUC-JP or SJIS.</Paragraph> <Paragraph position="2"> Second, we represent the Korean corpus and the Japanese Katakana words in the Roman alphabet (i.e., romanization), so that their phonetic similarity can easily be computed. However, we use different romanization methods for Japanese and Korean.</Paragraph> <Paragraph position="3"> CompuTerm 2004 Poster Session - 3rd International Workshop on Computational Terminology 71 Third, we extract candidates of foreign words from the romanized Korean corpus. An alternative method is to first perform morphological analysis on the corpus, extract candidate words based on morphemes and parts-of-speech, and romanize the extracted words.
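The Katakana collection in the first step can be sketched as follows. This is a minimal sketch in Python; it identifies Katakana via the corresponding Unicode block rather than the EUC-JP/SJIS byte ranges mentioned above, which is equivalent for this purpose.

```python
import re

# U+30A1-U+30FA are the Katakana letters; U+30FC is the long-vowel mark.
# A "word" here is simply a maximal run of such characters.
KATAKANA_RUN = re.compile(r"[\u30A1-\u30FA\u30FC]+")

def collect_katakana(text: str) -> list[str]:
    """Return maximal runs of Katakana characters found in `text`."""
    return KATAKANA_RUN.findall(text)
```

For example, `collect_katakana("コンピュータと辞書とチーズ")` yields `["コンピュータ", "チーズ"]`, skipping the Hiragana and Kanji in between.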
Our general model does not constrain which method should be used in the third step. However, because the accuracy of morphological analysis often decreases for the new words we aim to extract, we experimentally adopt the former method.</Paragraph> <Paragraph position="4"> Finally, we compute the phonetic similarity between each combination of the romanized Hangul and Katakana words, and select the combinations whose score is above a predefined threshold. As a result, we obtain a Japanese-Korean bilingual lexicon consisting of foreign words.</Paragraph> <Paragraph position="5"> It may be argued that English lexicons or corpora could be used as the source information instead of Japanese resources. However, because not all English words have been imported into Korean, the extraction accuracy would decrease due to extraneous words.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Romanizing Japanese </SectionTitle> <Paragraph position="0"> Because the number of phones represented by Japanese Katakana characters is limited, we manually produced the correspondence between each phone and its Roman representation. The numbers of Katakana characters and combined phones are 73 and 109, respectively. We also defined a symbol to represent a long vowel. In Japanese, the Hepburn and Kunrei systems are commonly used for romanization purposes. We use the Hepburn system, because its representation is closer to the Korean romanization than that of the Kunrei system.</Paragraph> <Paragraph position="1"> However, specific Japanese phones, such as /ti/, do not exist in Korean.
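A toy fragment of such a table can illustrate the romanization and the Korean-oriented adaptation. The entries and function names below are illustrative only; the paper's full table covers all 109 phones.

```python
# A toy fragment of the Katakana-to-Roman phone table (illustrative only).
# ":" is the symbol defined for a Japanese long vowel.
KATAKANA2ROMAN = {
    "チ": "ti", "ツ": "tu", "ズ": "zu", "ス": "su",
    "テ": "te", "ム": "mu", "ー": ":",
}

def romanize_japanese(word: str) -> str:
    """Romanize a Katakana word character by character with the toy table."""
    return "".join(KATAKANA2ROMAN[ch] for ch in word)

def adapt_to_korean(romanized: str) -> str:
    """Adapt Japanese phones absent from Korean: /ti/ -> /chi/, /tu/ -> /chu/."""
    return romanized.replace("ti", "chi").replace("tu", "chu")
```

With this fragment, `romanize_japanese("チーズ")` gives `ti:zu`, and `adapt_to_korean` then turns it into `chi:zu`, matching the cheese example discussed later in the section.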
Thus, to adapt the Hepburn system to Korean, /ti/ and /tu/ are converted to /chi/ and /chu/, respectively.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Romanizing Korean </SectionTitle> <Paragraph position="0"> The number of Korean Hangul characters is much greater than that of Japanese Katakana characters.</Paragraph> <Paragraph position="1"> Each Hangul character is a combination of two or more component letters, and the pronunciation of each character is determined by its components.</Paragraph> <Paragraph position="2"> In Korean, there are three component types, i.e., the first consonant, the vowel, and the last consonant. The numbers of these components are 19, 21, and 27, respectively. The last consonant is optional. Thus, the number of combined characters is 11,172 (19 x 21 x 28, including the case with no last consonant). However, to transliterate imported words, the official guideline suggests that only seven consonants be used as the last consonant. In EUC-KR, which is a standard coding system for Korean text, the 2,350 common characters are coded independently of their pronunciation. Therefore, if we targeted corpora represented in EUC-KR, each of the 2,350 characters would have to be mapped individually to its Roman representation.</Paragraph> <Paragraph position="3"> We instead use Unicode, in which Hangul characters are sorted according to pronunciation. Figure 2 depicts a fragment of the Unicode table for Korean, in which each line corresponds to a combination of the first consonant and vowel, and each column corresponds to the last consonant.
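Because the Unicode Hangul block is laid out in pronunciation order, the three components of any syllable can be recovered arithmetically. A minimal sketch in Python; the component indices follow the standard Unicode jamo ordering:

```python
# Decompose a precomposed Hangul syllable (U+AC00-U+D7A3) into the indices
# of its first consonant, vowel, and last consonant. The arithmetic mirrors
# the table layout: 21 vowels per first consonant, 28 columns per vowel
# (column 0 = no last consonant).
HANGUL_BASE = 0xAC00
NUM_VOWELS, NUM_FINALS = 21, 28

def decompose(syllable: str) -> tuple[int, int, int]:
    code = ord(syllable) - HANGUL_BASE
    first = code // (NUM_VOWELS * NUM_FINALS)
    vowel = (code % (NUM_VOWELS * NUM_FINALS)) // NUM_FINALS
    last = code % NUM_FINALS
    return first, vowel, last
```

For example, the first syllable in the block, 가 (U+AC00), decomposes to indices (0, 0, 0), and 한 decomposes to (18, 0, 4), i.e., first consonant ㅎ, vowel ㅏ, last consonant ㄴ.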
The number of columns is 28, i.e., the number of last consonants plus the case in which no last consonant is used.</Paragraph> <Paragraph position="4"> From this figure, the following rules can be found: the first consonant changes every 21 lines, which corresponds to the number of vowels; the vowel changes every line (i.e., every 28 characters) and repeats every 21 lines; and the last consonant changes every column.</Paragraph> <Paragraph position="5"> Based on these rules, each character and its pronunciation can be identified by the three component types. Thus, we manually corresponded only the 68 components to their Roman representations, instead of all 11,172 Korean Hangul characters.</Paragraph> <Paragraph position="6"> We use the official romanization system for Korean, but specific Korean phones are adapted to Japanese. For example, /j/ and /l/ are converted to /z/ and /r/, respectively.</Paragraph> <Paragraph position="7"> It should be noted that the adaptation is not invertible and is thus needed in both the J-to-K and K-to-J directions.</Paragraph> <Paragraph position="8"> For example, the English word &quot;cheese&quot;, which has been imported into both Korean and Japanese as a foreign word, is romanized as /chiseu/ in Korean and /ti:zu/ in Japanese. Here, /:/ is the symbol representing a Japanese long vowel. Using the adaptation, these expressions are converted to /chizu/ and /chi:zu/, respectively, which look more similar to each other than the original strings do.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Extracting term candidates from Korean corpora </SectionTitle> <Paragraph position="0"> To extract candidates of foreign words from a Korean corpus, we first extract phrases.
This can be performed systematically, because Korean sentences are segmented on a phrase-by-phrase basis.</Paragraph> <Paragraph position="1"> Second, because foreign words are usually nouns, we use hand-crafted rules to remove post-position suffixes (e.g., Josa) and extract nouns from the phrases.</Paragraph> <Paragraph position="2"> Third, we discard nouns that include last consonants not recommended for transliteration purposes in the official guideline. Although the guideline suggests other rules for transliteration, existing foreign words in Korean do not necessarily obey these rules.</Paragraph> <Paragraph position="3"> Finally, we consult a dictionary to discard existing Korean words, because our purpose is to extract new words. For this purpose, we experimentally use the dictionary for the SuperMorph-K morphological analyzer.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 Computing Similarity </SectionTitle> <Paragraph position="0"> Given romanized Japanese and Korean words, we compute the similarity between the two strings and select the pairs whose score is above a threshold as translations. We use a DP (dynamic programming) matching method to identify the number of differences (i.e., insertions, deletions, and substitutions) between two strings on an alphabet-by-alphabet basis.</Paragraph> <Paragraph position="1"> In principle, the fewer differences two strings have, the greater the similarity between them. For this purpose, a Dice-style coefficient can be used.</Paragraph> <Paragraph position="2"> However, while the use of consonants in transliteration is usually consistent across languages, the use of vowels can vary significantly depending on the language. For example, the English word &quot;system&quot; is romanized as /sisutemu/ and /siseutem/ in Japanese and Korean, respectively.
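One way to realize such consonant-sensitive DP matching is to weight each consonant edit α times as heavily as a vowel edit and normalize by the total weight of both strings. The sketch below assumes a normalized score of the form 1 - (α·dc + dv)/(α·c + v) with α = 2, consistent with the description below; the function names, the vowel set, and the exact normalization are ours, and the long-vowel symbol /:/ would need dedicated handling that is omitted here.

```python
VOWELS = set("aeiou")

def _w(ch: str, alpha: int) -> int:
    # A consonant difference is penalized alpha times as much as a vowel one.
    return 1 if ch in VOWELS else alpha

def weighted_edit_distance(s: str, t: str, alpha: int = 2) -> int:
    """DP matching: minimal weighted count of insertions, deletions,
    and substitutions between two romanized strings, letter by letter."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = d[i - 1][0] + _w(s[i - 1], alpha)
    for j in range(1, len(t) + 1):
        d[0][j] = d[0][j - 1] + _w(t[j - 1], alpha)
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = 0 if s[i - 1] == t[j - 1] else max(_w(s[i - 1], alpha),
                                                     _w(t[j - 1], alpha))
            d[i][j] = min(d[i - 1][j] + _w(s[i - 1], alpha),
                          d[i][j - 1] + _w(t[j - 1], alpha),
                          d[i - 1][j - 1] + sub)
    return d[len(s)][len(t)]

def similarity(s: str, t: str, alpha: int = 2) -> float:
    """Normalized score in [0, 1]; 1.0 for identical strings."""
    total = sum(_w(ch, alpha) for ch in s + t)
    return 1.0 - weighted_edit_distance(s, t, alpha) / total
```

Under these assumptions, `similarity("sisutemu", "siseutem")` is about 0.92, since the two romanizations of "system" differ only by one inserted and one deleted vowel.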
Thus, differences in consonants between two strings should be penalized more heavily than differences in vowels.</Paragraph> <Paragraph position="3"> In view of the above discussion, we compute the similarity between two romanized words by Equation (1).</Paragraph> <Paragraph position="4"> similarity = 1 - (α·dc + dv) / (α·c + v) (1) </Paragraph> <Paragraph position="5"> Here, dc and dv denote the numbers of differences in consonants and vowels, respectively, and α is a parametric constant used to control the importance of the consonants. We experimentally set α = 2. In addition, c and v denote the numbers of all consonants and vowels in the two strings. The similarity ranges from 0 to 1.</Paragraph> </Section> </Section> </Paper>