File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/06/p06-1083_intro.xml
Size: 4,258 bytes
Last Modified: 2025-10-06 14:03:35
<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1083"> <Title>Extracting loanwords from Mongolian corpora and producing a Japanese-Mongolian bilingual dictionary</Title> <Section position="4" start_page="657" end_page="657" type="intro"> <SectionTitle> 2 Related work </SectionTitle> <Paragraph position="0"> To the best of our knowledge, no previous attempt has been made to extract loanwords and their translations for Mongolian. We therefore discuss existing methods that target other languages.</Paragraph> <Paragraph position="1"> In Korean, both loanwords and conventional words are spelled out using the Korean alphabet, called Hangul. Because loanwords cannot be distinguished by script, their automatic extraction is difficult in Korean, as it is in Mongolian. Existing methods for extracting loanwords from Korean corpora (Myaeng and Jeong, 1999; Oh and Choi, 2001) exploit the phonetic differences between conventional Korean words and loanwords. However, these methods require manually tagged training corpora and are therefore expensive to apply. A number of corpus-based methods have been used to extract bilingual lexicons (Fung and McKeown, 1996; Smadja, 1996). These methods use statistics obtained from a parallel or comparable bilingual corpus, and extract word or phrase pairs that are strongly associated with each other. However, these methods cannot be applied to a language pair for which a large parallel or comparable corpus is not available, such as Mongolian and Japanese.</Paragraph> <Paragraph position="2"> Fujii et al. (2004) proposed a method that requires neither tagged corpora nor parallel corpora to extract loanwords and their translations. They used a monolingual corpus in Korean and a dictionary consisting of Japanese Katakana words. They assumed that loanwords in different languages that derive from the same source word are phonetically similar. For example, the English word &quot;system&quot; has been imported into Korean, Mongolian, and Japanese.
In these languages, the romanized words are &quot;siseutem&quot;, &quot;sistem&quot;, and &quot;shisutemu&quot;, respectively.</Paragraph> <Paragraph position="3"> New terms are often imported into multiple languages simultaneously, because the source words are usually influential across cultures. It is therefore plausible that a large number of loanwords in Korean are also loanwords in Japanese. Additionally, Katakana words can be extracted from Japanese corpora with high accuracy. Thus, Fujii et al. (2004) extracted the loanwords in Korean corpora that were phonetically similar to Japanese Katakana words. Because each extracted loanword was matched to a Japanese word during the extraction process, a Japanese-Korean bilingual dictionary was produced in a single framework.</Paragraph> <Paragraph position="4"> However, a number of open questions remain from Fujii et al.'s research. First, their stemming method can only be used for Korean. Second, their accuracy in extracting loanwords was low, and thus an additional extraction method was required. Third, they did not report the accuracy of extracting translations. Finally, because they used Dynamic Programming (DP) matching to compute the phonetic similarities between Korean and Japanese words, the computational cost was prohibitive.</Paragraph> <Paragraph position="5"> In an attempt to extract Chinese-English translations from corpora, Lam et al. (2004) proposed a method similar to that of Fujii et al. (2004).</Paragraph> <Paragraph position="6"> However, they searched the Web for Chinese-English bilingual comparable corpora, and matched named entities in the corpus of each language if they were similar to each other. Thus, Lam et al.'s method cannot be used for a language pair for which comparable corpora do not exist.
In contrast, with Fujii et al.'s (2004) method, the Katakana dictionary and the Korean corpus can be independent of each other.</Paragraph> <Paragraph position="7"> In addition, Lam et al.'s method requires Chinese-English named entity pairs to train the similarity computation. Because the accuracy of extracting named entities was not reported, it is not clear to what extent this method is effective in extracting loanwords from corpora.</Paragraph> </Section> </Paper>