<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1083"> <Title>Extracting loanwords from Mongolian corpora and producing a Japanese-Mongolian bilingual dictionary</Title> <Section position="6" start_page="661" end_page="662" type="evalu"> <SectionTitle> 4 Experiments </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="661" end_page="661" type="sub_section"> <SectionTitle> 4.1 Method </SectionTitle> <Paragraph position="0"> We collected 1,118 technical reports published in Mongolian from the &quot;Mongolian IT Park&quot; (http://www.itpark.mn/, May 2006) and used them as a Mongolian corpus. The numbers of phrase types and phrase tokens in our corpus were 110,458 and 263,512, respectively.</Paragraph> <Paragraph position="1"> We collected 111,116 Katakana words from multiple Japanese dictionaries, most of which were technical term dictionaries.</Paragraph> <Paragraph position="2"> We evaluated our method from four perspectives: &quot;stemming&quot;, &quot;loanword extraction&quot;, &quot;translation extraction&quot;, and &quot;computational cost.&quot; We discuss these in Sections 4.2-4.5, respectively.</Paragraph> </Section> <Section position="2" start_page="661" end_page="661" type="sub_section"> <SectionTitle> 4.2 Evaluating stemming </SectionTitle> <Paragraph position="0"> We randomly selected 50 Mongolian technical reports from our corpus, and used them to evaluate the accuracy of our stemming method. These reports covered medical science (17), geology (10), light industry (14), agriculture (6), and sociology (3). In these 50 reports, the numbers of phrase types including conventional Mongolian nouns and loanword nouns were 961 and 206, respectively. 
We also found six phrases including loanword verbs, which were not used in the evaluation.</Paragraph> <Paragraph position="1"> Table 2 shows the results of our stemming experiment, in which the accuracy for conventional Mongolian nouns was 98.7% and the accuracy for loanwords was 94.6%. These accuracies suggest that our stemming method is practical and can also be used for morphological analysis of Mongolian corpora.</Paragraph> <Paragraph position="2"> We analyzed the failures and found that the suffixes were incorrectly segmented for 12 conventional nouns and 11 loanwords.</Paragraph> </Section> <Section position="3" start_page="661" end_page="662" type="sub_section"> <SectionTitle> 4.3 Evaluating loanword extraction </SectionTitle> <Paragraph position="0"> We applied our stemming method to our corpus and selected the 1,300 most frequent words. We used these words to evaluate the accuracy of our loanword extraction method. Of these 1,300 words, 165 were loanwords. We varied the similarity threshold and investigated the relationship between precision and recall. Recall is the ratio of the number of correct loanwords extracted by our method to the total number of correct loanwords. Precision is the ratio of the number of correct loanwords extracted by our method to the total number of words extracted by our method. We extracted loanwords using rules (a)-(g) defined in Section 3.4. 
As a result, 139 words were extracted.</Paragraph> <Paragraph position="1"> Table 3 shows the precision and recall of each rule.</Paragraph> <Paragraph position="2"> Both precision and recall were high for &quot;All rules&quot;, which combines the words extracted independently by rules (a)-(g).</Paragraph> <Paragraph position="3"> We also extracted loanwords using phonetic similarity, as discussed in Sections 3.6 and 3.7.</Paragraph> <Paragraph position="4"> We used the N-gram retrieval method to obtain up to the top 500 Katakana words that were similar to each candidate loanword. Then, we selected up to the top five pairs of a loanword and a Katakana word whose similarity computed using Equation (1) was greater than 0.6. Table 4 shows the results of our similarity-based extraction.</Paragraph> <Paragraph position="5"> Both the precision and the recall for the similarity-based loanword extraction were lower than those for the &quot;All rules&quot; data listed in Table 3. We also evaluated the effectiveness of a combination of the N-gram and DP matching methods. We performed similarity-based extraction after rule-based extraction. Table 5 shows the results, in which the &quot;Rule&quot; data are identical to the &quot;All rules&quot; data listed in Table 3. However, the &quot;Similarity&quot; data are not identical to those listed in Table 4, because we performed similarity-based extraction using only the words that were not extracted by rule-based extraction.</Paragraph> <Paragraph position="6"> When we combined the rule-based and similarity-based methods, the recall improved from 84.2% to 91.5%. 
High recall is desirable when a human expert subsequently modifies or verifies the resulting dictionary.</Paragraph> <Paragraph position="7"> Figure 5 shows examples of extracted loanwords in Mongolian and their English glosses.</Paragraph> </Section> <Section position="4" start_page="662" end_page="662" type="sub_section"> <SectionTitle> 4.4 Evaluating translation extraction </SectionTitle> <Paragraph position="0"> In the row &quot;Both&quot; shown in Table 5, 151 loanwords were extracted, for each of which we selected up to the top five Katakana words whose similarity computed using Equation (1) was greater than 0.6 as translations. As a result, Japanese translations were extracted for 109 loanwords. Table 6 shows the results, in which the precision and recall of extracting Japanese-Mongolian translations were 56.2% and 72.2%, respectively.</Paragraph> <Paragraph position="1"> We analyzed the data and identified the reasons for the failures. For five loanwords, the N-gram retrieval failed to retrieve similar Katakana words. For three loanwords, the phonetic similarity computed using Equation (1) was not high enough for a correct translation. For 27 loanwords, no Japanese translation exists in the first place. For seven loanwords, the Japanese translations existed, but were not included in our Katakana dictionary.</Paragraph> <Paragraph position="2"> Figure 6 shows the Japanese translations extracted for the loanwords shown in Figure 5.</Paragraph> </Section> <Section position="5" start_page="662" end_page="662" type="sub_section"> <SectionTitle> 4.5 Evaluating computational cost </SectionTitle> <Paragraph position="0"> We randomly selected 100 loanwords from our corpus, and used them to evaluate the computational cost of the different extraction methods. We compared the computation time and the accuracy of the &quot;N-gram&quot;, &quot;DP matching&quot;, and &quot;N-gram + DP matching&quot; methods. 
All experiments were performed on the same PC (CPU: dual Pentium III 1 GHz; memory: 2 GB).</Paragraph> <Paragraph position="1"> Table 7 shows the improvement in computation time of &quot;N-gram + DP matching&quot; over &quot;DP matching&quot;, and the average rank of the correct translations for &quot;N-gram&quot;. We improved the efficiency while maintaining the sorting accuracy of the translations.</Paragraph> </Section> </Section> </Paper>