<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1083"> <Title>Extracting loanwords from Mongolian corpora and producing a Japanese-Mongolian bilingual dictionary</Title> <Section position="5" start_page="657" end_page="661" type="metho"> <SectionTitle> 3 Methodology </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="657" end_page="658" type="sub_section"> <SectionTitle> 3.1 Overview </SectionTitle> <Paragraph position="0"> In view of the discussion outlined in Section 2, we enhanced the method proposed by Fujii et al. (2004) for our purpose. Figure 1 shows the method that we used to extract loanwords from a Mongolian corpus and to produce a Japanese-Mongolian bilingual dictionary. Although the basis of our method is similar to that used by Fujii et al. (2004), &quot;Stemming&quot;, &quot;Extracting loanwords based on rules&quot;, and &quot;N-gram retrieval&quot; are introduced in this paper. First, we perform stemming on a Mongolian corpus to segment phrases into a content word and one or more suffixes.</Paragraph> <Paragraph position="1"> Second, we discard segmented content words if they are in an existing dictionary, and extract the remaining words as candidate loanwords.</Paragraph> <Paragraph position="2"> Third, we use our own handcrafted rules to extract loanwords from the candidate loanwords. While the rule-based method can extract loanwords with a high accuracy, a number of loanwords cannot be extracted using predefined rules.</Paragraph> <Paragraph position="3"> Fourth, as performed by Fujii et al. (2004), we use a Japanese Katakana dictionary and extract a candidate loanword that is phonetically similar to a Katakana word as a loanword. We romanize the candidate loanwords that were not extracted using the rules. We also romanize all words in the Katakana dictionary.</Paragraph> <Paragraph position="4"> However, unlike Fujii et al. 
(2004), we use N-gram retrieval to limit the number of Katakana words that are compared with each candidate loanword. Then, we compute the phonetic similarity between each candidate loanword and each retrieved Katakana word using DP matching, and select the pairs whose score is above a predefined threshold. As a result, we can extract loanwords in Mongolian and their translations in Japanese simultaneously.</Paragraph> <Paragraph position="5"> Finally, to identify Japanese translations for the loanwords extracted using the rules in the third step above, we perform N-gram retrieval and DP matching.</Paragraph> <Paragraph position="6"> We elaborate on each step in Sections 3.2-3.7.</Paragraph> </Section> <Section position="2" start_page="658" end_page="660" type="sub_section"> <SectionTitle> 3.2 Stemming </SectionTitle> <Paragraph position="0"> A phrase in Mongolian consists of a content word and one or more suffixes. A content word can potentially be inflected in a phrase. Figure 2 shows the inflection types of content words in phrases. [Figure 2: Inflection types of content words in phrases. (b) Vowel elimination: azhil + aas + aa → azhlaasaa (work + ablative case + reflexive). (c) Vowel insertion: akh + d → akhad (brother + dative case). (d) Consonant insertion: baishin + iin → baishinghiin (building + genitive case). (e) The letter &quot;'&quot; is converted to &quot;i&quot;, and the vowel is eliminated: surghuul' + aas → surghuulias (school + ablative case).]</Paragraph> <Paragraph position="1"> In phrase (a), there is no inflection in the content word &quot;nom (book)&quot; concatenated with the suffix &quot;yn (genitive case)&quot;.</Paragraph> <Paragraph position="2"> However, in phrases (b)-(e) in Figure 2, the content words are inflected. Loanwords are also inflected in all of these types, except for phrase (b). Thus, we have to identify the original form of a content word using stemming. While most loanwords are nouns, a number of loanwords can also be verbs. 
In this paper, we propose a stemming method for nouns, which is shown in Figure 3 and explained below.</Paragraph> <Paragraph position="3"> First, we consult a &quot;Suffix dictionary&quot; and perform backward partial matching to determine whether or not one or more suffixes are concatenated at the end of a target phrase.</Paragraph> <Paragraph position="4"> Second, if a suffix is detected, we use a &quot;Suffix segmentation rule&quot; to segment the suffix and extract the noun. Third, we investigate whether or not the vowel elimination in phrase (b) in Figure 2 occurred in the extracted noun. Because vowel elimination affects only the last vowel of a noun, we check the last two characters of the extracted noun. If both characters are consonants, the eliminated vowel is inserted using a &quot;Vowel insertion rule&quot; and the noun is converted into its original form.</Paragraph> <Paragraph position="5"> Existing Mongolian stemming methods (Ehara et al., 2004; Sanduijav et al., 2005) use noun dictionaries. Because we intend to extract loanwords that are not in existing dictionaries, these methods cannot be used. In addition, noun dictionaries have to be updated as new words are created.</Paragraph> <Paragraph position="6"> Our stemming method does not require a noun dictionary. Instead, we manually produced a suffix dictionary, a suffix segmentation rule, and a vowel insertion rule. Once these resources are produced, almost no further compilation is required. The suffix dictionary consists of 37 suffixes that can concatenate with nouns. These suffixes are postpositional particles. Table 1 shows the dictionary entries, in which the inflected forms of the postpositional particles are shown in parentheses. The suffix segmentation rule consists of 173 rules. We show examples of these rules in Figure 4. 
Even if suffixes are identical in their phrases, the segmentation rules can be different, depending on the counterpart noun. [Table 1: Entries of the suffix dictionary, with inflected forms in parentheses, one group per row: n, y, yn, ny, ii, iin, nii; ygh, iigh, gh; d, t; aas (ias), oos (ios), ees, öös; aar (iar), oor (ior), eer, öör; tai, toi, tei; aa (ia), oo (io), ee, öö; uud (iud), üüd (iüd).]</Paragraph> <Paragraph position="7"> In Figure 4, the suffix &quot;iin&quot; matches both noun phrases (a) and (b) by backward partial matching. However, each phrase is segmented by a different rule independently. The underlined suffixes are segmented in each phrase, respectively. In phrase (a), there is no inflection, and the suffix is easily segmented. However, in phrase (b), a consonant insertion has occurred. Thus, both the inserted consonant, &quot;gh&quot;, and the suffix have to be removed. The vowel insertion rule consists of 12 rules. To insert an eliminated vowel and extract the original form of the noun, we check the last two characters of a target noun. If both of these are consonants, we determine that a vowel was eliminated.</Paragraph> <Paragraph position="8"> However, a number of nouns inherently end with two consonants, and therefore, we referred to a textbook on Mongolian grammar (Bayarmaa, 2002) to produce 12 rules that determine when to insert a vowel between two consecutive consonants.</Paragraph> <Paragraph position="9"> For example, if any of &quot;m&quot;, &quot;gh&quot;, &quot;l&quot;, &quot;b&quot;, &quot;v&quot;, or &quot;r&quot; is at the end of a noun, a vowel is inserted. 
However, if any of &quot;ts&quot;, &quot;zh&quot;, &quot;z&quot;, &quot;s&quot;, &quot;d&quot;, &quot;t&quot;, &quot;sh&quot;, &quot;ch&quot;, or &quot;kh&quot; is the second-to-last consonant in a noun, a vowel is not inserted.</Paragraph> <Paragraph position="10"> The Mongolian vowel harmony rule is a phonological rule in which female vowels and male vowels are prohibited from occurring together in a single word (with the exception of proper nouns). We used this rule to determine which vowel should be inserted. The appropriate vowel is determined by the first vowel of the first syllable in the target noun. For example, if there are &quot;a&quot; and &quot;u&quot; in the first syllable, the vowel &quot;a&quot; is inserted between the last two consonants.</Paragraph> </Section> <Section position="3" start_page="660" end_page="660" type="sub_section"> <SectionTitle> 3.3 Extracting candidate loanwords </SectionTitle> <Paragraph position="0"> After collecting nouns using our stemming method, we discard the conventional Mongolian nouns, that is, the nouns defined in a noun dictionary (Sanduijav et al., 2005), which includes 1,926 nouns. We also discard proper nouns and abbreviations. The first characters of proper nouns, such as &quot;Erdenebat&quot;, and all the characters of abbreviations, such as &quot;TsShNI (Nuclear research centre)&quot;, are written using capital letters in Mongolian. Thus, we discard words that are written using capital letters, except those occurring at the beginning of sentences. In addition, because &quot;ö&quot; and &quot;ü&quot; are not used to spell out words from Western languages, words including those characters are also discarded.</Paragraph> </Section> <Section position="4" start_page="660" end_page="660" type="sub_section"> <SectionTitle> 3.4 Extracting loanwords based on rules </SectionTitle> <Paragraph position="0"> We manually produced seven rules to identify loanwords in Mongolian. 
Words that match one of the following rules are extracted as loanwords.</Paragraph> <Paragraph position="1"> (a) A word including the consonants &quot;k&quot;, &quot;p&quot;, &quot;f&quot;, or &quot;shch&quot;.</Paragraph> <Paragraph position="2"> These consonants are usually used to spell out foreign words.</Paragraph> <Paragraph position="3"> (b) A word that violates the Mongolian vowel harmony rule.</Paragraph> <Paragraph position="4"> A word that includes both female and male vowels does not conform to the Mongolian phonetic system, and is probably a loanword.</Paragraph> <Paragraph position="5"> (c) A word beginning with two consonants.</Paragraph> <Paragraph position="6"> A conventional Mongolian word does not begin with two consonants.</Paragraph> <Paragraph position="7"> (d) A word ending with two particular consonants. A word whose penultimate character is any of &quot;p&quot;, &quot;b&quot;, &quot;t&quot;, &quot;ts&quot;, &quot;ch&quot;, &quot;z&quot;, or &quot;sh&quot; and whose last character is a consonant violates Mongolian grammar, and is probably a loanword.</Paragraph> <Paragraph position="8"> (e) A word beginning with the consonant &quot;v&quot;. In a modern Mongolian dictionary (Ozawa, 2000), there are 54 words beginning with &quot;v&quot;, of which 31 are loanwords. Therefore, a word beginning with &quot;v&quot; is probably a loanword. (f) A word beginning with the consonant &quot;r&quot;. In a modern Mongolian dictionary (Ozawa, 2000), there are 49 words beginning with &quot;r&quot;, of which only four are conventional Mongolian words. Therefore, a word beginning with &quot;r&quot; is probably a loanword.</Paragraph> <Paragraph position="9"> (g) A word ending with &quot;&lt;consonant&gt; + i&quot;. 
We discovered this rule empirically.</Paragraph> </Section> <Section position="5" start_page="660" end_page="660" type="sub_section"> <SectionTitle> 3.5 Romanization </SectionTitle> <Paragraph position="0"> We manually aligned each letter of the Mongolian Cyrillic alphabet to its Roman representation.</Paragraph> <Paragraph position="2"> In Japanese, the Hepburn and Kunrei systems are commonly used for romanization purposes. We used the Hepburn system, because its representation is closer than that of the Kunrei system to the representation used for Mongolian.</Paragraph> <Paragraph position="3"> However, we adapted 11 Mongolian romanization expressions to the Japanese Hepburn romanization. For example, the sound of the letter &quot;L&quot; does not exist in Japanese, and thus, we converted &quot;L&quot; to &quot;R&quot; in Mongolian.</Paragraph> </Section> <Section position="6" start_page="660" end_page="661" type="sub_section"> <SectionTitle> 3.6 N-gram retrieval </SectionTitle> <Paragraph position="0"> By using a document retrieval method, we efficiently identify Katakana words that are phonetically similar to a candidate loanword. In other words, we use a candidate loanword as a query and each Katakana word as a document. We call this method &quot;N-gram retrieval&quot;.</Paragraph> <Paragraph position="1"> Because the N-gram retrieval method does not consider the order of the characters in a target word, the accuracy of matching two words is low, but the computation is fast. On the other hand, because DP matching considers the order of the characters in a target word, the accuracy of matching two words is high, but the computation is slow. We combined these two methods to achieve a high matching accuracy with a reasonable computation time.</Paragraph> <Paragraph position="2"> First, we extract Katakana words that are phonetically similar to a candidate loanword using N-gram retrieval. 
Second, we compute the similarity between the candidate loanword and each of the retrieved Katakana words using DP matching to improve the accuracy.</Paragraph> <Paragraph position="3"> We romanize all the Katakana words in the dictionary and index them using consecutive N characters. We also romanize each candidate loanword when it is used as a query. We experimentally set N = 2, and use Okapi BM25 (Robertson et al., 1995) as the retrieval model.</Paragraph> </Section> <Section position="7" start_page="661" end_page="661" type="sub_section"> <SectionTitle> 3.7 Computing phonetic similarity </SectionTitle> <Paragraph position="0"> Given the romanized Katakana words and the romanized candidate loanwords, we compute the similarity between the two strings, and select the pairs associated with a score above a predefined threshold as translations. We use DP matching to identify the number of differences (i.e., insertion, deletion, and substitution) between two strings on a letter-by-letter basis.</Paragraph> <Paragraph position="1"> While consonants in transliteration are usually the same across languages, vowels can vary depending on the language. The difference in consonants between two strings should therefore be penalized more than the difference in vowels. We compute the similarity between two romanized words using Equation (1): $$ 1 - \frac{2 \times (\alpha \times d_c + d_v)}{\alpha \times c + v} \quad (1) $$</Paragraph> <Paragraph position="2"> Here, $d_c$ and $d_v$ denote the number of differences in consonants and vowels, respectively, and $\alpha$ is a parametric constant used to control the importance of the consonants. We experimentally set $\alpha = 2$.</Paragraph> <Paragraph position="3"> Additionally, $c$ and $v$ denote the number of all the consonants and vowels in the two strings, respectively. The similarity ranges from 0 to 1.</Paragraph> </Section> </Section> </Paper>
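<!--
The similarity computation of Section 3.7 can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that Equation (1) has the form 1 - 2 * (alpha * dc + dv) / (alpha * c + v), that the DP alignment minimizes the weighted cost alpha * dc + dv, that a substitution counts as a vowel difference only when both letters are vowels, that "a", "e", "i", "o", and "u" are the only vowels, and that negative scores are clamped to 0. The names VOWELS, count_differences, and similarity are hypothetical.

```python
VOWELS = set("aeiou")  # assumption: every other letter counts as a consonant


def is_vowel(ch: str):
    return ch in VOWELS


def count_differences(s: str, t: str, alpha: float = 2.0):
    """Return (dc, dv) for a minimum weighted-cost alignment of s and t."""

    def step(prev, vowel_diff):
        # Add one difference to a (cost, dc, dv) state.
        cost, dc, dv = prev
        if vowel_diff:
            return (cost + 1.0, dc, dv + 1)
        return (cost + alpha, dc + 1, dv)

    # dp[i][j] = (weighted cost, dc, dv) for aligning s[:i] with t[:j]
    dp = [[(0.0, 0, 0)] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        dp[i][0] = step(dp[i - 1][0], is_vowel(s[i - 1]))
    for j in range(1, len(t) + 1):
        dp[0][j] = step(dp[0][j - 1], is_vowel(t[j - 1]))

    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            a, b = s[i - 1], t[j - 1]
            if a == b:
                match = dp[i - 1][j - 1]
            else:
                # Assumption: a mixed vowel/consonant substitution is
                # charged as a consonant difference.
                match = step(dp[i - 1][j - 1], is_vowel(a) and is_vowel(b))
            delete = step(dp[i - 1][j], is_vowel(a))
            insert = step(dp[i][j - 1], is_vowel(b))
            # Tuples compare by cost first, so min() picks the cheapest path.
            dp[i][j] = min(match, delete, insert)

    _, dc, dv = dp[len(s)][len(t)]
    return dc, dv


def similarity(s: str, t: str, alpha: float = 2.0):
    """Score two romanized words with the assumed form of Equation (1)."""
    c = sum(1 for ch in s + t if not is_vowel(ch))  # consonants in both strings
    v = len(s) + len(t) - c                         # vowels in both strings
    denom = alpha * c + v
    if denom == 0:
        return 1.0  # two empty strings are treated as identical
    dc, dv = count_differences(s, t, alpha)
    # Clamp at 0 because the substitution assumption can push the raw score below 0.
    return max(0.0, 1.0 - 2.0 * (alpha * dc + dv) / denom)
```

With alpha = 2, similarity("rajio", "radio") is about 0.71, whereas similarity("radeo", "radio") is about 0.86, so a consonant mismatch lowers the score more than a vowel mismatch does. A pair would then be kept as a translation when its score exceeds the chosen threshold.
-->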