<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-2204">
  <Title>Automatic Construction of a Transfer Dictionary Considering Directionality</Title>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
2 Conventional Methods and
Problems
</SectionTitle>
    <Paragraph position="0"> The basic method of generating a bilingual dictionary through an intermediate language was proposed by Tanaka and Umemura (1994). They automatically constructed a Japanese-French dictionary with English as the intermediate language and manually checked the extracted results, so their method is not completely automatic. They looked up the English translations of each Japanese word and then the French translations of those English translations. Then, for each French word, they looked up all of its English translations and counted the number of English translations shared with the Japanese word (one-time inverse consultation). This was extended to "two-time inverse consultation": they looked up all the Japanese translations of all the English translations of a given French word and counted how many times the Japanese word appears. They reported that "comparing the generated dictionary with published dictionaries showed that data obtained are useful for revising and supplementing the vocabulary of existing dictionaries." Their work established the basic method of building a dictionary using English as an intermediate language. We apply and extend their method to automatic dictionary building, paying particular attention to the directionality of dictionaries.</Paragraph>
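    <Paragraph> The one-time inverse consultation described above can be sketched as follows. This is a minimal illustration, not the implementation of Tanaka and Umemura (1994): the function name, the dictionary layout (plain mappings from a headword to a list of translations) and the romanized toy entries are all assumptions made for the example.

```python
from collections import Counter


def one_time_inverse_consultation(word, ja_to_en, en_to_fr, fr_to_en):
    """Score French candidates by the English translations they share with `word`."""
    # Step 1: English translations of the Japanese word.
    english = set(ja_to_en.get(word, ()))
    # Step 2: French translations of those English words are the candidates.
    candidates = {fr for en in english for fr in en_to_fr.get(en, ())}
    # Step 3: look each candidate up in the inverse direction and count overlap.
    scores = Counter()
    for fr in candidates:
        back_translations = set(fr_to_en.get(fr, ()))
        scores[fr] = len(english.intersection(back_translations))
    return scores


# Toy dictionaries, invented for illustration only.
JA_TO_EN = {"inu": ["dog", "hound"]}
EN_TO_FR = {"dog": ["chien"], "hound": ["chien", "limier"]}
FR_TO_EN = {"chien": ["dog", "hound"], "limier": ["hound", "sleuth"]}

scores = one_time_inverse_consultation("inu", JA_TO_EN, EN_TO_FR, FR_TO_EN)
print(scores.most_common())  # [('chien', 2), ('limier', 1)]
```

    Candidates that share more English translations with the source word rank higher; the manual check reported in the paper would then be applied to this ranking.</Paragraph>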
    <Paragraph position="1"> Tanaka and Umemura (1994) used four dictionaries in two directions (J→E, E→J, F→E and E→F). They first harmonized the dictionaries by combining the J→E and E→J dictionaries into a single J↔E dictionary and the F→E and E→F dictionaries into a harmonized F↔E dictionary. We followed their basic method without harmonizing the dictionaries, in order to emphasize the influence of directionality. In general, the foreign word entries in a bilingual dictionary attempt to cover the entire vocabulary of the foreign language. However, foreign words that have no counterpart in one's mother tongue are not recorded in a bilingual dictionary from one's mother tongue to the foreign language (Hartmann, 1983). A long explanatory phrase is replaced with a word that often does not perfectly correspond to the original.</Paragraph>
    <Paragraph position="2"> On the other hand, most of the index words in a dictionary from a foreign language to a mother tongue include many expository definitions or explanations that focus on usage. Syntactic information such as POS and number, as well as example sentences, is richer than in a dictionary from the mother tongue to a foreign language. These characteristics should be considered when building a dictionary automatically.</Paragraph>
    <Paragraph position="3"> Bond et al. (2001) showed how semantic classes can be used along with an intermediate language to create a Japanese-to-Malay dictionary. In addition to using English to link pairs, they used semantic classes to rank translation equivalents so that word pairs with compatible semantic classes are chosen automatically. However, this method cannot be used for languages with poor language resources, in this case a semantic ontology. Paik et al. (2001) improved the method to generate a Korean-to-Japanese (henceforth K→J) dictionary using a multi-pivot criterion. They showed that it is useful to build dictionaries using appropriate multi-pivots. In this case, English is the intermediate language, and the Chinese characters shared between Korean and Japanese are used as pivots.</Paragraph>
    <Paragraph position="4"> However, none of the above methods considered the directionality of the dictionaries in their experiments. We ran three experiments to emphasize the effects of directionality (see footnote 1). There are many approaches to building a dictionary.</Paragraph>
    <Paragraph position="5"> Our focus, however, is on the generality of building a dictionary for any language pair automatically using English as a pivot. In addition, we want to confirm the effects of the various directionalities between a mother tongue and a foreign language.</Paragraph>
    <Paragraph position="6"> 1 The first two experiments were reported in Shirai and Yamamoto (2001) and Shirai et al. (2001). We present new evaluations in this paper.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Proposed Method
</SectionTitle>
    <Paragraph position="0"> We introduce three ways of constructing a K→J dictionary. First, we construct a K→J dictionary using a K→E dictionary and a J→E dictionary.</Paragraph>
    <Paragraph position="1"> Second, we construct a K→J dictionary using a K→E dictionary and an E→J dictionary. Third, we use a novel approach that builds a K→J dictionary from an E→K dictionary and an E→J dictionary. Our method is not limited to building a K→J dictionary: it can be extended to any other language pair so long as X-to-English or English-to-X dictionaries exist. These three methods can cope with building dictionaries from any such combination.</Paragraph>
    <Paragraph position="2"> We assume that the following conditions hold when building a bilingual dictionary: (1) neither the source language nor the target language is understood by the builder (the goal is to build dictionaries for unknown language pairs); (2) various lexical information about the intermediate language (English) is accessible; (3) limited information about the source and target languages may be accessible.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.1 Lexical Resources
</SectionTitle>
      <Paragraph position="0"> Our method can be extended to any other language pair if X-to-English and English-to-X dictionaries exist. There are four possible combinations for building an X-to-Y dictionary: i) X-to-English and Y-to-English, ii) X-to-English and English-to-Y, iii) English-to-X and Y-to-English, and iv) English-to-X and English-to-Y. We tested i), ii) and iv) in this paper, and we used the following dictionaries in our experiments:</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.2 Linking K→E and J→E
</SectionTitle>
      <Paragraph position="0"> Our method uses the one-time inverse consultation of Tanaka and Umemura (1994) (see Section 2) to judge the word correspondences between Korean and Japanese.</Paragraph>
      <Paragraph position="1"> Lexical Resources: We used a K→E dictionary (50,826 entries) and a J→E dictionary (28,310 entries). There is a big difference in the number of entries between the two dictionaries. (Footnotes 2 to 4: 2 Yamagishi et al., 1997; 3 Yamagishi and Gunji, 1991; 4 http://kr.engdic.yahoo.com)</Paragraph>
      <Paragraph position="2"> This will affect the total number of extracted words.</Paragraph>
      <Paragraph position="3"> Evaluation: The similarity score S1 for a Japanese word j and a Korean word k is given in Equation (1), where E(w) is the set of English translations of w. This is equivalent to the Dice coefficient, 2|E(j) ∩ E(k)| / (|E(j)| + |E(k)|). The extracted word pairs and their scores are evaluated by a human to keep the accuracy at approximately 90%.</Paragraph>
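      <Paragraph> Since S1 is stated to be equivalent to the Dice coefficient, it can be computed as in the sketch below. The function name and the toy English translation sets are assumptions made for illustration; only the Dice formula itself comes from the text.

```python
def dice_similarity(english_j, english_k):
    """Dice coefficient between the English translation sets of two words."""
    set_j, set_k = set(english_j), set(english_k)
    total = len(set_j) + len(set_k)
    if total == 0:
        # Neither word has any English translation: treat as no similarity.
        return 0.0
    return 2 * len(set_j.intersection(set_k)) / total


# Toy translation sets, invented for illustration only.
full_match = dice_similarity(["school"], ["school"])
partial_match = dice_similarity(["school", "academy"], ["school"])
print(full_match, round(partial_match, 3))  # 1.0 0.667
```

      The score is 1.0 only when both words have exactly the same English translations, which corresponds to the most reliable links.</Paragraph>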
      <Paragraph position="5"> The most successful case is when all the intermediate English words are shared by the K→E and J→E dictionaries. Figure 1 shows how the link is realized, and the similarity scores are shown in Table 1. The similarity score reflects how many English words are shared by the two dictionaries: the higher the score, the higher the possibility of successful linking. However, as Table 1 shows, we have to sort out the inappropriately matched pairs by comparing the S1 score of Equation (1) against a threshold. The threshold allows us to exclude unfavorable results. For example, for words having one shared English translation equivalent, we have to discard group (3) in Table 1. When the words translated from English match completely, the accuracy is high. If the number of shared English translations (|E(j) ∩ E(k)|) is high, the Korean and Japanese words are likely to be matched accurately. However, accuracy deteriorates when the number of shared English translations (shown by the threshold) decreases, as in (2) and (3) of Table 1. We solved this problem by varying the threshold according to the number of shared English equivalents.</Paragraph>
      <Paragraph position="6"> The value of the threshold was determined experimentally to achieve an accuracy rate of 90%.</Paragraph>
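      <Paragraph> Varying the threshold with the number of shared English equivalents could look like the sketch below. The threshold values are hypothetical placeholders, not the experimentally determined values from the paper, and the function and constant names are assumptions.

```python
# Hypothetical thresholds keyed by the number of shared English equivalents.
# The paper tuned its values experimentally for about 90% accuracy; these
# numbers are placeholders for illustration only.
HYPOTHETICAL_THRESHOLDS = {1: 1.0, 2: 0.8}
DEFAULT_THRESHOLD = 0.5


def accept_pair(english_j, english_k, thresholds=None, default=DEFAULT_THRESHOLD):
    """Accept a word pair when its Dice score reaches the threshold for its overlap size."""
    if thresholds is None:
        thresholds = HYPOTHETICAL_THRESHOLDS
    set_j, set_k = set(english_j), set(english_k)
    shared = len(set_j.intersection(set_k))
    if shared == 0:
        return False  # no pivot word links the pair at all
    score = 2 * shared / (len(set_j) + len(set_k))
    return score >= thresholds.get(shared, default)


print(accept_pair(["school"], ["school"]))                        # True
print(accept_pair(["school"], ["school", "academy", "college"]))  # False
```

      With these placeholder values, a pair linked by a single English word is kept only when the two translation sets match completely, while pairs with more shared words may pass with a lower score.</Paragraph>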
      <Paragraph position="7"> Result: Linking through English gives a total of 175,618 Korean-Japanese combinations.</Paragraph>
      <Paragraph position="8"> To make these combinations, 28,479 entries out of 50,826 from the K→E dictionary and 17,687 entries out of 28,310 from the J→E dictionary are used. As a result, we can extract 25,703 estimated good matches with an accuracy of 90%.</Paragraph>
    </Section>
    <Section position="3" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.3 Linking K→E and E→J
</SectionTitle>
      <Paragraph position="0"> Method: We investigated how to improve the extraction rate of equivalent pairs using an overlapping constraint method here.</Paragraph>
      <Paragraph position="1"> To extract Korean-Japanese word pairs, we searched consecutively through a K→E dictionary and then an E→J dictionary. We take the English sets corresponding to Korean words from the K→E dictionary and the Japanese translation sets for each English word from the E→J dictionary. The overlap similarity score S2 for a Japanese word j and a Korean word k is given in Equation (2), where E(w) is the set of English translations of w and J(E) is the bag of Japanese translations of all the words in E.</Paragraph>
      <Paragraph position="3"> After that, we test the narrowing down of translation pairs by extracting the overlapped words in the Japanese translation sets. See Figure 2.</Paragraph>
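      <Paragraph> The overlap step can be sketched as follows. Equation (2) is not reproduced here, so the sketch reports only raw overlap counts (how many English translations of a Korean word lead to the same Japanese word), not the S2 score itself. The function name, the data layout and the romanized toy entries are assumptions made for illustration.

```python
from collections import Counter


def overlap_counts(korean_word, ko_to_en, en_to_ja):
    """Count, for each Japanese candidate, how many English pivots lead to it."""
    bag = Counter()
    for english in ko_to_en.get(korean_word, ()):
        # Each English pivot contributes at most once per Japanese word.
        bag.update(set(en_to_ja.get(english, ())))
    return bag


# Toy dictionaries, invented for illustration only.
KO_TO_EN = {"hakgyo": ["school", "academy"]}
EN_TO_JA = {
    "school": ["gakkou", "ryuuha"],
    "academy": ["gakkou", "gakuin"],
}

bag = overlap_counts("hakgyo", KO_TO_EN, EN_TO_JA)
# Narrow down to the candidates reached through two or more English words.
narrowed = sorted(ja for ja, count in bag.items() if count >= 2)
print(narrowed)  # ['gakkou']
```

      Candidates reached through several English pivots are less likely to be artifacts of a single polysemous English word.</Paragraph>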
      <Paragraph position="4"> Lexical Resources: We used a K→E dictionary (50,826 entries), the same one used in Section 3.2, and an E→J dictionary (52,369 entries). Compared with the resources used in our first method, the numbers of entries are well balanced. Evaluation: After extracting the overlapped words in the Japanese translation sets, the words were evaluated by humans. The main evaluation was to check the correlation between the overlaps and the matches of Korean and Japanese word pairs. Table 3 shows the number of overlapped shared English words and the number of index words of the dictionary according to the overlapped English words. Result: Entries with a 1-to-1 match have</Paragraph>
      <Paragraph position="6"> matches (90%). If two or more overlaps occur, then the matching accuracy is as high as 84.0%. This means that the number of useful entries is the sum of the 1-to-1 matches and the entries with two or more overlaps: 19,007 (37.4% of the K→E entries) with 87% accuracy. However, using K→E and E→J raises the problem of polysemy in English words. For example, clean has two different POSs, adjective and verb, in a K→E dictionary. Unfortunately, this information cannot be used effectively due to the lack of POS in K→E when linking them to an E→K dictionary. On the other hand, clean in the E→J dictionary can be translated as either a Japanese adjective or a Japanese verb. This makes the overlap scores widely distributed, as shown in Figure 2. This is the reason why using K→E and E→J is not as good as using K→E and J→E. We discuss this further in Section 4.</Paragraph>
    </Section>
    <Section position="4" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
3.4 Linking E→K and E→J
</SectionTitle>
      <Paragraph position="0"> As discussed in earlier sections, the characteristics of dictionaries differ according to their directionality. In this section, we introduce a novel method of matching translation equivalents of Korean and Japanese. From the Korean speaker's point of view, the E→K dictionary covers all English words and includes explanatory equivalents and example sentences showing usage. The same is true of the E→J dictionary from a Japanese speaker's point of view. In this respect, we expect the extraction to be less effective than with the other combinations, K→E + J→E and K→E + E→J. On the other hand, we think that there must be other ways to exploit explanatory equivalents and example sentences.</Paragraph>
      <Paragraph position="1"> Method: First, we linked all the Korean and Japanese words that share any English word. Then, we sorted them according to POS to avoid the problem of POS polysemy.</Paragraph>
      <Paragraph position="2"> The left hand side of Figure 3 shows how we link Korean and Japanese pairs.</Paragraph>
      <Paragraph position="3"> Lexical Resources: We used an E→K dictionary (84,758 entries) and an E→J dictionary (52,369 entries). Both dictionaries have many more entries than the ones used in the previous two methods.</Paragraph>
      <Paragraph position="4"> Evaluation: We use similarity score S3 in Equation (3) as a threshold which is used to extract good matches.</Paragraph>
      <Paragraph position="6"> K(W): bag of Korean translations of set W; J(W): bag of Japanese translations of set W; E(w): set of English translations of word w. |K(E)| is the number of Korean translation equivalents, and |J(E)| is the number of Japanese translation equivalents. The sum of these numbers is divided by the number of intermediate English words. This is done to reduce the polysemy problem of English words.</Paragraph>
      <Paragraph position="7"> This is because it is hard to decide which translation is appropriate if an English word has too many translation equivalents in Korean and Japanese. The value of the threshold (S3) is shown in Table 4. We vary the threshold according to N = |E(j) ∩ E(k)| to maximize the number of successful matches experimentally. N represents the number of intermediate English words. For N=1, we only count one-to-one matches, which means that one Korean word and one Japanese word are matched through only one English word. The following is an example of a pair counted as 1-to-1 when N=1: e.g. (Korean word) - autosuggestion (n.) - (Japanese word)</Paragraph>
      <Paragraph position="9"> We lose many matching pairs with this threshold, but the accuracy rate for 1-to-1 matches is very high (96.5%).</Paragraph>
      <Paragraph position="10"> To save the other matches when N=1, further examination is needed. In our experiment, the Korean-Japanese pair linked through lovely is rejected because lovely has two Korean translations and two Japanese translations; the</Paragraph>
      <Paragraph position="12"> We postpone this part to further research.</Paragraph>
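      <Paragraph> The prose description of S3 (the numbers of Korean and Japanese translation equivalents, summed and divided by the number of intermediate English words) can be read as in the sketch below. Equation (3) itself is not reproduced in this text, so this is one interpretation of that description, not the authors' exact formula. The function name and the placeholder tokens standing in for Korean and Japanese words are assumptions.

```python
def s3_score(shared_english, en_to_ko, en_to_ja):
    """(|K(E)| + |J(E)|) / |E| for a set E of intermediate English words."""
    if not shared_english:
        raise ValueError("at least one intermediate English word is required")
    # Bags keep duplicates: a translation reached twice is counted twice.
    korean_bag = [ko for en in shared_english for ko in en_to_ko.get(en, ())]
    japanese_bag = [ja for en in shared_english for ja in en_to_ja.get(en, ())]
    return (len(korean_bag) + len(japanese_bag)) / len(shared_english)


# Placeholder tokens stand in for the actual Korean and Japanese words.
EN_TO_KO = {"autosuggestion": ["ko_a"], "lovely": ["ko_b", "ko_c"]}
EN_TO_JA = {"autosuggestion": ["ja_a"], "lovely": ["ja_b", "ja_c"]}

one_to_one = s3_score(["autosuggestion"], EN_TO_KO, EN_TO_JA)
ambiguous = s3_score(["lovely"], EN_TO_KO, EN_TO_JA)
print(one_to_one, ambiguous)  # 2.0 4.0
```

      Under this reading, a 1-to-1 link through a single English word gives the minimum value of 2.0, while a word such as lovely with two translations on each side gives 4.0 and would be rejected when N=1.</Paragraph>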
      <Paragraph position="13"> Result: Table 4 shows the 21,564 extracted pairs of Korean and Japanese words. On average, 14,712 pairs match with a 68.3% success rate. The numbers in parentheses are estimated. As expected, setting this threshold yields fewer extracted words (10,360, as shown in Table 4). However, the accuracy of the matched word pairs averages 92.7%.</Paragraph>
      <Paragraph position="14"> Comparison: To compare the three methods, we randomly chose 100 Korean words from a K→J dictionary (footnote 6) which could be matched through all three methods. The number of extracted matches was 28 using K→E and J→E, 34 using K→E and E→J, and 13 using E→K and E→J. For the K→E and E→J method, 21 out of the 34 K→J pairs were found only by that method and not by the K→E and J→E method. Among these 21 new K→J word pairs, only one pair is an error (not a good match). One new pair was found by the E→K and E→J method. Therefore, combining all three methods gave 49 (28+20+1) different K→J pairs, a better result than any single method. These results are shown in Table 5. Clearly</Paragraph>
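      <Paragraph> The arithmetic behind the combined figure can be checked with a short sketch. The counts are the ones reported in the comparison above; the function and parameter names are assumptions made for illustration.

```python
def combined_pairs(base_matches, new_pairs, new_errors, extra_new):
    """Distinct good pairs after merging the results of the three methods."""
    return base_matches + (new_pairs - new_errors) + extra_new


total = combined_pairs(
    base_matches=28,  # K-to-E plus J-to-E
    new_pairs=21,     # found only by K-to-E plus E-to-J
    new_errors=1,     # one of those 21 is not a good match
    extra_new=1,      # found only by E-to-K plus E-to-J
)
print(total)  # 49
```

      Out of the 100 sampled Korean words, the combination therefore covers 49, compared with at most 34 for any single method.</Paragraph>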
    </Section>
  </Section>
</Paper>