File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/02/c02-1065_evalu.xml
Size: 6,117 bytes
Last Modified: 2025-10-06 13:58:45
<?xml version="1.0" standalone="yes"?> <Paper uid="C02-1065"> <Title>Measuring the Similarity between Compound Nouns in Different Languages Using Non-Parallel Corpora</Title> <Section position="6" start_page="0" end_page="0" type="evalu"> <SectionTitle> 6 Evaluation </SectionTitle> <Paragraph position="0"> To evaluate this method, we conduct an experiment on selecting English translations for Japanese compound nouns. We use two Japanese corpora, Nihon Keizai Shimbun CD-ROM 1994 (NIK, 1.7 million sentences) and Mainichi Shimbun CD-ROM 1995 (MAI, 2.4 million sentences), and two English corpora, The Wall Street Journal 1996 (WSJ, 1.2 million sentences) and Reuters Corpus 1996 (REU, 1.9 million sentences², Reuters (2000)), as sources of contextual information. Two of them, NIK and WSJ, are financial newspapers, and the rest are general newspapers and news archives. All combinations of Japanese and English corpora are examined to reduce the bias of any single combination. ² Only part of the corpus is used because of a file size limitation in the database management system in which the corpora are stored.</Paragraph> <Paragraph position="1"> First, 400 Japanese noun-noun compounds cJ that appear frequently in NIK (more than 15 times) are randomly chosen. Next, the translation candidates TE for each cJ are collected from the English corpus WSJ as described in Section 2. The bilingual dictionary of the MT system ALT-J/E, Goi-Taikei, and a terminological dictionary (containing about 105,000 economic and other terms) are used to connect component words. As a result, translation candidates are collected for 393 Japanese compound nouns, and no candidates can be extracted for the remaining 7. Note that we link component words broadly when collecting the translation candidates, because components in different languages do not always have direct translations but do have similar meanings. 
For instance, consider the economic term 設備投資 setsubi toushi and its translation capital investment: while 投資 toushi means investment, 設備 setsubi, which means equipment or facility, is not a direct translation of capital. Goi-Taikei and the terminological dictionary are employed to link such similar component words.</Paragraph> <Paragraph position="2"> Each Japanese compound has at most 5 candidates (3 on average). We judge the adequacy of the candidates by referring to articles and terminological dictionaries. More than 70% of the Japanese compounds have only one clearly correct candidate and one or more incorrect ones (e.g. securities company and paper company for 証券会社 shouken gaisha). The others have two or more acceptable translations.</Paragraph> <Paragraph position="3"> Moreover, if all of the translation candidates of a compound cJ are correct (45 Japanese compounds) or all are incorrect (86 Japanese compounds), cJ and its translation candidates are removed from the test set. For each cJ in the remainder of the test set (262 Japanese compound nouns, set 1), the translation cE that is judged the most similar to cJ is chosen by measuring the similarity between the compounds. To examine the effect of frequency, set 1 is divided into two groups by the frequency of the Japanese compound: set 1H (more than 100 times) and set 1L (less than 100 times). In addition, because set 1 includes compounds that do not appear frequently in MAI, the subset of set 1 whose members also appear more than 15 times in MAI is extracted (135 Japanese compound nouns, set 2). As an alternative, the candidate that appears most frequently in the English corpus can simply be selected as the best translation of cJ. This simple procedure is the baseline against which the proposed method is compared. 
Table 3 shows the results of selecting the appropriate English translations for the Japanese compounds when each pair of corpora is used.</Paragraph> <Paragraph position="4"> The column &quot;freq(WSJ)&quot; gives the result of choosing the most frequent candidates in WSJ.</Paragraph> <Paragraph position="5"> The methods based on word context reach higher precision in set 1, which suggests that word context vectors (cw) can efficiently describe the context of the target compounds. For almost all sets, context word vector 2 provides higher precision than context word vector 1. However, the effect of considering syntactic dependency is minimal in this experiment.</Paragraph> <Paragraph position="6"> The precision of the word context vectors is low in both MAI-WSJ and MAI-REU. The main reason is that many Japanese compounds in the test set appear less frequently in MAI than in NIK, since compounds that are frequent in NIK were chosen for the set (the average frequency is 417 in NIK but 75 in MAI). Therefore, fewer common co-occurrence words are found between MAI and the English corpora than between NIK and the English corpora.</Paragraph> <Paragraph position="7"> For instance, 25 Japanese compounds share no co-occurrence words with their translation candidates in MAI-WSJ, while only one Japanese compound shares none in NIK-WSJ. In spite of this handicap, the method based on semantic context (ca) achieves high precision for MAI-WSJ and MAI-REU. This result suggests that an abstraction of words can compensate for the lack of word information to a certain extent.</Paragraph> <Paragraph position="8"> The proposed method based on word context (cw) surpasses the baseline method in precision when measuring the similarity between relatively frequent words. Our method can be used for compiling dictionaries or for machine translation.</Paragraph> <Paragraph position="9"> Table 4 shows examples of translation candidates and their similarity scores. The mark &quot;+&quot; indicates correct translations. 
Some hyponyms and hypernyms, or antonyms, cannot be distinguished by this method, because these words often have similar co-occurring words. As Table 5 shows for the example of power company and energy company, the co-occurring words are very similar, so their context vectors cannot help to discriminate between these words. This problem cannot be resolved by this method alone. However, there is still room for improvement by combining it with other information, e.g., the similarity between components.</Paragraph> </Section> </Paper>
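The test-set filtering and the frequency baseline described in the evaluation can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Candidate` record, the function names, the placeholder compound names, and every frequency count are hypothetical. Precision is assumed here to mean the fraction of compounds whose selected candidate is judged correct, which the text does not define explicitly. Only the rules themselves come from the text: drop compounds whose candidates are all correct or all incorrect, and let the baseline pick the candidate that is most frequent in the English corpus.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical record for one English translation candidate."""
    text: str
    freq: int       # occurrences in the English corpus (invented numbers below)
    correct: bool   # manual adequacy judgment


def keep_for_test(cands):
    """Keep a compound only if it has both correct and incorrect candidates."""
    return {c.correct for c in cands} == {True, False}


def baseline_pick(cands):
    """Frequency baseline: choose the most frequent candidate."""
    return max(cands, key=lambda c: c.freq)


def precision(test_set, pick):
    """Assumed definition: share of compounds whose picked candidate is correct."""
    hits = sum(1 for cands in test_set.values() if pick(cands).correct)
    return hits / len(test_set)


# Toy data: frequencies are invented; compound_b to compound_d are placeholders.
candidates = {
    "shouken gaisha": [Candidate("securities company", 120, True),
                       Candidate("paper company", 300, False)],
    "compound_b": [Candidate("b1", 500, True), Candidate("b2", 20, False)],
    "compound_c": [Candidate("c1", 10, True), Candidate("c2", 5, True)],
    "compound_d": [Candidate("d1", 10, False)],
}

# compound_c (all correct) and compound_d (all incorrect) are removed.
test_set = {cj: cands for cj, cands in candidates.items() if keep_for_test(cands)}
print(sorted(test_set))
print(precision(test_set, baseline_pick))
```

Passing a different `pick` function, for example one that returns the candidate with the highest context-vector similarity, would score the proposed method under the same assumed precision definition.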