File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/evalu/96/c96-2098_evalu.xml
Size: 6,653 bytes
Last Modified: 2025-10-06 14:00:20
<?xml version="1.0" standalone="yes"?> <Paper uid="C96-2098"> <Title>Extraction of Lexical Translations from Non-Aligned Corpora</Title> <Section position="7" start_page="582" end_page="584" type="evalu"> <SectionTitle> 5 Experiments </SectionTitle> <Paragraph position="0"> Two experiments, local and global, were performed by choosing the Japanese translations for English words. The corpora adopted are the 30M Wall Street Journal and 33M political and economic articles of Asahi Newspaper.</Paragraph> <Paragraph position="1"> These were morphologically analyzed to extract nouns, verbs, adjectives and adverbs in canonical forms. Co-occurrences were counted using an 11-word window. A and B were created as described in Section 2.1. Elements below certain thresholds were set to 0.0. The initial bilingual dictionary used was Edict (Breen, 1995), a public word-to-word dictionary.</Paragraph> <Section position="1" start_page="582" end_page="583" type="sub_section"> <SectionTitle> 5.1 Local Ambiguity Resolution </SectionTitle> <Paragraph position="0"> We randomly extracted 11 successive words from the corpus. If the 6th (center) word was ambiguous and satisfied the following three conditions, the method explained in Section 3.1 was applied for disambiguation: its translations could be subjectively judged according to the context; the translations exist in Edict; Edict contains candidates other than the translation.</Paragraph> <Paragraph position="1"> The calculation choice was the candidate which exhibited the minimum F(T). If all the scores were the same, the word was judged unresolved.</Paragraph> <Paragraph position="2"> When our subjectively judged translations contained the calculation choice, it was counted as correct, otherwise as wrong. The experiment was performed until the ambiguity was resolved for 200 different words.</Paragraph> <Paragraph position="3"> Table 1 shows the results. 
The applicability, the rate of words which were not unresolved, was [...] (Footnote: PC-KIMMO and JUMAN were used for the morphological analysis.)</Paragraph> <Paragraph position="4"> [...]sion), the rate of the correct candidates among the words not unresolved, was 82.1% (124/(124+27)).</Paragraph> <Paragraph position="5"> The general trends found are as follows: * Translations reflect the trends in the corpus.</Paragraph> <Paragraph position="6"> For example, for doctor, I~ilf was calculated to be the best choice. Although I~ was also a candidate meaning medical doctor, it was dropped, because \[~ is a rather uncommon usage in the corpus.</Paragraph> <Paragraph position="7"> * Most words with two obviously different meanings were resolved correctly.</Paragraph> <Paragraph position="8"> The applicability depends on the window size: the window should be large enough to pin down the meaning of the word in question. The smaller the window, the lower the rate should be. However, even if the window is widened, the rate should eventually reach a certain limit.</Paragraph> </Section> <Section position="2" start_page="583" end_page="584" type="sub_section"> <SectionTitle> 5.2 Global Extraction of Translations </SectionTitle> <Paragraph position="0"> Example of doctor. Figure 4 shows a small graph concerning doctor. The values attached to branches represent co-occurrences. Figure 5 shows the corresponding graph in Japanese. We initially defined A and B from these graphs, and T0 as each English word corresponding one-to-one to the Japanese word (with a value of 1.0), except that three ambiguous words have the following correspondences: doctor → ~$ (0.333), is+- (0.333), ~ (0.334); patient → ~C/~J-~ (0.5), ,,~ (0.5); paper → ~ (0.5), ~ (0.5). SDM was applied to T0 and its convergence was judged with the first 5 digits of F(T). Convergence required 3400 iterations. The result T3400 is as follows: The wrong translation doctor → ~ was dropped. 
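As a minimal sketch of the initialization described above (not the authors' implementation), the following Python builds an initial table T0 that gives each candidate translation of a word an equal share, and compares two values of F(T) by their leading digits. The names build_t0 and same_leading_digits, the placeholder translations ja_1 to ja_6, the rounding scheme (three decimals, with the remainder given to the last candidate), and the reading of "first 5 digits" as five significant digits are all assumptions made for illustration.

```python
from decimal import Decimal, ROUND_DOWN


def build_t0(dictionary):
    """Return an initial translation table with uniform weights per word.

    `dictionary` maps a source word to its list of candidate translations.
    Each weight is 1/n rounded down to three decimals; the remainder goes to
    the last candidate so that every row sums to exactly 1 (an assumption).
    """
    t0 = {}
    for word, candidates in dictionary.items():
        n = len(candidates)
        base = (Decimal(1) / n).quantize(Decimal("0.001"), rounding=ROUND_DOWN)
        weights = [base] * n
        weights[-1] = Decimal(1) - base * (n - 1)
        t0[word] = {c: float(w) for c, w in zip(candidates, weights)}
    return t0


def same_leading_digits(previous, current, digits=5):
    """Hypothetical convergence test: equal when rounded to `digits` significant digits."""
    fmt = "{:." + str(digits - 1) + "e}"
    return fmt.format(previous) == fmt.format(current)


# Placeholder candidates stand in for the Japanese words of the example.
example = {
    "doctor": ["ja_1", "ja_2", "ja_3"],
    "patient": ["ja_4", "ja_5"],
    "nurse": ["ja_6"],
}
t0 = build_t0(example)
print(t0["doctor"])
```

With three candidates the row comes out as 0.333, 0.333, 0.334, and with two candidates as 0.5, 0.5, matching the initial values quoted for doctor, patient and paper. The SDM update itself is not shown here. 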
Next, we removed from Figure 4 the portion of the graph which corresponds to the meaning of Ph.D. (Figure 6) so that the context was restricted to medical doctor. This time the result was: doctor → ~C/~ (1.0), is=t: (0.0), ~ (0.0); patient → ~C/J~5 (0.0), ~ (1.0). Then we removed from Figure 4 the portion of the graph which corresponded to the meaning of medical doctor (Figure 7) so that the context was restricted to Ph.D., giving the result: doctor → ~g/li (0.0), is+- (1.0), ~}~ (0.0); paper → ~$9: (0.996), ~ (0.004). These three small experiments show that the translation for doctor reflects the context represented by the source graph in LA.</Paragraph> <Paragraph position="1"> Minor Analysis of 378 words. The ideal experiment would be to calculate T for the entire dictionary and measure how well the obtained translations reflect the corpus context, but this is impractical because of both the calculation time and the difficulty of judging context reflection. Hence we intentionally added irrelevant translations to Edict to see whether our method drops them.</Paragraph> <Paragraph position="2"> The irrelevant translations were chosen randomly so that their number equals the number of translations originally in Edict. This was done for all English words in Edict. A was formed so that all the words involved are reachable within 2 co-occurrence branch distances from the test word. B was created from all translations of the words involved in A. The test words to which SDM was applied were selected by the following conditions: a test word has more than one candidate in Edict (i.e., it is ambiguous); all its co-occurrence values are greater than a certain threshold.</Paragraph> <Paragraph position="3"> If, through calculation, the candidates are separated into the following three categories: those which gain value, those which decrease in value, and those whose values do not change, then we define the word in question as applicable. 
The following rates were calculated for CDIW (correctly dropped irrelevant words, i.e., the irrelevant words added as noise and correctly dropped by the method) for each applicable test word: * The ratio of the number of CDIW to the number of dropped words. (correctness, recall) * The ratio of the number of CDIW to the number of irrelevant words. (coverage) The results are listed in Table 2.</Paragraph> <Paragraph position="4"> The applicability and coverage depend on the threshold: the lower the threshold, the higher the two rates, because more co-occurrence information is obtained. The threshold is therefore a trade-off between these rates and calculation time.</Paragraph> <Paragraph position="5"> About 15% (100 - 84.6) of the dropped candidates were dropped incorrectly; these were original translations contained in Edict. They did not match the context, similar to the case of (doctor → ~) shown in Section 5.1.</Paragraph> </Section> </Section> </Paper>