<?xml version="1.0" standalone="yes"?> <Paper uid="C94-1009"> <Title>BUILDING AN MT DICTIONARY FROM PARALLEL TEXTS BASED ON LINGUISTIC AND STATISTICAL INFORMATION</Title> <Section position="8" start_page="79" end_page="80" type="evalu"> <SectionTitle> 6 EVALUATION AND DISCUSSION </SectionTitle> <Paragraph position="0"> To evaluate this method, we estimated English translations of Japanese terms in seven parallel texts (Japanese specifications of patents on semiconductors and their English translations by human translators) and compared the translations with the correct data given by experts in building an MT dictionary. Each Japanese text contains 7,508 to 26,927 characters in 127 to 616 sentences; the seven texts total 99,286 characters in 2,148 sentences. Examples of correct translation pairs estimated with the highest TL are listed in Fig. 2.</Paragraph> <Paragraph position="1"> Table 2 shows the ranking of the correctly estimated translation pairs in the seven sample texts. The upper row shows the average over the seven individual texts; the lower row shows the result obtained using all seven texts at once. The translation of over 70% of compound nouns is obtained as the first candidate, and over 80% within the top three. For unknown words, the results are 54.0% and 65.0%. Although the accuracy for the unknown words is relatively low, this estimation was impossible for Yamamoto (1993). Here, the terms whose correct translations are not found in the English texts are excluded from evaluation. Such data occur when human experts give a noun translation for a Japanese verbal noun term that is translated as a verb in the actual text. The ratio of this kind of translation pair is about 3%. The rate of the correct data is calculated by the ratio of the total occurrences.</Paragraph> <Paragraph position="2"> The accuracy for the average of unknown words is 52.4% in the top three.
The result using all texts is significantly better than the average because the statistical information is the major factor in the current implementation. Use of more linguistic information, such as in Dangan (1991) and Matsumoto (1993), would improve the total performance.</Paragraph> <Paragraph position="3"> Linguistic information has proven effective for estimating translations of low-frequency terms. Of the 327 terms that appeared only once in a Japanese text, 215 (65.7%) had their translations obtained correctly as the first candidate across the seven texts.</Paragraph> <Paragraph position="4"> The fourth example of compound nouns in Fig. 2 shows the advantage of statistical information, because the correct translation was obtained in spite of the wrong word segmentation. The Japanese term really consists of three words (カラム, アドレス, ストローブ), which correspond to &quot;column,&quot; &quot;address,&quot; and &quot;strobe,&quot; respectively. However, word segmentation output four words (カラム, アドレス, スト, ローブ) because &quot;ストローブ&quot; is unknown and &quot;スト&quot; is known as &quot;strike.&quot;</Paragraph> <Paragraph position="5"> The cases where no correct translation has been obtained need to be examined. The major reasons for failures are: 1. Errors in mapping corresponding units.</Paragraph> <Paragraph position="6"> 2. Errors in word segmentation of unknown compound words.</Paragraph> <Paragraph position="7"> Mapping unit errors occur when the one-to-one unit correspondence does not exist. The experiment using one text shows that 12 out of 98 Japanese sentences have no one-to-one corresponding English sentence. For better unit correspondence, the units should be smaller, for example, a clause or a verb phrase, so as to raise the correspondence accuracy and the frequency in the text and to make statistical information more effective.
This would improve the unit mapping when one Japanese sentence is translated into several English sentences or vice versa.</Paragraph> <Paragraph position="8"> The segmentation errors of unknown words often arise in the case of Katakana compound words.</Paragraph> <Paragraph position="9"> Katakana is the phonetic alphabet in Japanese for spelling foreign words. Since many compound nouns in a technical field consist of Katakana with no space between component words, a much larger lexicon will contribute to more accurate segmentation.</Paragraph> </Section> </Paper>