<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0107"> <Title>Automatic Extraction of Word Sequence Correspondences in Parallel Corpora</Title> <Section position="7" start_page="81" end_page="86" type="evalu"> <SectionTitle> 5 Experiments of Translation Pair Extraction </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="81" end_page="81" type="sub_section"> <SectionTitle> 5.1 The settings </SectionTitle> <Paragraph position="0"> We used parallel corpora from three distinct domains: (1) a computer manual (9,792 sentence pairs), (2) a scientific journal (12,200 sentence pairs), and (3) business contract letters (10,016 sentence pairs). All the Japanese and English sentences are aligned and morphologically analyzed.</Paragraph> <Paragraph position="1"> The settings of the experiments are as follows. The maximum length of the extracted word sequences is set at 10. The initial value of f_min is set to half of the highest number of occurrences among the extracted word sequences. It is then halved repeatedly until it reaches 10 or below, after which it is lowered by one in each iteration until it reaches 2.</Paragraph> <Paragraph position="3"> The table shows the numbers of distinct content words, the numbers of those with two or more occurrences, and the numbers of word sequences with two or more occurrences.</Paragraph> </Section> <Section position="2" start_page="81" end_page="86" type="sub_section"> <SectionTitle> 5.2 The results </SectionTitle> <Paragraph position="0"> Tables 3, 4 and 5 show the statistics obtained from the experiments. The columns specify the numbers of approved translation pairs. The correctness of the translation pairs is checked by a human inspector. 
A &quot;near miss&quot; means that the pair is not perfectly correct but some parts of the pair constitute a correct translation.</Paragraph> <Paragraph position="1"> It is noticeable that pairs with high frequencies give very accurate translations in the cases of the computer manual and the business letters, whereas in the scientific journal highly frequent pairs do not necessarily give high accuracy. The reason is that the former two corpora each belong to a homogeneous domain, while the scientific journal corpus is a mixture of distinct scientific fields. The former two corpora perform worse on pairs extracted at low frequency thresholds. This is because those corpora frequently contain lengthy fixed expressions or particular collocations. One such example is that &quot;p type (silicon)&quot; frequently collocates with &quot;n type (silicon),&quot; making the correspondence uncertain.</Paragraph> <Paragraph position="2"> The scientific journal shows stable accuracy in translation pair extraction. The accuracy exceeds 90% in most of the stages. The reason would be that scientific papers do not repeat fixed expressions and terminology is not used in a fixed way.</Paragraph> <Paragraph position="3"> Table 6 summarizes the combinations of the lengths of English and Japanese word sequences.</Paragraph> <Paragraph position="4"> The fraction in each entry shows the number of correct pairs over the number of extracted pairs.</Paragraph> <Paragraph position="5"> This table indicates that translation pairs of lengthy or unbalanced sequences are safely regarded some of typical word sequence pairs. Many Japanese translations of English technical terms are automatically detected. Table 8 lists the top 30 pairs from the experiment on the business contract letters.</Paragraph> <Paragraph position="6"> The method is capable of finding interesting translation patterns. 
For example, &quot;~l~l~&quot; and &quot;~ll~&quot; are found to correspond to &quot;trade secret&quot; and &quot;business hour&quot; respectively. Note that the Japanese word &quot;~&quot; is translated into different English words depending on the word it co-occurs with.</Paragraph> <Paragraph position="7"> Table 9 shows the recall ratio based on the results of the experiments. The figures show the numbers of words that are included in at least one extracted translation pair. The recall rates, shown in parentheses, indicate what proportion of the words with two or more occurrences in the corpora ultimately participate in at least one translation pair. The major reason that the recall is not higher is that we decided to use a rather strict condition for selecting translation pairs in Step 6 of the algorithm. The condition may be loosened to improve recall, though possibly at the cost of precision. We have not yet tested our method with other conditions.</Paragraph> </Section> </Section> </Paper>