<?xml version="1.0" standalone="yes"?> <Paper uid="E06-1029"> <Title>Compiling French-Japanese Terminologies from the Web</Title> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Term Alignment </SectionTitle> <Paragraph position="0"> Once we have collected related terms in both French and Japanese, we must link the terms in the source language to the terms in the target language. Our alignment procedure is twofold.</Paragraph> <Paragraph position="1"> First, we first generate Japanese translation candidates for each collected French term. Second, we select the most likely translation(s) from the set of candidates. This is similar to the generation and selection procedures used in the literature (Baldwin and Tanaka (2004), Cao and Li, Langkilde and Knight (1998)).</Paragraph> <Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.1 Translation Candidates Generation </SectionTitle> <Paragraph position="0"> Translation candidates are generated using a compositional method, which can be divided in three steps. First, we decompose the French MWTs into combinations of shorter MWU elements. Second, we look up the elements in bilingual dictionaries. Third, we recompose translation candidates by generating different combinations of translated elements.</Paragraph> <Paragraph position="1"> Decomposition In accordance with Daille et al., we define the length of a MWU as the number of content words it contains. Let n be the length of the MWT to decompose. We produce all the combinations of MWU elements of length less or equal to n. For example, consider the French translation of &quot;knowledge based system&quot;: It has a length of three and yields the following four combinations : Note the treatment given to the prepositions and determiners: we leave them in place when they are interposed between content words within elements, otherwise we remove them.</Paragraph> <Paragraph position="2"> Dictionary Lookup We look up each element in bilingual dictionaries. Because some words appear in their inflected forms, we use their lemmata. In the example given above, we look up connaissance (lemma) rather than connaissances (inflected). Note that we do not lemmatize MWUs such as base de connaissances. This is due to the complexity of gender and number agreements of French compounds. However, only a small part of the MWTs are collected in their inflected forms, and French-Japanese bilingual dictionaries do not contain that many MWTs to begin with. The performance hit should therefore be minor.</Paragraph> <Paragraph position="3"> Already at this stage, we can anticipate problems arising from the insufficient coverage of including itself.</Paragraph> <Paragraph position="4"> systeme a base de connaissances Noun Prep Noun Prep Noun [systeme a [base de [connaissances] [systeme] [base de [connaissances] [systeme a [base] [connaissances] [systeme] [base] [connaissances] French-Japanese lexicon resources. Bilingual dictionaries may not have enough entries, and existing entries may not include a great variety of translations for every sense. The former problem has no easy solution, and is one of the reasons we are conducting this research. 
<Paragraph position="5"> Recomposition To recompose the translation candidates, we simply generate all suitable combinations of translated elements for each decomposition. The word order is inverted to take into account the different constraints in French and Japanese. In the example above, if the lookup phase gave {体系 taikei, システム shisutemu}, {土台 dodai, ベース besu} and {知識 chishiki} as respective translation sets for système, base and connaissance, the fourth decomposition given above would yield the following candidates: 知識土台体系, 知識土台システム, 知識ベース体系 and 知識ベースシステム. If we do not find any translation for one of the elements, the generation fails.</Paragraph> </Section>
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.2 Translation Selection </SectionTitle>
<Paragraph position="0"> Selection consists of picking the most likely translation from the translation candidates we have generated. To discern the likely from the unlikely, we use the empirical evidence provided by the set of Japanese terms related to the seed.</Paragraph>
<Paragraph position="1"> We believe that if a candidate is present in that set, it could well be a valid translation, as the French MWT in consideration is also related to the seed. Accordingly, our selection process consists of picking those candidates for which we find a complete match among the related terms.</Paragraph> </Section>
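<Paragraph> Taken together, recomposition and selection amount to a Cartesian product over the element translation sets, joined in reversed order, followed by an exact match against the collected Japanese related terms. The following Python sketch illustrates this; the function names and the toy lexicon are illustrative assumptions, not the actual resources used in the experiments.

from itertools import product

def generate_candidates(decomposition, lexicon):
    """Combine the translations of each element, with the element order reversed.

    decomposition: list of French elements, e.g. ["système", "base", "connaissances"].
    lexicon: dict mapping an element to its set of Japanese translations
    (lemmatisation is left out here). Returns an empty set when an element
    has no translation, i.e. the generation fails.
    """
    translation_sets = []
    for element in reversed(decomposition):      # invert the word order for Japanese
        translations = lexicon.get(element, set())
        if not translations:
            return set()
        translation_sets.append(translations)
    return {"".join(combo) for combo in product(*translation_sets)}

def select(candidates, related_japanese_terms):
    """Keep only the candidates that exactly match a collected related term."""
    return candidates.intersection(related_japanese_terms)

lexicon = {"système": {"体系", "システム"},
           "base": {"土台", "ベース"},
           "connaissances": {"知識"}}
candidates = generate_candidates(["système", "base", "connaissances"], lexicon)
print(select(candidates, {"知識ベースシステム", "自然言語処理"}))  # {'知識ベースシステム'}
</Paragraph>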
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 3.3 Relevance of Compositional Methods </SectionTitle>
<Paragraph position="0"> The automatic translation of MWTs is no simple task, and it is worth asking whether it is best tackled with a compositional method. Intricate problems have been reported with the translation of compounds (Daille and Morin, Baldwin and Tanaka), notably: * fertility: source and target MWTs can be of different lengths. For example, table de vérité (truth table) contains two content words and translates into 真理値表 shinri chi hyo (lit. truth-value-table), which contains three.</Paragraph>
<Paragraph position="1"> * variability of forms in the translations: MWTs can appear in many forms. For example, champ électromagnétique (electromagnetic field) translates both into 電磁場 denji ba (lit. electromagnetic field) and 電磁界 denji kai (lit. electromagnetic &quot;region&quot;).</Paragraph>
<Paragraph position="2"> * constructional variability in the translations: source and target MWTs have different morphological structures. For example, in the pair apprentissage automatique - 機械学習 kikai gakushu (machine learning) we have (N-Adj) - (N-N). In the pair programmation par contraintes - パタン認識 patan ninshiki (pattern recognition) we have (N-par-N) - (N-N).</Paragraph>
<Paragraph position="3"> * non-compositional compounds: some compounds' meaning cannot be derived from the meaning of their components. For example, the Japanese term 赤点 aka ten (failing grade, lit. &quot;red point&quot;) translates into French as note d'échec (lit. failing grade) or simply échec (lit. failure).</Paragraph>
<Paragraph position="4"> * lexical divergence: source and target MWTs can use different lexica to express a concept. For example, traduction automatique (machine translation, lit. &quot;automatic translation&quot;) translates as 機械翻訳 kikai honyaku (lit. machine translation).</Paragraph>
<Paragraph position="5"> It is hard to imagine any method that could address all these problems accurately. Tanaka and Baldwin (2003) found that 48.7% of English-Japanese Noun-Noun compounds translate compositionally. In a preliminary experiment, we found this to be the case for as much as 75.1% of the collected MWTs. If we are to maximize the coverage of our system, it is sensible to start with a compositional approach. We will not deal with the problems of fertility and non-compositional compounds in this paper.</Paragraph>
<Paragraph position="6"> Nonetheless, lexical divergence and variability issues will be partly tackled by the broader translations and related words given by thesauri.</Paragraph> </Section> </Section>
<Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Evaluation </SectionTitle> <Paragraph position="0"/>
<Section position="1" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.1 Linguistic Resources </SectionTitle>
<Paragraph position="0"> The bilingual dictionaries used in the experiments are the Crown French-Japanese Dictionary (Ohtsuki et al. (1989)) and the French-Japanese Scientific Dictionary (French-Japanese Scientific Association (1989)). The former contains about 50,000 general-usage single-word entries. The latter contains about 50,000 entries covering both single-word and multi-word scientific terms. The two complement each other, and by combining their entries we form our base dictionary, which we refer to as Dic_FJ.</Paragraph>
<Paragraph position="1"> The main thesaurus used is Bunrui Goi Hyo (National Institute for Japanese Language (2004)). It contains about 96,000 words, and each entry is organized in two levels: a list of synonyms and a list of more loosely related words. We augment the initial translation set by looking up in the synonym lists the Japanese words given by Dic_FJ; the resulting dictionary is denoted Dic_FJJ. The dictionary further combined with the more loosely related words is denoted Dic_FJJ2.</Paragraph>
<Paragraph position="2"> Finally, we build another thesaurus from a Japanese-English dictionary. We use Eijiro (Electronic Dictionary Project (2004)), which contains 1,290,000 entries. For a given Japanese entry, we look up its English translations. The Japanese translations of the English intermediaries are used as synonyms/related words of the entry. The resulting thesaurus is expected to provide even more loosely related translations (and also many irrelevant ones). We denote it Dic_FJEJ.</Paragraph> </Section>
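<Paragraph> The construction of the pivot thesaurus Dic_FJEJ can be pictured with a few lines of Python: two Japanese words are treated as related when they share an English translation. The function name and the toy entries are illustrative assumptions; Eijiro itself is of course far larger.

from collections import defaultdict

def pivot_thesaurus(ja_en):
    """Relate Japanese words that share at least one English translation."""
    en_to_ja = defaultdict(set)
    for ja, en_words in ja_en.items():
        for en in en_words:
            en_to_ja[en].add(ja)
    related = defaultdict(set)
    for ja, en_words in ja_en.items():
        for en in en_words:
            related[ja] |= en_to_ja[en] - {ja}   # an entry is not its own related word
    return related

# toy entries standing in for the Japanese-English dictionary
ja_en = {"翻訳": {"translation"},
         "機械翻訳": {"machine translation", "automatic translation"},
         "自動翻訳": {"automatic translation"}}
print(pivot_thesaurus(ja_en)["機械翻訳"])  # {'自動翻訳'}
</Paragraph>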
<Section position="2" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.2 Notation </SectionTitle>
<Paragraph position="0"> Let F and J be the two sets of related terms collected in French and Japanese. F' is the subset of F for which Jac ≥ 0.01: F' = { f ∈ F | Jac(f) ≥ 0.01 }. F'* is the subset of valid related terms in F', as determined by human evaluation. P is the set of all potential translation pairs among the collected terms (P = F × J). P' is the set of pairs containing either a French term or a Japanese term with Jac ≥ 0.01: P' = { (f, j) ∈ P | Jac(f) ≥ 0.01 or Jac(j) ≥ 0.01 }.</Paragraph>
<Paragraph position="1"> P'* is the subset of valid translation pairs in P', determined by human evaluation. These pairs need to respect three criteria: 1) contain valid terms, 2) be related to the seed, and 3) constitute a valid translation. M is the set of all translations selected by our system. M' is the subset of pairs in M with Jac ≥ 0.01 for either the French or the Japanese term; it is also the output of our system: M' = { (f, j) ∈ M | Jac(f) ≥ 0.01 or Jac(j) ≥ 0.01 }. M'* is the intersection of M' and P'*, or in other words, the subset of valid translation pairs output by our system.</Paragraph> </Section>
<Section position="3" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.3 Baseline Method </SectionTitle>
<Paragraph position="0"> Our starting point is the simplest possible alignment, which we refer to as our baseline. It is worked out by using each of the aforementioned dictionaries independently. The output set obtained with Dic_FJ is denoted FJ, the one obtained with Dic_FJJ is denoted FJJ, and so on. The experiment is carried out using the eight seed pairs given in Table 1. On average, we have |F'| = 74.3, |F'*| = 51.0 and |P'*| = 24.0. Table 2 gives a summary of the key results. The precision and the recall are given by: precision = |M'*| / |M'| and recall = |M'*| / |P'*|.</Paragraph>
<Paragraph position="1"> Dic_FJ contains only Japanese translations corresponding to the strict sense of the French elements. Such a dictionary generates only a few translation candidates, which tend to be correct when present in the target set. On the other hand, the thesauri translate the MWT elements with more laxity, generating more translations and thus more alignments, at the cost of some precision.</Paragraph> </Section>
<Section position="4" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.4 Incremental Selection </SectionTitle>
<Paragraph position="0"> The progressive increase in recall given by the increasingly looser translations is offset by a corresponding decrease in precision, which hints that we should give precedence to the alignments obtained with the more accurate methods. Consequently, we start by adding the alignments in FJ to the output set. Then, we augment it with the alignments from FJJ whose terms are not already in FJ. The resulting set is denoted FJJ'.</Paragraph>
<Paragraph position="1"> We then augment FJJ' with the pairs from FJJ2 whose terms are not in FJJ', and so on, until we exhaust the alignments in FJEJ.</Paragraph>
<Paragraph position="2"> For instance, let FJ contain (synthèse de la parole - 音声合成 onsei gousei (speech synthesis)) and FJJ contain this pair plus (synthèse de la parole - 音声解析 onsei kaiseki (speech analysis)). In the first iteration, the pair in FJ is added to the output set. In the second iteration, no pair is added because the output set already contains an alignment with synthèse de la parole.</Paragraph>
<Paragraph position="3"> Table 3 gives the results for each incremental step. We can see an increase in precision for FJJ', FJJ2' and FJEJ' of respectively 5%, 9% and 8%, compared to FJJ, FJJ2 and FJEJ. We are effectively filtering output pairs and, as expected, the increase in precision is accompanied by a slight decrease in recall. Note that, because FJEJ is not a superset of FJJ2, we see an increase in both precision and recall in FJEJ' over FJEJ. Nonetheless, the precision yielded by FJEJ' is not sufficient, which is why Dic_FJEJ is left out in the next experiment.</Paragraph> </Section>
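<Paragraph> The incremental selection can be sketched as a merge of the alignment sets in order of decreasing precision, where a pair is skipped if one of its terms is already covered. The following Python fragment reproduces the worked example above; the data and the exact duplicate test are illustrative assumptions.

def incremental_selection(alignment_sets):
    """Merge alignment sets, giving precedence to the more precise ones.

    alignment_sets: list of sets of (french, japanese) pairs, ordered from the
    most precise (FJ) to the least precise. A pair is added only if neither of
    its terms has been aligned by an earlier, more precise set.
    """
    output, seen_fr, seen_ja = [], set(), set()
    for alignments in alignment_sets:
        for fr, ja in sorted(alignments):
            if fr in seen_fr or ja in seen_ja:
                continue                      # already covered by a more precise set
            output.append((fr, ja))
            seen_fr.add(fr)
            seen_ja.add(ja)
    return output

FJ = {("synthèse de la parole", "音声合成")}
FJJ = FJ | {("synthèse de la parole", "音声解析")}
print(incremental_selection([FJ, FJJ]))  # the FJJ-only pair is filtered out
</Paragraph>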
<Section position="5" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.5 Bootstrapping </SectionTitle>
<Paragraph position="0"> The coverage of the system is still shy of the 20 pairs/seed objective we gave ourselves. One cause for this is the small number of valid translation pairs available in the corpora. From an average of 51 valid related terms in the source set, only 24 have their translation in the target set.</Paragraph>
<Paragraph position="1"> To counter that problem, we increase the coverage of Japanese related terms and hope that by doing so, we will also increase the coverage of the system as a whole.</Paragraph>
<Paragraph position="2"> Once again, we exploit the high precision of the baseline method. The average 10.5 pairs in FJ include 92% of Japanese terms semantically similar to the seed. By feeding these terms back into the term collection system, we collect many more terms, some of which are probably the translations of our French MWTs.</Paragraph>
<Paragraph position="3"> The results for the baseline method with bootstrapping are given in Table 4. The ones using incremental selection and bootstrapping are given in Table 5. FJ+ consists of the alignments given by a generation process using Dic_FJ and a selection performed on the augmented set of related terms; FJJ+ is obtained likewise with Dic_FJJ, and so on.</Paragraph>
<Paragraph position="4"> The bootstrap mechanism grows the target term set tenfold, making it very laborious to identify all the valid translation pairs manually. Consequently, we only evaluate the pairs output by the system, which makes it impossible to calculate recall. Instead, we use the number of valid translation pairs as a makeshift measure. Bootstrapping successfully allows many more translation pairs to be found: FJ+, FJJ+ and FJJ2+ respectively contain 7.6, 8.7 and 8.5 more valid alignments on average than FJ, FJJ and FJJ2. The augmented target term set is noisier than the initial set, and it produces many more invalid alignments as well. Fortunately, the incremental selection effectively filters out most of the unwanted pairs, restoring the precision to acceptable levels.</Paragraph> </Section>
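<Paragraph> A single bootstrapping round can be sketched as follows in Python. Here collect_related_terms and align stand in for the term collection and alignment stages described above; they, and the toy data, are assumed names used only for illustration.

def bootstrap(french_terms, japanese_related, fj_alignments, collect_related_terms, align):
    """Reuse the precise Japanese translations in FJ as new seeds for collection."""
    augmented = set(japanese_related)
    for _, japanese_term in fj_alignments:
        augmented |= set(collect_related_terms(japanese_term))
    # rerun candidate generation and selection against the enlarged target set
    return align(french_terms, augmented)

# toy stand-ins for the real collection and alignment stages
def collect_related_terms(seed):
    return {"音声認識", "音声解析"} if seed == "音声合成" else set()

def align(french_terms, japanese_terms):
    gold = {("reconnaissance de la parole", "音声認識")}
    return {(f, j) for f in french_terms for j in japanese_terms if (f, j) in gold}

print(bootstrap({"reconnaissance de la parole"}, {"音声合成"},
                {("synthèse de la parole", "音声合成")}, collect_related_terms, align))
</Paragraph>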
<Section position="6" start_page="2" end_page="2" type="sub_section"> <SectionTitle> 4.6 Analysis </SectionTitle>
<Paragraph position="0"> A comparison of all the methods is given by the precision versus valid alignments curves of Figure 2. The points on the four curves are taken from Tables 2 to 5. The gap between the dotted and filled curves clearly shows that bootstrapping increases coverage. The respective positions of the squares and crosses show that incremental selection effectively filters out erroneous alignments. FJJ+', with 19.6 valid alignments and a precision of 81%, is at the rightmost and uppermost position in the graph. The detailed results for each seed are presented in Table 6, and the complete output for the seed &quot;logic circuit&quot; is given in Table 7.</Paragraph>
<Paragraph position="1"> Of the average 4.7 erroneous pairs/seed, 3.2 (68%) were correct translations but were judged unrelated to the seed. This is not surprising, considering that our set of French related terms contained only 69% (51/74.3) valid related terms. Also note that, of the 24.3 pairs/seed output, 5.25 are listed in the French-Japanese Scientific Dictionary. However, only 3.9 of those pairs are included in M'*; the others were deemed unrelated to the seed.</Paragraph>
<Paragraph position="2"> In the output set of &quot;machine translation&quot;, 自然言語処理 shizen gengo shori (natural language processing) is aligned to both traitement du langage naturel and traitement des langues naturelles. The system captures the term's variability around langue/langage. Lexical divergence is also taken into account to some extent. The seed computational linguistics yields the alignment of langue maternelle (mother tongue) with 母国語 bokoku go (literally [[mother-country]-language]). The usage of thesauri enabled the system to include the concept of country in the translated MWT, even though it is not present in any of the French elements.</Paragraph> </Section> </Section> </Paper>