File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/abstr/02/c02-1002_abstr.xml
Size: 12,513 bytes
Last Modified: 2025-10-06 13:42:17
<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1002">
<Title>A cheap and fast way to build useful translation lexicons</Title>
<Section position="1" start_page="0" end_page="2" type="abstr">
<SectionTitle> Abstract </SectionTitle>
<Paragraph position="0"> The paper presents a statistical approach to the automatic building of translation lexicons from parallel corpora. We briefly describe the pre-processing steps, a baseline iterative method, and the actual algorithm. The evaluation of the two algorithms is presented in some detail in terms of precision, recall and processing time. We conclude by briefly presenting some of our applications of the multilingual lexicons extracted by the method described herein.</Paragraph>
<Paragraph position="1"> Introduction
The scientific and technological advancement in many domains is a constant source of new term coinage, and therefore keeping up with multilingual lexicography in such areas is very difficult unless computational means are used. Translation lexicons, based on the translation equivalence relation, are lexical knowledge sources that can be extracted from parallel texts (and even from comparable texts) with very limited human resources. These translation lexicons appear to be quite different from the corresponding printed lexicons, which are meant for human users. There are well-known reasons for these differences, which we will not discuss here, but it is precisely these differences that make such lexicons very useful (in spite of their inherent noise) in many computer-based applications.</Paragraph>
<Paragraph position="2"> We will discuss some of our experiments based on automatically extracted multilingual lexicons. Most modern approaches to the automatic extraction of translation equivalents rely on statistical techniques and roughly fall into two categories. The hypothesis-testing methods, such as Gale and Church (1991), Smadja et al. (1996), Tiedmann (1998), Ahrenberg (2000), Melamed (2001), etc., use a generative device that produces a list of translation equivalence candidates (TECs), extracted from corresponding segments of the parallel texts (translation units, TUs), each of which is subjected to a statistical independence test. The TECs that show an association measure higher than expected under the independence assumption are taken to be translation equivalence pairs (TEPs). The TEPs are extracted independently of one another, and therefore the process might be characterized as a local (greedy) maximization. The estimation approaches, such as Brown et al. (1993), Kay and Roscheisen (1993), Kupiec (1993), Hiemstra (1997), etc., are based on building from the data a statistical bitext model, the parameters of which are estimated according to a given set of assumptions. The bitext model allows for a global maximization of the translation equivalence relation, considering not individual translation equivalents but sets of translation equivalents (sometimes called assignments). There are pros and cons for each type of approach, some of them discussed in Hiemstra (1997).</Paragraph>
<Paragraph position="3"> Our translation equivalents extraction process may be characterized as a &quot;hypothesis testing&quot; approach and does not need a pre-existing bilingual lexicon for the considered languages.
If such a lexicon exists, it can be used to eliminate spurious candidate translation equivalence pairs and thus to speed up the process and increase its accuracy.</Paragraph>
<Paragraph position="4"> 1 Assumptions, preprocessing and a baseline
There are several underlying assumptions one may adopt to keep the computational complexity of a translation lexicon extraction algorithm as low as possible. None of these hypotheses is true in general, but the situations where they do not hold are rare enough that ignoring the exceptions would neither produce a significant number of errors nor lose too many useful translations. The assumptions we made were the following:
* a lexical token in one half of the translation unit (TU) corresponds to at most one non-empty lexical unit in the other half of the TU; this is the 1:1 mapping assumption which underlies the work of many other researchers (Ahrenberg et al. (2000), Brew and McKelvie (1996), Hiemstra (1996), Kay and Roscheisen (1993), Tiedmann (1998), Melamed (2001), etc.);
* a polysemous lexical token, if used several times in the same TU, is used with the same meaning; this assumption is explicitly used by Gale and Church (1991) and Melamed (2001), and implicitly by all the previously mentioned authors;
* a lexical token in one part of a TU can be aligned to a lexical token in the other part of the TU only if the two tokens have compatible types (part-of-speech); in most cases, compatibility reduces to the same POS, but it is also possible to define other compatibility mappings (e.g. participles or gerunds in English are quite often translated as adjectives or nouns in Romanian and vice versa);
* although word order is not an invariant of translation, it is not random either (Ahrenberg et al. (2000)); when two or more candidate translation pairs are equally scored, the one containing tokens that are closer in relative position is preferred.</Paragraph>
<Paragraph position="6"> The proper extraction of translation equivalents requires special pre-processing:
* sentence alignment; we used a slightly modified version of CharAlign, described by Gale and Church (1993).</Paragraph>
<Paragraph position="7"> * tokenization; the segmenter we used (MtSeg, developed by P. di Cristo for the MULTEXT project: http://www.lpl.univ-aix.fr/projects/multext/MtSeg/) may process multiword expressions as single lexical tokens. The segmenter comes with tokenization resources for several Western European languages, further enhanced in the MULTEXT-EAST project (Dimitrova et al. (1998), Erjavec et al. (1998), Tufis et al. (1998)) with corresponding resources for Bulgarian, Czech, Estonian, Hungarian, Romanian and Slovene.</Paragraph>
<Paragraph position="8"> * tagging and lemmatization; we used a tiered tagging approach with combined language models (Tufis (1999, 2000)), based on Brants's TnT tagger.</Paragraph>
<Paragraph position="9"> After the sentence alignment, tagging and lemmatization, the first step is to compute a list of translation equivalence candidates (TECL).</Paragraph>
<Paragraph position="10"> This list contains several sub-lists, one for each POS considered in the extraction procedure.</Paragraph>
<Paragraph position="11"> Each POS-specific sub-list contains pairs of tokens <token_S, token_T> of the corresponding POS that appeared in the same TUs. TECL contains a lot of noise and many TECs are very improbable.</Paragraph>
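For concreteness, the sketch below shows how such a POS-restricted candidate list could be collected from the aligned, tagged and lemmatized TUs. It is only an illustration of the step described above, not the implementation used in the paper; the data layout (TUs as lists of (lemma, POS) tokens) and the POS tag names are assumptions made here.

```python
# A minimal sketch of the candidate-generation step; data structures and tag
# names are illustrative assumptions, not taken from the paper.
from collections import Counter
from itertools import product

def compatible(s_pos: str, t_pos: str) -> bool:
    """POS compatibility: identical tags, plus an English participle/gerund vs.
    adjective/noun mapping of the kind mentioned in the text (tags are invented)."""
    extra = {("VPART", "ADJ"), ("VPART", "NOUN"), ("GER", "NOUN"), ("GER", "ADJ")}
    return s_pos == t_pos or (s_pos, t_pos) in extra or (t_pos, s_pos) in extra

def build_tecl(aligned_tus):
    """aligned_tus: iterable of (source_tokens, target_tokens) pairs, where each
    token is a (lemma, pos) tuple produced by tagging and lemmatization.
    Returns one co-occurrence Counter per POS: the POS-specific sub-lists of TECL."""
    tecl = {}
    for src_tokens, tgt_tokens in aligned_tus:
        # every compatible token pair co-occurring in the same TU is a candidate
        for (s_lemma, s_pos), (t_lemma, t_pos) in product(src_tokens, tgt_tokens):
            if compatible(s_pos, t_pos):
                tecl.setdefault(s_pos, Counter())[(s_lemma, t_lemma)] += 1
    return tecl
```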
<Paragraph position="13"> In order to eliminate much of this noise, the most unlikely candidates are filtered out of TECL. The filtering is based on scoring the association between the tokens in a TEC.</Paragraph>
<Paragraph position="14"> For the ranking of the TECs and their filtering we experimented with four scoring functions: MI (pointwise mutual information), DICE, LL (log-likelihood) and chi-square. After empirical tests we decided to use the LL test, with the threshold value set to 9.</Paragraph>
<Paragraph position="15"> Our baseline algorithm, BASE, is a very simple and fast iterative algorithm that can be enhanced in many ways. It has some similarities to the iterative algorithm presented in Ahrenberg et al. (2000), but unlike it, our algorithm avoids computing various probabilities (or, better said, probability estimates) and scores (t-score). At each iteration step, the pairs that pass the selection (see below) are removed from TECL, so that this list is shortened after each step and may eventually be emptied. Based on TECL, for each POS an S_n × T_n contingency table is built, where S_n is the number of token types in the first part of the bitext (call it source) and T_n is the number of token types in the other part of the bitext (call it target). Source token types index the rows of the table and the target token types (of the same POS) index the columns. Each cell (i,j) contains the number of occurrences in TECL of the pair <T_Si, T_Tj>, denoted n(i,j).</Paragraph>
<Paragraph position="19"> At step k, the set of selected pairs TP_k contains the TECs <T_Si, T_Tj> that satisfy:
n(i,j) >= n(i,p) for any p ≠ j, and n(i,j) >= n(q,j) for any q ≠ i   (1)
n(i,j) >= 3   (2)
Equation (1) expresses the key idea of the iterative extraction algorithm: in order to select a TEC <T_Si, T_Tj> as a translation equivalence pair, the number of associations of T_Si with T_Tj must be at least as high as its number of associations with any other T_Tp (p ≠ j); the same holds the other way around. All the pairs selected in TP_k are removed (the respective counts are zeroed). If T_Si is translated in more than one way, the rest of its translations will be found in subsequent steps (if frequent enough); the most used translation of a token T_Si will be found first. Equation (2) represents a frequency relevance threshold, necessary in order to diminish the influence of data sparseness.</Paragraph>
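The following sketch illustrates the two steps just described: LL-based filtering of the candidate list and the iterative row/column-maximum selection. It is our reading of the text rather than the authors' code; the sparse representation of the contingency table and the helper names (log_likelihood, base_extract) are assumptions, while the thresholds (LL >= 9, frequency >= 3) are the ones stated above.

```python
# Illustrative sketch of LL filtering and the BASE-style iterative selection,
# under the assumptions stated in the lead-in.
import math
from collections import Counter

def log_likelihood(n_st, n_s, n_t, n):
    """Log-likelihood ratio for one candidate pair, from the pair count n_st,
    the marginal counts n_s and n_t, and the total number of candidate tokens n."""
    k = [[n_st, n_s - n_st], [n_t - n_st, n - n_s - n_t + n_st]]
    row = [k[0][0] + k[0][1], k[1][0] + k[1][1]]
    col = [k[0][0] + k[1][0], k[0][1] + k[1][1]]
    ll = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            if k[i][j] > 0 and expected > 0:
                ll += k[i][j] * math.log(k[i][j] / expected)
    return 2.0 * ll

def base_extract(counts, ll_threshold=9.0, min_count=3):
    """counts: Counter mapping (source_lemma, target_lemma) -> co-occurrence
    count for one POS (a sparse contingency table). Returns the pairs selected
    by the iterative row/column-maximum procedure sketched in the text."""
    counts = Counter(counts)
    n = sum(counts.values())
    src_marg, tgt_marg = Counter(), Counter()
    for (s, t), c in counts.items():
        src_marg[s] += c
        tgt_marg[t] += c
    # filter out the most unlikely candidates with the LL test
    counts = Counter({(s, t): c for (s, t), c in counts.items()
                      if log_likelihood(c, src_marg[s], tgt_marg[t], n) >= ll_threshold})
    extracted = []
    while True:
        row_max, col_max = {}, {}
        for (s, t), c in counts.items():
            row_max[s] = max(row_max.get(s, 0), c)
            col_max[t] = max(col_max.get(t, 0), c)
        # equations (1) and (2): maximal on both row and column, frequent enough
        step = [(s, t) for (s, t), c in counts.items()
                if c >= min_count and c == row_max[s] and c == col_max[t]]
        if not step:
            break
        extracted.extend(step)
        for pair in step:          # zero the counts of the selected pairs
            del counts[pair]
    return extracted
```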
<Paragraph position="20"> 2 An improved algorithm (BETA)
One of the main deficiencies of the BASE algorithm is that it is sensitive to what Melamed (2001) calls indirect associations: if <T_Si, T_Tj> is a correct translation pair and T_Tj frequently co-occurs in the same TUs with another source token T_Sk, then <T_Sk, T_Tj> will also show a strong association, although the two tokens are not translation equivalents. Although, as observed by Melamed (2001), the indirect associations generally have lower scores than the direct (correct) associations, they may still receive higher scores than many correct pairs; this will not only generate wrong translation equivalents but will also eliminate several correct pairs from further consideration, deteriorating the procedure's recall. To weaken this sensitivity, the BASE algorithm had to impose that the number of occurrences of a TEC be at least 3, thus filtering out more than 50% of all the possible TECs. Still, because of the indirect association effect, and in spite of a very good precision (more than 98%) on the considered pairs, approximately another 50% of the correct pairs were missed. The BASE algorithm has this deficiency because it looks at the association scores globally and does not check within the TUs whether the tokens producing the indirect association are still there.</Paragraph>
<Paragraph position="21"> To diminish the influence of the indirect associations and, consequently, to remove the frequency threshold, we modified the BASE algorithm so that the maximum score is considered not globally but within each of the TUs. This brings BETA closer to the competitive linking algorithm described in Melamed (2001). The competing pairs are only the TECs generated from the current TU, and the one with the best score is selected first. Based on the 1:1 mapping hypothesis, any TEC containing either of the tokens in the winning pair is discarded. Then the next best-scored TEC in the current TU is selected, and again the remaining pairs that include one of the two tokens in the selected pair are discarded. The multiple-step control in BASE, where each TU was scanned several times (as many times as there were iteration steps), is no longer necessary. The BETA algorithm sees each TU only once, but the TU is processed until no further TEPs can be reliably extracted or the TU is emptied. This modification improves both the precision and the recall in comparison with the BASE algorithm. In accordance with the 1:1 mapping hypothesis, when two or more TECs of the same TU share the same token and are equally scored, the algorithm has to make a decision and choose only one of them. If a seed lexicon exists and one of the competitors is in this lexicon, it is the winner. Otherwise, the decision is made on the basis of two heuristics: string similarity scoring and relative distance.</Paragraph>
<Paragraph position="22"> The similarity measure we used, COGN(T_S, T_T), is a string similarity (cognate) score, with its acceptance threshold empirically set to 0.42. This value depends on the pair of languages in the considered bitext. The actual implementation of the COGN test includes a language-dependent normalization step, which strips some suffixes, discards the diacritics, reduces some consonant doublings, etc. This normalization step was hand-written but, based on available lists of cognates, it could be automatically induced.</Paragraph>
<Paragraph position="23"> The second filtering condition, DIST(T_S, T_T), is based on the difference between the relative positions in the TU of the tokens T_S and T_T, respectively. The COGN(T_S, T_T) test takes precedence, so that the TEC with the highest similarity score is the preferred one. If the similarity score is irrelevant, the weaker filter DIST(T_S, T_T) gives priority to the pairs with the smallest relative distance between the constituent tokens.</Paragraph>
</Section>
</Paper>
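As a closing illustration, the sketch below shows the per-TU competitive selection of Section 2 together with the two tie-breaking heuristics. Only the overall control flow and the 0.42 similarity threshold come from the text; the use of difflib as a stand-in for COGN, the relative-distance formula, and all names are assumptions made here.

```python
# A compact sketch of BETA's per-TU competitive linking with COGN / DIST
# tie-breaking; see the lead-in above for what is assumed vs. taken from the text.
from difflib import SequenceMatcher

def cogn(s: str, t: str) -> float:
    """Stand-in cognate score: plain normalized string similarity (the paper's
    COGN also applies language-dependent normalization, omitted here)."""
    return SequenceMatcher(None, s, t).ratio()

def dist(i: int, j: int, src_len: int, tgt_len: int) -> float:
    """Difference between the relative positions of the two tokens in the TU."""
    return abs(i / src_len - j / tgt_len)

def beta_link_tu(src, tgt, score, seed=frozenset(), cogn_threshold=0.42):
    """src, tgt: lists of lemmas for one TU; score: dict mapping (s, t) lemma
    pairs to a global association score (e.g. LL); seed: optional seed lexicon.
    Returns the TEPs extracted from this TU."""
    cands = [(i, j) for i in range(len(src)) for j in range(len(tgt))
             if (src[i], tgt[j]) in score]
    links = []
    while cands:
        best = max(score[(src[i], tgt[j])] for i, j in cands)
        tied = [(i, j) for i, j in cands if score[(src[i], tgt[j])] == best]
        # tie-breaking: seed lexicon, then string similarity, then relative distance
        in_seed = [(i, j) for i, j in tied if (src[i], tgt[j]) in seed]
        if in_seed:
            tied = in_seed
        sims = {(i, j): cogn(src[i], tgt[j]) for i, j in tied}
        if max(sims.values()) >= cogn_threshold:
            i, j = max(tied, key=lambda p: sims[p])
        else:
            i, j = min(tied, key=lambda p: dist(p[0], p[1], len(src), len(tgt)))
        links.append((src[i], tgt[j]))
        # 1:1 mapping: drop every remaining candidate sharing a token with the winner
        cands = [(a, b) for a, b in cands if a != i and b != j]
    return links
```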