<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1166">
  <Title>An approach based on multilingual thesauri and model combination for bilingual lexicon extraction</Title>
  <Section position="3" start_page="0" end_page="0" type="metho">
    <SectionTitle>
3 Combining different models
</SectionTitle>
    <Paragraph position="0"> The previous section provides us with two different probabilistic lexical translation models: one derived from the standard method, and one based on the bilingual thesaurus. A third lexical translation model can be directly derived from the bilingual dictionary by considering the different translations of a given entry as equiprobable. For example, our dictionary associates abbilden with the two words depict and portray, thus P(depict|abbilden) = P(portray|abbilden) = 0.5. Note that these three models are not independent of each other, since the corpus is used, through the estimation of P(C|s) and P(t|C), in the thesaurus-based model, and the bilingual dictionary is used for translating context vectors in the corpus-based model. The final estimate of P(t|s) is then based on the following mixture of models:</Paragraph>
    <Paragraph position="2"> where i is an integer used to index the different models (here 1 a4 i a5 3), and P(i|f(s)) denotes the probability of selecting model i based on characteristics of s (f is a function mapping the source word to a set of relevant features) . The problem is now one of estimating the mixture weights P(i|f(s)), which can be done by maximizing the likelihood of some held-out data. To this end, we manually created a reference bilingual lexicon, part of which is reserved for estimating the mixture weights. Let l denote the part of the reference lexicon we use for estimation purposes, and l(s) the set of translations of s in l. The mixture weights are obtained through a standard constrained optimization problem, and are given by:  The set of features we retained aim at capturing the reliability of each model for a given source word. The reliability of the standard method can be indirectly measured through the frequency of s, the more frequent s is, the more reliable the information available to this method is. We capture this with a binary valued attribute, being 1 if s occurs at least 5 times in our corpus, and 0 otherwise. Similarly, the reliability of the thesaurus-based model uses a binary valued attribute which is 1 if s is close to the thesaurus (i.e. if )|(maxarg sCPC is greater than 0.5) and 0 otherwise. For the dictionary-based model, the reliability is directly computed in terms of presence/absence of s in the dictionary. The above thresholds were empirically tuned, and constitute what we believe to be a good compromise between fine-grained mixtures and data sparseness problems.</Paragraph>
    <Paragraph position="3"> Nevertheless, despite this tuning, some configurations of the above attributes still suffer estimation problems. Starting with a reference lexicon containing 1,800 translation pairs, we used 10 different splits into estimation and evaluation lexicons (two third of the data are reserved for estimation, one third for evaluation), and then estimated the mixture weights on each split. The results show that the variance for the configuration &amp;quot;low frequency, not in thesaurus, not in dictionary&amp;quot; is 10 times larger than the variance obtained for the other 7 configurations. Unfortunately, many source words fall into this configuration. We thus decided to fall back on a simplified version of equation 3 in which the dependence of i on f(s) is dropped (the adaptation of equation 4 is straightforward). This time, the variance is around 410a11 , 5 times lower than the lowest value previously obtained, thus rendering the estimation of the mixture weights more reliable.</Paragraph>
    <Paragraph position="4"> Table 1 below presents the mixture weights finally obtained for the different search methods.</Paragraph>
  </Section>
  <Section position="4" start_page="0" end_page="0" type="metho">
    <SectionTitle>
4 Linguistic preprocessing
</SectionTitle>
    <Paragraph position="0"> As a preprocessing step, we tag and lemmatize texts in both languages. This step allows us to focus on content words only (nouns, verbs, adjectives and adverbs), and reduces the noise in our model (content words are the primary focus for thesaurus enrichment and cross-language information retrieval). Nevertheless, since we use the (German, English) language pair for all our experiments, a major problem still resides in the difference in the word definition between the two languages, mainly due to the particular usage of compounding the German language has. Two alternatives are offered: either use a direct phrasal alignment, or decompose the German compounds into smaller units.</Paragraph>
    <Paragraph position="1"> Inasmuch as the models presented in the preceding sections implicitly assume a one-to-one correspondence between words in the two languages, we rely on the second strategy.</Paragraph>
    <Paragraph position="2"> However, an additional complication is introduced by the fact that our corpora belong to the medical domain, thus leaving our German lemmatizer clueless when it comes to decomposing medical compounds. We thus used two additional heuristics, recursively applied on all German words: - some sequences, e.g. -ungs-, -heits-, -keits-, schafts, -aets- and -ions-, as well as their plural forms, are considered as boundaries between two words in a compound, and break a word into  two parts - if a word is composed of the sequence AB, and if A and B are both longer than 3 characters  and both occur in the corpus, then the sequence AB is decomposed into A and B.</Paragraph>
    <Paragraph position="3"> The above heuristics reduce the number of different lemmas in the German vocabulary by 28% (from 14,700 to 10,500), while not hurting too much the quality of the vocabulary since their precision is estimated to be above 90%. For example, they allow us to accurately decompose the compound Adhaesionsileusbehandlung into the three parts Adhaesion, Ileus and Behandlung.</Paragraph>
  </Section>
  <Section position="5" start_page="0" end_page="0" type="metho">
    <SectionTitle>
5 Experiments and results
</SectionTitle>
    <Paragraph position="0"> To test the above models and their combination, we used roughly 700 abstracts from MEDLINE3, in German and English (each portion, German and English, contains approximately 100,000 words). These abstracts are &amp;quot;partial&amp;quot; translations of each other, because in some cases the English writer directly summarizes the articles in English, rather than translating the German abstracts. That set of abstracts is used both as  our comparable corpus, in which case we do not make use of alignment information, and as our parallel corpus (see section 6). There is a continuum from parallel corpora to fully unrelated texts, going through comparable corpora. The comparable corpus we use is in a way &amp;quot;ideal&amp;quot; and is biased in the sense that we know the translation of a German word of the German corpus to be, almost certainly, present in the English corpus. However, this bias, already present in previous works, does not impact the comparison of the methods we are interested in, all methods being equally affected. Indeed, the results we obtain with the standard method (see below) are in the range of those reported in previous works.</Paragraph>
    <Paragraph position="1"> As already mentioned, we manually extracted a reference lexicon comprising 1,800 translation pairs from our comparable corpus. From this, we reserve approximately 1,200 pairs for estimating the mixture weights, and 600 for the evaluation proper. All our results are averaged over 10 different such splits. Since the models we rely on yield a ranked set of translation candidates for each source word, and since one cannot expect the right translation to be the first candidate, we compute precision and recall of each method in the following way: for each pair (s,t) in the evaluation lexicon, we consider the first p candidates provided for s by the method under evaluation, and judge the set as correct if it contains t, as incorrect otherwise; precision is then obtained by dividing the number of correct sets by the number of sets proposed by the method for the words in the evaluation lexicon, whereas recall is obtained by dividing the number of correct sets by the number of pairs in the evaluation lexicon. In addition, we evaluate the average rank of the first correct translation in the proposed list of translations, for each method.</Paragraph>
    <Paragraph position="2"> Table 2 shows the results we obtained on our comparable corpora, for p=10, without combining the different models. ST50 refers to the subtree search strategy within the thesaurus, with n=50. The precision of the dictionary-based model is around 78%, which is not that bad considering the domain we focused on, but, as one can expect, its recall reaches only 48%.</Paragraph>
    <Paragraph position="3"> The F1-score, which combines precision and recall, obtained for the corpus-based model is similar to the ones obtained in previous works.</Paragraph>
    <Paragraph position="4">  obtained with the different search strategies for the thesaurus-based model: the Viterbi search , the complete one considering the first 100 and first 200 concept classes for each source word, and the subtree search with different values of n), and two different values for p, 5 and 10. The average rank is given next to each F1-score. As one can see, the combination significantly improves the results over the models alone, since the F1-score goes from 62% to 84%, a score that may be good enough to consider manual revisions.</Paragraph>
    <Paragraph position="5">  Furthermore, the best results are obtained with the subtree search, with n=20, thus validating our hypothesis that using the structure of the thesaurus is beneficial. One can note however that the results obtained with the complete search using 200 classes are close to the best results. Nevertheless, the optimal subtree search (ST20) uses 7.5 times less classes than the complete search, and is also two times faster.</Paragraph>
    <Paragraph position="6"> This proves that the subtree search is able to focus on accurate concept classes in the thesaurus, whereas the complete search needs considering more classes to reach a comparable level of performance. Interestingly, it also seems that the candidates provided by the subtree search closely correspond to a semantic field, whereas the ones given by the complete search are more varied. Where this to be the case, the subtree search would also certainly outperform the other methods when used for cross-language information retrieval. We will try to validate this hypothesis in future work.</Paragraph>
  </Section>
  <Section position="6" start_page="0" end_page="0" type="metho">
    <SectionTitle>
6 Bilingual terminology extraction
</SectionTitle>
    <Paragraph position="0"> Bilingual terminology extraction is based on three steps: word alignment, term extraction term alignment.</Paragraph>
    <Paragraph position="1"> In this section, we rely on the word to word translation lexicon obtained from the parallel corpus, following the method described in (Gaussier et al., 2000).</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.1 Term extraction
</SectionTitle>
      <Paragraph position="0"> For identifying German and English candidate terms we use the following patterns, similar to those proposed by (Heid, 1999) and (Blanck, 2000): 1. single words which appear in the thesaurus (for alignment purposes) or which contain English morphemes extracted from The Specialized Lexicon found in UMLS and translated in German .</Paragraph>
      <Paragraph position="1"> 2. syntactic patterns: [(ADJ)+ NOUN GEN+] and [ADJ+ NOUN (GEN)+] for German, and all non recursive noun phrases for English.</Paragraph>
      <Paragraph position="2"> Our morpheme list contains 40 elements, some of which are general, -ion, -ung, but the majority of which is specific to the medical domain, ektomie, -itis. The syntactic patterns match nouns which occur with a complement (adjective and/or genitive structures). The German sequence problematosen Gebieten der Chirurgie is then defined as a candidate term, when the English translation problematic fields of surgery is composed of two candidate terms: problematic fields and surgery.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
6.2 Term alignment
</SectionTitle>
      <Paragraph position="0"> Our algorithm allows alignment of a sequence of candidate terms, and follows the one proposed in (Hull, 1997). We first try to align candidate terms, and then test if a longer unit, composed of several candidate terms, improve the alignment score. A unit is extended if and only if the next contiguous candidate term is a prepositional phrase, the relaxation of this constraint introducing too much noise. The extension stops when the score is lower than the score of the &amp;quot;non-extended term&amp;quot;. For instance, an alignment score is computed for [problematosen Gebieten der Chirurgie] and [problematic fields]. Then the English term is extended to [[problematic fields] of [surgery]], which provides a better alignment score, and is then kept. In this particular example, neither the German nor the English units can be further extended, since the German term occurs at the end of a sentence and the English unit is not followed by a prepositional phrase. The German candidate problematosen Gebieten der Chirurgie is thus finally aligned with the English candidate problematic fields of surgery.</Paragraph>
      <Paragraph position="1"> Most German compounds, decomposed for word alignment purposes, are aligned with English terms corresponding to a sequence adjective+noun (Nierenfunktion/renal function) or noun+of+noun (Lebensqualitaet/quality of live). Correspondences between acronyms and translated developed forms can also be found (Nierenzellcarcinom/RCC). In practice, no unit composed of three candidate terms is found. The longest units are generated by German candidate term with a genitive structure (Plattenepithelcarcinom des Oesophagus/squamous cell esophageal cancer).</Paragraph>
      <Paragraph position="2"> We manually extracted 150 candidate terms with their translation for evaluating our procedure.</Paragraph>
      <Paragraph position="3"> Table 4 shows precision and recall for our method. If the first 5 candidates are retained, the F1-score reaches 80%. Precision is always higher than recall, which can be explained by the fact that the reference terms were extracted manually when the automatic extraction can propose incorrect units due to chunking errors.</Paragraph>
    </Section>
  </Section>
</Paper>