<?xml version="1.0" standalone="yes"?>
<Paper uid="C02-1166">
  <Title>An approach based on multilingual thesauri and model combination for bilingual lexicon extraction</Title>
  <Section position="2" start_page="0" end_page="0" type="abstr">
    <SectionTitle>
Abstract
</SectionTitle>
    <Paragraph position="0"> This paper focuses on exploiting different models and methods in bilingual lexicon extraction, either from parallel or comparable corpora, in specialized domains.</Paragraph>
    <Paragraph position="1"> First, a special attention is given to the use of multilingual thesauri, and different search strategies based on such thesauri are investigated. Then, a method to combine the different models for bilingual lexicon extraction is presented. Our results show that the combination of the models significantly improves results, and that the use of the hierarchical information contained in our thesaurus, UMLS/MeSH, is of primary importance. Lastly, methods for bilingual terminology extraction and thesaurus enrichment are discussed.</Paragraph>
    <Paragraph position="2"> Introduction The growing availability of comparable corpora, through the Internet or via distribution agencies providing newspapers articles in different languages, has led researchers to develop methods to extract bilingual lexicons from such corpora, in order to enrich existing bilingual dictionaries, and help cross the language barrier for cross-language information retrieval. The results obtained thus far on comparable corpora, even though encouraging, are not completely satisfactory yet. (Fung, 2000) reports, for the Chinese-English language pair an accuracy of 76% to find the correct translation in the top 20 candidates, a figure we do not believe to be good enough to consider manual revision.</Paragraph>
    <Paragraph position="3"> Furthermore, the evaluation is carried out on 40 English words only. (Rapp, 1999) reaches 89% on the German-English language pair, when considering the top 10 candidates. If this figure is rather high, it was obtained on a set of 100 German words, which, even though not explicit in Rapp's paper, seem to be high frequency words, for which accurate and reliable statistics can be obtained.</Paragraph>
    <Paragraph position="4"> We want to show in this paper how previously proposed methods can be extended to and improved for specialized domains. In particular we will focus on the use and enrichment of multilingual thesauri, which, even though partially related they may be to the texts under consideration, are nonetheless an available and valuable resource for the task. We rely in this work on two main linguistic resources: a general bilingual dictionary (available through the ELRA consortium1) and a specialized multilingual thesaurus (the Medical Subject Headings, MeSH, provided through the metathesaurus Unified Medical Language System, UMLS2). Without anticipating too much on the linguistic preprocessing we use, it has to be noted that, unless otherwise stated, when we speak of a &amp;quot;word&amp;quot; we refer to a single (as opposed to compound), lexical word (as opposed to stop word). All our examples and experiments use the (German, English) language pair.</Paragraph>
    <Paragraph position="5"> 1 Context vectors: a basic building block Bilingual lexicon extraction from non-parallel but comparable corpora has been studied by a number of researchers, (Peters, 1995; Tanaka, 1996; Shahzad 1999; Rapp, 1999; Fung, 2000) among others. Their work relies on the assumption that if two words are mutual  translations, then their more frequent collocates (taken here in a very broad sense) are likely to be mutual translations as well. Based on this assumption, a standard approach consists in building context vectors, for each source and target word, which aim at capturing the most significant collocates. The target context vectors are then translated using a general bilingual dictionary, and compared with the source context vectors.</Paragraph>
    <Paragraph position="6"> Our implementation of this strategy relies on the following steps, and follows the one given in (Rapp, 1999): - for each word w, build a context vector by considering all the words occurring in a window encompassing several sentences that is run through the corpus. Each word i in the context vector of w is then weighted with a measure of its association with w. We chose the log-likelihood ratio test, (Dunning, 1993), to measure this association - the context vectors of the target words are then translated with our general bilingual dictionary, leaving the weights unchanged (when several translations are proposed by the dictionary, we consider all of them with the same weight) - the similarity of each source word s, for each target word t, is computed on the basis of the cosine measure - the similarities are then normalized to yield a probabilistic translation lexicon, P(t|s).</Paragraph>
    <Paragraph position="7"> To illustrate the above steps, we give here the first 5 words of the context vector of the German word Leber (liver), together with their associated score: (Transplantation 138, Resektion 53, Metastase 41, Arterie 38, cirrhose 26). Once this context vector translated, the English top five becomes: (transplant 138, tumour 48, secondary 42, metastatis 41, artery 38). One can note that the German term Resektion was not found in our bilingual dictionary, and thus not translated.</Paragraph>
    <Paragraph position="8"> However, the translated context vector contains English terms characteristic of the co-occurrence pattern for liver, allowing one to associate the two words Leber and liver. We refer to the above method as the standard method.</Paragraph>
    <Paragraph position="9"> 2 Lexical translation model based on a multilingual thesaurus A multilingual thesaurus bridges several languages through cross-language correspondences between concept classes (a concept class in the thesaurus links alternative names and views of the same concept together.</Paragraph>
    <Paragraph position="10"> For example, concept class C0751521, for which the main entry is splenic neoplasms, also contains cancer of spleen, splenic cancer, spleen neoplasms). The correspondence can be one-toone, i.e. the same concept classes are used in the different languages, or many-to-many, i.e.</Paragraph>
    <Paragraph position="11"> different concept classes are used in different languages, and a given concept class in a given language corresponds to zero, one or more concept classes in the other languages. The correspondence between concept classes across languages helps us write the probability P(t|s) of selecting word t as a translation of word s in the following general way, where C represents a multilingual concept class in MeSH (we omit the derivation, which is mainly technical, and uses the fact that the correspondence between concept classes in MeSH is one-to-one):</Paragraph>
    <Paragraph position="13"> a formula which can be interpreted as follows: from a source word s of the source corpus, select a (interlingual) concept class in the thesaurus, according to P(C|s), then generate a target word t of the target corpus from the concept class and the source word, according to P(t|C,s). The dependence on s in the last probability distribution (P(t|C,s)) allows one to privilege one possible lexicalization of a given concept class. It could be used, for example, to choose spleen neoplasms from concept class C0751521 as the translation of Milztumoren. However, since such a distinction between the different lexicalizations of a given concept is beyond the scope of the current paper, we make the additional simplifying assumption that, given a concept class, the target word t is independent of the source word s, which leads to the simplified formula:</Paragraph>
    <Paragraph position="15"> The above equation views the thesaurus as a trellis linking source and target words. As such, given probabilities P(C|s) and P(t|C) (see section 2.2 for the way we estimate these probabilities), there are several ways to compute an association score between source and target words. The most obvious one is to carry the sum over all concept classes, or a large subset of them, as indicated by the formula. We refer to this method as the complete search. However, if the relation between a word and a concept class is not significant, the complete search has the disadvantage of bringing noisy data in the estimation of P(t|s). An alternate solution is to select just the concept class which maximizes the association between s and t. Because of its analogy with the Viterbi algorithm, we refer to this method as the Viterbi search.</Paragraph>
    <Paragraph position="16"> Nevertheless, neither the complete nor the Viterbi search makes use of the hierarchical information contained in the thesaurus, which is, in the above formulations, mainly viewed as a specialized lexicon. We present below a third search strategy which directly makes use of the structure of the thesaurus. For reasons that will become clear, we call this strategy the subtree search.</Paragraph>
    <Section position="1" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.1 The subtree search
</SectionTitle>
      <Paragraph position="0"> Complete search and the Viterbi search represent two extreme ways of making use of the thesaurus since they consider either all or only one of the concept classes it contains. In order to find a way in-between and to focus on a subset of interesting concept classes, we first select for each source word s the n best concept classes in the thesaurus, i.e. the first n concept classes according to the probability distribution P(C|s). We then extend this set of classes by adding new classes using the hierarchy in the thesaurus.</Paragraph>
      <Paragraph position="1"> Intuitively, if two or more classes in the selected subset have the same parent class, then the source word is likely to be related to this parent as well as to the classes themselves, since the parent is the direct node &amp;quot;conceptually'' linking the classes. For example, if a source word s selects the two classes Hepatitis and Cirrhosis, then s is likely to be related to Liver Diseases, the parent class. We make use of this intuition in the following way: for each pair of classes from the set of the n best classes associated with source word s, select the subtree formed by the classes, their common ancestor, and all the nodes that appear between the classes and their ancestor.</Paragraph>
      <Paragraph position="2"> This algorithm provides a set of subtrees from the 15 sub-thesauri corresponding to the 15 main categories of the MESH classification (MeSH, rather than being a single thesaurus, contains 15 different sub-thesauri, artificially related through a common root node in UMLS. We do not make use of this distinction in the complete and Viterbi methods, but use it for the subtree search to avoid linking classes via the artificial root concept). One can also note that the above algorithm suggests a way to identify polysemous words, or words used through different points of view, via the different sub-thesauri they select subtrees from. This refinement, which should lead to more fine-grained bilingual lexicons, will be the focus of future research.</Paragraph>
      <Paragraph position="3"> The set of classes contained in the subtrees is then used in equation (2) to derive associations between source and target words.</Paragraph>
    </Section>
    <Section position="2" start_page="0" end_page="0" type="sub_section">
      <SectionTitle>
2.2 Linking words and concept classes
</SectionTitle>
      <Paragraph position="0"> The estimation of the probability distributions P(C|s) and P(t|C) used in equation (2) can be easily carried out by resorting once again to context vectors. Indeed, if a word of the corpus is similar to a term present in a concept class, then they are likely to share similar contexts and have similar context vectors. We thus extend the notion of context vectors to concept classes, and rely again on the cosine measure to compute similarities between words and concept classes.</Paragraph>
      <Paragraph position="1"> The probability distributions P(C|s) and P(t|C) are finally derived through normalization.</Paragraph>
      <Paragraph position="2"> To build a context vector for a concept class, we first build the context vector of each term the class contains. For single-word units, we directly rely on the context vectors extracted in section 1.</Paragraph>
      <Paragraph position="3"> If the term is a multi-word unit, as liver disease, we consider the conjunction of the context vectors of each word in the unit, normalizing the weights by the number of words in the unit. For example, the context vector for liver disease will contain only those words that appear in the context of both liver and disease, since the whole unit is a narrower concept than its constituents. We then take the disjunction of all context vectors of each entry term in the class, normalizing the weights by the number of terms in the class, to build the context vector of each concept class.</Paragraph>
      <Paragraph position="4"> The following example illustrates the complete process: the German spelling variant Actinomykose is used in our corpus in addition to Aktinomykose, which is the only form listed in the UMLS class C0001261; nevertheless, our process associates C0001261 as the closest class to Actinomykose and actinomycosis (English), and retain them as translation candidates.</Paragraph>
    </Section>
  </Section>
class="xml-element"></Paper>
Download Original XML