<?xml version="1.0" standalone="yes"?> <Paper uid="C02-2020"> <Title>Looking for candidate translational equivalents in specialized, comparable corpora</Title> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 3 Collecting comparable medical corpora </SectionTitle> <Paragraph position="0"> The material for the present experiments consists of comparable medical corpora in French and English and a French-English medical lexicon (Fung and Yee (1998) call its words 'seed words').</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.1 'Signs and Symptoms' Corpora </SectionTitle> <Paragraph position="0"> We selected two medical corpora from Internet catalogs of medical web sites. Some of these catalogs index web pages with controlled vocabulary keywords taken from the MeSH thesaurus (www.nlm.nih.gov/mesh/meshhome), among which CISMeF (French language medical web sites, www.chu-rouen.fr/cismef) and CliniWeb (English language medical web sites, www.ohsu.edu/cliniweb). The MeSH thesaurus is hierarchically structured, so that it is easy to select a subfield of medicine. We chose the subtree under the MeSH concept 'Pathological Conditions, Signs and Symptoms' ('C23'), which is the best represented in CISMeF.</Paragraph> <Paragraph position="1"> We compiled the 2,338 URLs indexed by CISMeF under that concept, and downloaded the corresponding pages, plus the pages directly linked to them, so that framesets or tables of contents be expanded. 9,787 pages were converted into plain text from HTML or PDF, yielding a 602,484-word corpus (41,295 unique words). The initial pages should all be in French; the additional pages sometimes happen to be foreign language versions of the initial ones. In the same line, we collected 2,019 pages under 921 URLs indexed by CliniWeb, and obtained a 608,320-word English medical corpus (32,919 unique words).</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 3.2 Base bilingual medical lexicon </SectionTitle> <Paragraph position="0"> A base French-English lexicon of simple words was compiled from several sources. On the one hand, an online French medical dictionary (Dictionnaire Medical Masson, www.atmedica.com) which includes English translations of most of its entries. On the other hand, some international medical terminologies which are available in both English and French. We obtained these from the UMLS metathesaurus, which includes French versions of MeSH, WHOART, ICPC and their English counterparts (www.nlm.nih.gov/research/umls). The resulting lexicon (see excerpt in table 1) contains 18,437 entries, mainly specialized medical terms.</Paragraph> <Paragraph position="1"> When several translations of the same term are available, they are all listed.</Paragraph> </Section> </Section> <Section position="5" start_page="0" end_page="11" type="metho"> <SectionTitle> 4 Methods </SectionTitle> <Paragraph position="0"> The basis of the method is to find the target words that have the most similar distributions with a given source word. We explain how distributional behavior is approximated through context vectors, how context vectors are transferred into target context vectors, and how context vectors are compared.</Paragraph> <Section position="1" start_page="0" end_page="11" type="sub_section"> <SectionTitle> 4.1 Computing context vectors </SectionTitle> <Paragraph position="0"> Each input corpus is segmented at non-alphanumeric characters. 
<Paragraph position="2"> The context of occurrence of each word is then approximated by the bag of words that occur within a window of N words around any occurrence of that 'pivot' word. In the experiments reported here, N was set to 3 (i.e., a seven-word window) to approximate syntactic dependencies. The context vector of a pivot word $j$ is the vector of all words in the corpus, where each word $i$ is represented by its number of occurrences $occ_{ij}$ in that bag of words.</Paragraph> <Paragraph position="5"> A context vector is similar to a document (the document that would be produced by concatenating the windows around all the occurrences of the given pivot word). Therefore, weights that are used for words in documents can be tested here in order to eliminate word-frequency effects and to emphasize significant word pairs. Besides the simple context frequency $occ_{ij}$, two additional, alternative weights are computed: tf.idf and log likelihood.</Paragraph> <Paragraph position="8"> We shall see below that, in fact, only a subset of the corpus words is kept in each vector.</Paragraph> <Paragraph position="9"> The formulas we used to compute tf.idf are the following: the normalized frequency of a word $i$ in a context $j$ is $tf_{ij} = occ_{ij} / \max_{k,l} occ_{kl}$, where $\max_{k,l} occ_{kl}$ is the maximum number of cooccurrences of any two words in the corpus; $idf_i = \log(C / c_i) + 1$, where $C$ is the total number of contexts in the corpus and $c_i$ is the total number of contexts in which $i$ occurs in the corpus. The tf.idf weight of word $i$ in context $j$ is then $tf_{ij} \times idf_i$.</Paragraph> <Paragraph position="14"> For the computation of the log likelihood ratio, we used the following formula from Dunning (posted on the 'corpora' mailing list on 22/7/1997, helmer.hit.uib.no/corpora/1997-2/0148.html): $-2 \log \lambda = 2\,[a \log a + b \log b + c \log c + d \log d - (a+b)\log(a+b) - (a+c)\log(a+c) - (b+d)\log(b+d) - (c+d)\log(c+d) + (a+b+c+d)\log(a+b+c+d)]$, where, for a word pair $(i,j)$, $a = occ_{ij}$, $b$ and $c$ are the numbers of cooccurrences of $i$ and of $j$ with other words, and $d$ is the number of cooccurrences involving neither $i$ nor $j$.</Paragraph> <Paragraph position="17"> At the end of this step, each non-stop word in both corpora has a weighted context vector.</Paragraph> </Section> <Section position="2" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.2 Transferring context vectors </SectionTitle> <Paragraph position="0"> When a translation is sought for a source word, its context vector is transferred into a target-language context vector, relying on the existing bilingual lexicon. Only the words in the bilingual lexicon can be used in the transfer. When several translations are listed, only the first one is added to the target context vector. The result is a target-language context vector which is comparable to 'native' context vectors directly obtained from the target corpus.</Paragraph> <Paragraph position="1"> Let us now be more precise about the context-word space. Since we want to compare context vectors obtained through transfer with native context vectors, these two sorts of vectors should belong to the same space, i.e., range over the same set of context words. A (target) word belongs to this set iff (i) it occurs in the target corpus, (ii) it is listed in the bilingual lexicon, and (iii) (one of) its source counterpart(s) occurs in the source corpus. This set corresponds to the 'seed words' of Fung and Yee (1998). Therefore, the dimension of the target context vectors is reduced to this set of 'cross-language pivot words'. In our experimental setting, 4,963 pivot words are used.</Paragraph> </Section>
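To make the construction and transfer of context vectors concrete, here is a minimal Python sketch under the assumptions stated in this section: cooccurrence counts $occ_{ij}$ are collected in a window of 3 words on each side of the pivot, and a source vector is transferred into the target space through the bilingual lexicon, keeping only cross-language pivot words and only the first listed translation. The function and variable names (context_vectors, transfer, lexicon, pivot_set) are ours, and the raw-frequency weighting stands in for the tf.idf and log likelihood variants.

    from collections import Counter, defaultdict

    def context_vectors(tokens, window=3):
        # occ_ij: for each pivot word j, count every word i occurring within
        # `window` words of an occurrence of j (a 7-word window for window=3).
        vectors = defaultdict(Counter)
        for pos, pivot in enumerate(tokens):
            lo, hi = max(0, pos - window), min(len(tokens), pos + window + 1)
            for k in range(lo, hi):
                if k != pos:
                    vectors[pivot][tokens[k]] += 1
        return vectors

    def transfer(source_vector, lexicon, pivot_set):
        # Transfer a source context vector into the target space: keep only the
        # dimensions whose first listed translation is a cross-language pivot word.
        target_vector = Counter()
        for word, weight in source_vector.items():
            translations = lexicon.get(word, [])
            if translations and translations[0] in pivot_set:
                target_vector[translations[0]] += weight
        return target_vector

With such (illustrative) data structures, a call like transfer(fr_vectors['anxiete'], fr_en_lexicon, pivot_set) yields a vector that lives in the same pivot-word space as the native English vectors and can be compared to them directly.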
<Section position="3" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.3 Computing vector similarity </SectionTitle> <Paragraph position="0"> Given a transferred context vector, a similarity score is computed for each native target vector, and a ranking list is built according to this score.</Paragraph> <Paragraph position="1"> The target words that 'own' the best-ranked target vectors are the words in the target corpus whose distributions with respect to the bilingual pivot words are the most similar to that of the source word; they are considered candidate translational equivalents.</Paragraph> <Paragraph position="2"> We used several similarity metrics for comparing pairs of vectors $V$ and $W$ (of length $n$): Jaccard (Romesburg, 1990) and cosine (Losee, 1998), each combined with the three different weighting schemes. With $k, l, m$ ranging from 1 to $n$: $cos(V,W) = \frac{\sum_k v_k w_k}{\sqrt{\sum_l v_l^2}\,\sqrt{\sum_m w_m^2}}$ and $jac(V,W) = \frac{\sum_k v_k w_k}{\sum_l v_l^2 + \sum_m w_m^2 - \sum_k v_k w_k}$.</Paragraph> </Section> <Section position="4" start_page="11" end_page="11" type="sub_section"> <SectionTitle> 4.4 Experiments </SectionTitle> <Paragraph position="0"> The present work performs a first evaluation of this method in a favorable, controlled setting. It tests, in a 'leave-one-out' style, whether the correct translation of one of the source (French) words in the bilingual lexicon can be found among the target (English) words of this lexicon, based on context vector similarity. To make the similarity measures more reliable, we selected the most frequent words in the English corpus ($N_{occ} > 100$) whose French translations were known in our lexicon. Among these, we chose those whose French translation is also frequent ($N_{occ} > 60$) in the French corpus. This provides us with a test set of 95 French words (i) which are frequent in the French corpus, (ii) of which we know the correct translation, and (iii) such that this translation occurs often in the English corpus. For each of the French test words, we computed a weighted context vector for each of the different weighting measures ($occ_{ij}$, tf.idf, log likelihood). Then, using the above-mentioned similarity measures (cosine, Jaccard), we compared this weighted vector with the context vectors of the cross-language pivot words computed from the English corpus. We then produced a ranked list of the top translational equivalents and tested whether the expected translation could be differentiated from other well-known domain words. For the evaluation, we computed the rank of the expected translation of each test word and synthesized these ranks as a percentile rank distribution.</Paragraph> </Section> </Section>
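The ranking and leave-one-out evaluation steps can be sketched as follows, again in illustrative Python: the cosine measure is implemented as given in section 4.3 (the Jaccard variant would be analogous), every native target vector is scored against the transferred vector, and the rank of the known translation is recorded. Names such as rank_candidates and rank_of_expected are ours, not those of the original implementation.

    import math

    def cosine(v, w):
        # Cosine similarity between two sparse vectors stored as dicts of weights.
        num = sum(v[k] * w.get(k, 0.0) for k in v)
        den = math.sqrt(sum(x * x for x in v.values())) * math.sqrt(sum(x * x for x in w.values()))
        return num / den if den else 0.0

    def rank_candidates(transferred, native_vectors):
        # Score every native target vector against the transferred vector and sort
        # target words by decreasing similarity: the best-ranked words are the
        # candidate translational equivalents.
        scored = [(word, cosine(transferred, vec)) for word, vec in native_vectors.items()]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    def rank_of_expected(transferred, native_vectors, expected):
        # 1-based rank of the known translation, used to build the percentile
        # rank distribution over the 95 test words.
        for rank, (word, _score) in enumerate(rank_candidates(transferred, native_vectors), start=1):
            if word == expected:
                return rank
        return None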
<Section position="6" start_page="11" end_page="11" type="metho"> <SectionTitle> 5 Initial Results </SectionTitle> <Paragraph position="0"> Table 2 shows example results for the French words anxiete and infection with different weightings and similarity measures. For reasons of space, we only print out the top 5 ranked words. Rank (R) refers to the performance of our program, with a value of 1 meaning that the correct translation of the input French word was found as the first candidate.

Meas.  Weight   Fr word    En word    R  Top 5 ranked candidate translations
Cos.   occ      anxiete    anxiety    1  anxiety .55, depression .45, medication .36, insomnia .36, memory .34
Cos.   tf.idf   anxiete    anxiety    1  anxiety .54, depression .41, eclipse .33, medication .29, psychiatrist .29
Cos.   loglike  anxiete    anxiety    1  anxiety .56, depression .43, eclipse .37, psychiatrist .36, dysthymia .33
Jac.   occ      anxiete    anxiety    2  memory .21, anxiety .21, insomnia .19, confusion .19, psychiatrist .18
Jac.   tf.idf   anxiete    anxiety    1  anxiety .21, psychiatrist .17, confusion .15, memory .14, phobia .14
Jac.   loglike  anxiete    anxiety    1  anxiety .26, psychiatrist .19, memory .15, phobia .14, depressed .14
Cos.   occ      infection  infection  2  infected .55, infection .52, neurotropic .47, homosexual .43
Cos.   tf.idf   infection  infection  3  infected .56, neurotropic .49, infection .48, aids .45, homosexual .41
Cos.   loglike  infection  infection  2  infected .67, infection .55, neurotropic .53, aids .48, homosexual .48
Jac.   occ      infection  infection  1  infection .33, aids .21, tract .17, positive .16, prevention .15
Jac.   tf.idf   infection  infection  1  infection .27, aids .24, positive .17, hiv .15, virus .15
Jac.   loglike  infection  infection  1  infection .38, aids .27, tract .18, infected .18, positive .17
</Paragraph> <Paragraph position="9"> The percentile rank distribution (figure 1) showed that, using the combination of $occ_{ij}$ weighting and the Jaccard measure, about 20% of the French test words have their correct translation as the first-ranked word. If we look at the best-ranked words, we find that they have a strong thematic relation: e.g., anxiety, depression, psychiatrist, phobia, or infection, infected, aids, homosexual.</Paragraph> </Section> </Paper>