<?xml version="1.0" standalone="yes"?> <Paper uid="W02-0902"> <Title>Learning a Translation Lexicon from Monolingual Corpora</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 2 Clues </SectionTitle> <Paragraph position="0"> This section will describe clues that enable us to find translations of words of the two monolingual corpora. We will examine each clue separately.</Paragraph> <Paragraph position="1"> The following clues are considered: Identical words -- Two languages contain a certain number of identical words, such as computer or email.</Paragraph> <Paragraph position="2"> Similar Spelling -- Some words may have very similarly written translations due to common language roots (e.g. Freund and friend) or adopted words (e.g. Webseite and website).</Paragraph> <Paragraph position="3"> Context -- Words that occur in a certain context window in one language have translations that are likely to occur in a similar context window in the other language (e.g.</Paragraph> <Paragraph position="4"> Wirtschaft co-occurs frequently with Wachstum, as economy does with growth).</Paragraph> <Paragraph position="5"> Similarity -- Words that are used similarly in one language should have translations that are also similar (e.g. Wednesday is similar to Thursday as Mittwoch is similar to Donnerstag).</Paragraph> <Paragraph position="6"> Frequency -- For comparable corpora, frequent words in one corpus should have translations that are frequent in the other corpus (e.g. 
for news corpora, government is more frequent than flower, as its translation Regierung is more frequent than Blume).</Paragraph> <Paragraph position="7"> We will now look in detail at how these clues may contribute to building a German-English translation lexicon.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.1 Identical words </SectionTitle> <Paragraph position="0"> Due to cultural exchange, a large number of words that originate in one language are adopted by others. Recently, this phenomenon can be seen with words such as Internet or Aids.</Paragraph> <Paragraph position="1"> These terms may be adopted verbatim, or changed by well-established rules. For instance, immigration (German and English) has the Portuguese translation immigração, as many words ending in -tion have translations with the same spelling except for the ending changed to -ção.</Paragraph> <Paragraph position="2"> We examined the German words in our lexicon and tried to find English words that have the exact same spelling. Surprisingly, we could count a total of 976 such words. When checking them against a benchmark lexicon, we found these mappings to be 88% correct.</Paragraph> <Paragraph position="3"> The correctness of word mappings acquired in this fashion depends highly on word length.</Paragraph> <Paragraph position="4"> This is illustrated in Table 1: While identical 3-letter words are only translations of each other 60% of the time, this is true for 98% of 10-letter words. Clearly, for shorter words, the chance that an identically spelled word exists in the other language by accident is much higher. This includes words such as fee, ton, art, and tag.</Paragraph> <Paragraph position="5"> Table 1 (caption): Percentage of identically spelled words that are in fact translations of each other. The accuracy of this assumption depends highly on the length of the words. Knowing this allows us to restrict the word length to be able to increase the accuracy of the collected word pairs. 
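The identical-spelling clue with a length restriction can be sketched in a few lines of Python (an illustrative sketch; the function and variable names are not from the paper):

```python
def identical_word_pairs(vocab_de: set[str], vocab_en: set[str],
                         min_length: int = 1) -> set[str]:
    """Collect words spelled identically in both vocabularies,
    optionally restricting to longer words for higher accuracy."""
    return {w for w in vocab_de.intersection(vocab_en)
            if len(w) >= min_length}
```

Raising `min_length` trades coverage for precision, mirroring the length effect discussed above.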
For instance, by relying only on words of length at least 6, we could collect 622 word pairs with 96% accuracy. In our experiments, however, we included all the word pairs.</Paragraph> <Paragraph position="6"> As already mentioned, there are some well-established transformation rules for the adoption of words from a foreign language. For German to English, this includes replacing the letters k and z by c and changing the ending -tät to -ty. Both these rules can be observed in the word pair Elektrizität and electricity.</Paragraph> <Paragraph position="7"> By using these two rules, we can gather 363 additional word pairs, of which 330, or 91%, are in fact translations of each other. The combined total of 1339 (976+363) word pairs is set aside and forms the seed for some of the following steps.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Similar Spelling </SectionTitle> <Paragraph position="0"> When words are adopted into another language, their spelling might change slightly in a manner that cannot simply be generalized into a rule. Observe, for instance, website and Webseite. This is even more the case for words that can be traced back to common language roots, such as friend and Freund, or president and Präsident.</Paragraph> <Paragraph position="1"> Still, these words -- often called cognates -- maintain a very similar spelling. This can be defined as differing in very few letters. This measurement can be formalized as the number of letters common in sequence between the two words, divided by the length of the longer word.</Paragraph> <Paragraph position="2"> The example word pair friend and freund shares 5 letters (fr-e-nd), and both words have length 6, hence their spelling similarity is 5/6, or 0.83. This measurement is called the longest common subsequence ratio [Melamed, 1995]. 
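The longest common subsequence ratio can be sketched in Python (a minimal illustration, not the paper's implementation; lowercasing is our assumption so that friend and Freund match):

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings,
    computed row by row with dynamic programming."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        cur = [0]
        for j, cb in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if ca == cb else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[len(b)]

def lcsr(a: str, b: str) -> float:
    """Longest common subsequence ratio [Melamed, 1995]:
    common letters in sequence, divided by the longer word's length."""
    return lcs_length(a.lower(), b.lower()) / max(len(a), len(b))
```

For the example pair, `lcsr("friend", "Freund")` gives 5/6, matching the 0.83 in the text.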
In related work, string edit distance (or Levenshtein distance) has been used [Mann and Yarowsky, 2001].</Paragraph> <Paragraph position="3"> With this computational means at hand, we can now measure the spelling similarity between every German and English word, and sort possible word pairs accordingly. By going through this list starting at the top, we can collect new word pairs. We do this in a greedy fashion -- once a word is assigned to a word pair, we do not look for another match. Table 2 gives the top 24 word pairs generated by this algorithm.</Paragraph> <Paragraph position="4"> Table 2 (caption): Word pairs collected by matching words with most similar spelling in a greedy fashion.</Paragraph> <Paragraph position="5"> The applied measurement of spelling similarity does not take into account that certain letter changes (such as z to s, or dropping of the final e) are less harmful than others. Tiedemann [1999] explores the automatic construction of a string similarity measure that learns which letter changes are more likely to occur between cognates of two languages. This measure is trained, however, on parallel sentence-aligned text, which is not available here.</Paragraph> <Paragraph position="6"> Obviously, the vast majority of word pairs cannot be collected this way, since their spelling shows no resemblance at all. For instance, Spiegel and mirror share only one vowel, which is rather accidental.</Paragraph> </Section> <Section position="3" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.3 Similar Context </SectionTitle> <Paragraph position="0"> If our monolingual corpora are comparable, we can assume that a word that occurs in a certain context should have a translation that occurs in a similar context.</Paragraph> <Paragraph position="1"> Context, as we understand it here, is defined by the frequencies of context words in surrounding positions. 
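The greedy pairing procedure described above can be sketched as follows (an illustrative sketch; the word lists, scoring function, and threshold are placeholders, not values from the paper):

```python
from typing import Callable

def greedy_pairs(de_words: list[str], en_words: list[str],
                 score: Callable[[str, str], float],
                 threshold: float = 0.0) -> list[tuple[str, str, float]]:
    """Sort all candidate pairs by score, then collect pairs greedily:
    once a word is part of a pair, it is never matched again."""
    candidates = sorted(
        ((score(d, e), d, e) for d in de_words for e in en_words),
        reverse=True)
    used_de, used_en, pairs = set(), set(), []
    for s, d, e in candidates:
        if s >= threshold and d not in used_de and e not in used_en:
            used_de.add(d)
            used_en.add(e)
            pairs.append((d, e, s))
    return pairs
```

With a spelling-similarity score such as the longest common subsequence ratio as `score`, this yields spelling-based pairs; the same loop applies unchanged to the context-vector and similarity-vector scores of the later sections.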
This local context has to be translated into the other language, and we can then search for the word with the most similar context.</Paragraph> <Paragraph position="2"> This idea has already been investigated in earlier work. Rapp [1995, 1999] proposes to collect counts over words occurring in a four-word window around the target word. For each occurrence of a target word, counts are collected over how often certain context words occur in the two positions directly ahead of the target word and the two following positions. The counts are collected separately for each position and then entered into a context vector with a dimension for each context word in each position. Finally, the raw counts are normalized, so that for each of the four word positions the vector values add up to one. Vector comparison is done by adding the absolute differences of all components.</Paragraph> <Paragraph position="3"> Fung and Yee [1998] propose a similar approach: They count how often another word occurs in the same sentence as the target word.</Paragraph> <Paragraph position="4"> The counts are then normalized using the tf/idf method, which is often used in information retrieval [Jones, 1979].</Paragraph> <Paragraph position="5"> The need for translating the context poses a chicken-and-egg problem: If we already have a translation lexicon, we can translate the context vectors. But we can only construct a translation lexicon with this approach if we are already able to translate the context vectors.</Paragraph> <Paragraph position="6"> Theoretically, it is possible to use these methods to build a translation lexicon from scratch [Rapp, 1995]. The number of possible mappings has complexity O(n!), and the computing cost of each mapping has quadratic complexity O(n^2).</Paragraph> <Paragraph position="7"> For a large number of words n -- at least more than 10,000, maybe more than 100,000 -- the combined complexity becomes prohibitively expensive. 
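Rapp's position-wise context vectors and the comparison by absolute differences can be sketched roughly as follows (a simplified illustration over a tokenized corpus; the names are ours, not Rapp's):

```python
from collections import Counter

def context_vector(corpus: list[str], target: str,
                   vocab: list[str]) -> list[float]:
    """Rapp-style context vector: counts of vocab words in the two
    positions before and the two positions after each occurrence of
    `target`, normalized so each of the four positions sums to one."""
    counts = [Counter() for _ in range(4)]  # offsets -2, -1, +1, +2
    vocab_set = set(vocab)
    for i, tok in enumerate(corpus):
        if tok != target:
            continue
        for slot, off in enumerate((-2, -1, 1, 2)):
            j = i + off
            if j in range(len(corpus)) and corpus[j] in vocab_set:
                counts[slot][corpus[j]] += 1
    vec = []
    for c in counts:
        total = sum(c.values()) or 1  # avoid division by zero
        vec.extend(c[w] / total for w in vocab)
    return vec

def l1_distance(a: list[float], b: list[float]) -> float:
    """Compare vectors by summing the absolute differences of all components."""
    return sum(abs(x - y) for x, y in zip(a, b))
```

In the bootstrapping setting described below, `vocab` would be the seed words, so a vector can be "translated" by relabeling its dimensions with the seed words' known translations.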
Because of this, both Rapp and Fung focus on expanding an existing large lexicon to add a few novel terms.</Paragraph> <Paragraph position="8"> Clearly, a seed lexicon to bootstrap these methods is needed. Fortunately, we have outlined in Section 2.1 how such a seed lexicon can be obtained: by finding words spelled identically in both languages.</Paragraph> <Paragraph position="9"> We can then construct context vectors that contain information about how a new unmapped word co-occurs with the seed words. This vector can be translated into the other language, since we already know the translations of the seed words.</Paragraph> <Paragraph position="10"> Finally, we can look for the best matching context vector in the target language, and decide upon the corresponding word to construct a word mapping.</Paragraph> <Paragraph position="11"> Again, as in Section 2.2, we have to compute all possible word -- or context vector -- matches. We then collect the best word matches in a greedy fashion. Table 3 displays the top 15 word pairs generated by this algorithm. The context vectors are constructed in the way proposed by Rapp [1999], with the difference that we collect counts over a four-noun window, not a four-word window, by dropping all intermediate words.</Paragraph> <Paragraph position="12"> Table 3 (caption): Word pairs collected by matching words with most similar context vectors in a greedy fashion.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.4 Preserving Word Similarity </SectionTitle> <Paragraph position="0"> Intuitively, it is obvious that pairs of words that are similar in one language should have translations that are similar in the other language.</Paragraph> <Paragraph position="1"> For instance, Wednesday is similar to Thursday as Mittwoch is similar to Donnerstag. 
Or: dog is similar to cat in English, as Hund is similar to Katze in German.</Paragraph> <Paragraph position="2"> The challenge is now to come up with a quantifiable measurement of word similarity. One strategy is to define two words as similar if they occur in a similar context. Clearly, this is the case for Wednesday and Thursday, as well as for dog and cat.</Paragraph> <Paragraph position="3"> Exactly this similarity measurement is used in the work by Diab and Finch [2000]. Their approach to constructing and comparing context vectors differs significantly from the methods discussed in the previous section.</Paragraph> <Paragraph position="4"> For each word in the lexicon, the context vector consists of co-occurrence counts with respect to 150 so-called peripheral tokens, basically the most frequent words. These counts are collected for each position in a 4-word window around the word in focus. This results in a 600-dimensional vector.</Paragraph> <Paragraph position="5"> Instead of comparing these co-occurrence counts directly, the Spearman rank order correlation is applied: For each position the tokens are compared in frequency and the frequency count is replaced by the frequency rank -- the most frequent token count is replaced by 1, the least frequent by n = 150. The similarity of two context vectors a = (ai) and b = (bi) is then defined by:3</Paragraph> <Paragraph position="6"> R(a, b) = 1 - 6 * sum_i (ai - bi)^2 / (4n(n^2 - 1))</Paragraph> <Paragraph position="7"> The result of all this is a matrix with similarity scores between all German words, and a second one with similarity scores between all English words. Such matrices could also be constructed using the definitions of context we reviewed in the previous section. 
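Taking the footnoted corrections into account (squared rank differences, and a factor 4 in the denominator for the four 150-token positions), the rank-based similarity can be sketched as follows (an illustrative sketch, not the authors' code; ties are broken by position for simplicity):

```python
def ranks(counts: list[float]) -> list[int]:
    """Replace each count by its frequency rank: the largest count
    becomes rank 1, the smallest rank n (ties broken by position)."""
    order = sorted(range(len(counts)), key=lambda i: -counts[i])
    r = [0] * len(counts)
    for rank, i in enumerate(order, 1):
        r[i] = rank
    return r

def spearman_similarity(a: list[float], b: list[float], n: int = 150) -> float:
    """R(a, b) = 1 - 6 * sum((a_i - b_i)^2) / (4 * n * (n^2 - 1)),
    with counts replaced by ranks separately within each n-token
    position block of the concatenated context vectors."""
    ra, rb = [], []
    for start in range(0, len(a), n):
        ra += ranks(a[start:start + n])
        rb += ranks(b[start:start + n])
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (4 * n * (n ** 2 - 1))
```

As with standard Spearman correlation, identical rankings in every position give a similarity of 1, and fully reversed rankings give -1.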
The important point here is that we have generated a similarity matrix, which we will now use to find new translation word pairs.</Paragraph> <Paragraph position="8"> Again, as in the previous Section 2.3, we assume that we will already have a seed lexicon.</Paragraph> <Paragraph position="9"> Footnote 3: In the given formula we fixed two mistakes of the original presentation [Diab and Finch, 2000]: the square of the differences is used, and the denominator contains the additional factor 4, since essentially four 150-word vectors are compared.</Paragraph> <Paragraph position="10"> For a new word we can look up its similarity scores to the seed words, thus creating a similarity vector. Such a vector can be translated into the other language -- recall that the dimensions of the vector are the similarity scores to seed words, for which we already have translations.</Paragraph> <Paragraph position="11"> The translated vector can be compared to other vectors in the second language.</Paragraph> <Paragraph position="12"> As before, we search greedily for the best matching similarity vectors and add the corresponding words to the lexicon.</Paragraph> </Section> <Section position="5" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.5 Word Frequency </SectionTitle> <Paragraph position="0"> Finally, another simple clue is the observation that in comparable corpora, the same concepts should be used with similar frequencies. Even if the most frequent word in the German corpus is not necessarily the translation of the most frequent English word, its translation should at least be very frequent. Table 4 illustrates the situation with our corpora. It contains the top 10 German and English words, together with the frequency ranks of their best translations. For both languages, 4 of the 10 words have translations that also rank in the top 10.</Paragraph> <Paragraph position="1"> Clearly, simply aligning the nth most frequent German word with the nth most frequent English word is not a viable strategy. 
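One simple way to turn the frequency clue into a score is to compare relative frequencies; the sketch below is our illustration (the symmetric min/max bounding is an assumption, not the paper's exact formulation):

```python
def frequency_ratio(freq_de: int, size_de: int,
                    freq_en: int, size_en: int) -> float:
    """Ratio of word frequencies, normalized by corpus sizes.
    Values close to 1.0 suggest the two words are comparably frequent."""
    rel_de = freq_de / size_de
    rel_en = freq_en / size_en
    # min/max makes the score symmetric and bounded by (0, 1].
    return min(rel_de, rel_en) / max(rel_de, rel_en)
```

Such a score can serve as one signal among the others, rather than as a stand-alone matching criterion.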
In our case, this is additionally hampered by the different orientation of the news sources. The frequent financial terms in the English WSJ corpus (stock, bank, sales, etc.) are rather rare in the German corpus.</Paragraph> <Paragraph position="2"> For most words, especially for more comparable corpora, there is a considerable correlation between the frequency of a word and that of its translation. Our frequency measurement is defined as the ratio of the word frequencies, normalized by the corpus sizes.</Paragraph> </Section> </Section> </Paper>