<?xml version="1.0" standalone="yes"?> <Paper uid="P98-1069"> <Title>An IR Approach for Translating New Words from Nonparallel, Comparable Texts</Title> <Section position="3" start_page="0" end_page="415" type="metho"> <SectionTitle> 2 Encountering new words </SectionTitle> <Paragraph position="0"> To improve the performance of a machine translation system, it is often necessary to update its bilingual lexicon, either by human lexicographers or by statistical methods using large corpora. Until recently, statistical bilingual lexicon compilation relied largely on parallel corpora. This is an undesirable constraint at times.</Paragraph> <Paragraph position="1"> In using a broad-coverage English-Chinese MT system to translate some text recently, we discovered that it was unable to translate 流感/liougan, which occurs very frequently in the text.</Paragraph> <Paragraph position="2"> Other words which the system cannot find in its 20,000-entry lexicon include proper names such as the Taiwanese president Lee Teng-Hui and the Hong Kong Chief Executive Tung Chee-Hwa. To our disappointment, we could not locate any parallel texts which include such words, since they have only started to appear frequently in recent months.</Paragraph> <Paragraph position="3"> A quick search on the Web turned up archives of multiple local newspapers in English and Chinese. Our challenge is to find the translation of 流感/liougan and other words from this online nonparallel, comparable corpus of newspaper materials. We chose issues of the English newspaper Hong Kong Standard and the Chinese newspaper Mingpao, from Dec. 12, 1997 to Dec. 31, 1997, as our corpus. The English text contains about 3 MB of text, whereas the Chinese text contains 8.8 MB of 2-byte character text.</Paragraph> <Paragraph position="4"> So both texts are comparable in size. 
Since they are both local mainstream newspapers, it is reasonable to assume that their contents are comparable as well.</Paragraph> <Paragraph position="5"> 3 流感/liougan is associated with flu but not with Africa Unlike in parallel texts, the position of a word in a text does not give us information about its translation in the other language. (Rapp, 1995; Fung and McKeown, 1997) suggest that a content word is closely associated with some words in its context. As a tutorial example, we postulate that the words which appear in the context of 流感/liougan should be similar to the words appearing in the context of its English translation, flu. We can form a vector space model of a word in terms of its context word indices, similar to the vector space model of a text in terms of its constituent word indices (Salton and Buckley, 1988; Salton and Yang, 1973; Croft, 1984; Turtle and Croft, 1992; Bookstein, 1983; Korfhage, 1995; Jones, 1979).</Paragraph> <Paragraph position="6"> The value of the i-th dimension of a word vector W is f if the i-th word in the lexicon appears f times in the same sentences as W.</Paragraph> <Paragraph position="7"> The left columns in Table 1 and Table 2 show the lists of content words which appear most frequently in the context of flu and Africa, respectively. The right column shows those which occur most frequently in the context of 流感. We can see that the context of 流感 is more similar to that of flu than to that of Africa.</Paragraph> <Paragraph position="8"> So the first clue to the similarity between a word and its translation is the number of common words in their contexts. In a bilingual corpus, the &quot;common word&quot; is actually a bilingual word pair. We use the lexicon of the MT system to &quot;bridge&quot; all bilingual word pairs in the corpora. 
These word pairs are used as seed words.</Paragraph> <Paragraph position="9"> We found that the contexts of flu and 流感/liougan share 233 &quot;common&quot; context words, whereas the contexts of Africa and 流感/liougan share only 121 common words, even though the context of flu has 491 unique words and the context of Africa has 328 words.</Paragraph> <Paragraph position="10"> In the vector space model, W[flu] and W[liougan] have 233 overlapping dimensions, whereas there are 121 overlapping dimensions between W[liougan] and W[Africa].</Paragraph> </Section> <Section position="4" start_page="415" end_page="416" type="metho"> <SectionTitle> 5 Using TF/IDF of contextual seed words </SectionTitle> <Paragraph position="0"> The flu example illustrates that the actual ranking of the context word frequencies provides a second clue to the similarity between a bilingual word pair. For example, virus ranks very high for both flu and 流感/liougan and is a strong &quot;bridge&quot; between this bilingual word pair. This leads us to use the term frequency (TF) measure. The TF of a context word is defined as the frequency of the word in the context of W.</Paragraph> <Paragraph position="1"> For example, the TF of virus in the context of flu is 26, and in the context of 流感 it is 147.</Paragraph> <Paragraph position="2"> However, the TF of a word is not independent of its general usage frequency. In an extreme case, the function word the appears most frequently in English texts and would have the highest TF in the context of any W. In our HK-Standard/Mingpao corpus, Hong Kong is the most frequent content word, and it appears everywhere. So in the flu example, we would like to reduce the significance of Hong Kong's TF while keeping that of virus. A common way to account for this difference is to use the inverse document frequency (IDF). 
Among the variants of IDF, we choose the following representation from (Jones, 1979): $\mathrm{IDF} = \log \frac{max_n}{n_i} + 1$, where $max_n$ is the maximum frequency of any word in the corpus and $n_i$ is the total number of occurrences of word i in the corpus. The IDF of virus is 1.81 and that of Hong Kong is 1.23 in the English text. In the Chinese text, the IDF of the word for virus is 1.92 and that of Hong Kong is 0.83. So in both cases, virus is a stronger &quot;bridge&quot; for 流感/liougan than Hong Kong. Hence, for every context seed word i, we assign a word weighting factor (Salton and Buckley, 1988) $w_i = TF_{iW} \times IDF_i$, where $TF_{iW}$ is the TF of word i in the context of word W. The updated vector space model of word W has $w_i$ in its i-th dimension.</Paragraph> <Paragraph position="3"> The ranking of the 20 words in the contexts of 流感/liougan is rearranged by this weighting factor as shown in Table 3.</Paragraph> <Paragraph position="4"> Next, a ranking algorithm is needed to match the unknown word vectors to their counterparts in the other language. A ranking algorithm selects the best target language candidate for a source language word according to direct comparison of some similarity measures (Frakes and Baeza-Yates, 1992).</Paragraph> <Paragraph position="5"> We modify the similarity measure proposed by (Salton and Buckley, 1988) into the following S0:</Paragraph> <Paragraph position="7"> Variants of similarity measures such as the above have been used extensively in the IR community (Frakes and Baeza-Yates, 1992). They are mostly based on the Cosine Measure of two vectors. For different tasks, the weighting factor might vary. For example, if we add the IDF into the weighting factor, we get the following S1:</Paragraph> <Paragraph position="9"> In addition, the Dice and Jaccard coefficients are also suitable similarity measures for document comparison (Frakes and Baeza-Yates, 1992). 
We also implement the Dice coefficient into similarity measure S2:</Paragraph> <Paragraph position="11"> S1 is often used in comparing a short query with a document text, whereas S2 is used in comparing two document texts. Reasoning that our objective falls somewhere in between, since we are comparing segments of a document, we also multiply the above two measures into a third similarity measure S3.</Paragraph> <Paragraph position="12"> 7 Confidence on seed word pairs In using bilingual seed words such as 病毒/virus as &quot;bridges&quot; for terminology translation, the quality of the bilingual seed lexicon naturally affects the system output. In the case of European language pairs such as French-English, we can envision using cognates as these &quot;bridges&quot;. Most importantly, we can assume that the word boundaries are similar in French and English. However, the situation is messier with English and Chinese.</Paragraph> <Paragraph position="13"> First, segmentation of the Chinese text into words already introduces some ambiguity in the seed word identities. Second, English-Chinese translations are complicated by the fact that the two languages share very little in stemming properties, part-of-speech sets, or word order. This causes every English word to have many Chinese translations and vice versa. In a source-target language translation scenario, the translated text can be &quot;rearranged&quot; and cleaned up by a monolingual language model in the target language. However, the lexicon is not very reliable in establishing &quot;bridges&quot; between non-parallel English-Chinese texts. To compensate for this ambiguity in the seed lexicon, we introduce a confidence weighting to each bilingual word pair used as seed words. 
If a word $i_e$ is the k-th candidate for word $i_c$, then $w_i = w_i / k_i$.</Paragraph> <Paragraph position="14"> The similarity scores then become S4 and S5</Paragraph> <Paragraph position="16"> We also experiment with other combinations of the similarity scores such as S7 = S0 × S5.</Paragraph> <Paragraph position="17"> All similarity measures S3 - S7 are used in the experiment for finding a translation for 流感.</Paragraph> </Section> <Section position="5" start_page="416" end_page="417" type="metho"> <SectionTitle> 8 Results </SectionTitle> <Paragraph position="0"> In order to apply the above algorithm to find the translation for 流感/liougan from the HKStandard/Mingpao corpus, we first use a script to select the 118 English content words which are not in the lexicon as possible candidates. Using similarity measures S3-S7, the highest ranking candidates for 流感 are shown in Table 6. S6 and S7 appear to be the best similarity measures.</Paragraph> <Paragraph position="1"> We then test the algorithm with S7 on more Chinese words which are not found in the lexicon but which occur frequently enough in the Mingpao texts. A statistical new word extraction tool can be used to find these words. The unknown Chinese words and their English counterparts, as well as the occurrence frequencies of these words in HKStandard/Mingpao, are shown in Table 4. A frequency number marked with a * indicates that the word does not occur frequently enough to be found. A Chinese word marked with a * indicates a word with segmentation and translation ambiguities. For example, 林 (Lam) could be a family name, or part of another word meaning forest. When it is used as a family name, it could be transliterated into Lam in Cantonese or Lin in Mandarin.</Paragraph> <Paragraph position="2"> Disregarding all entries with a * in the above table, we apply the algorithm to the rest of the Chinese unknown words and the 118 English unknown words from HKStandard. The output is ranked by the similarity scores. 
The highest ranking translated pairs are shown in Table 5.</Paragraph> <Paragraph position="3"> The only Chinese unknown words which are not correctly translated in the above list are Hwa is a pair of collocates which is actually the full name of the Chief Executive. Poultry in Chinese is closely related to flu because the Chinese name for bird flu is poultry flu. In fact, almost all unambiguous Chinese new words find their translations in the first 100 of the ranked list. Six of the Chinese words have the correct translation as their first candidate.</Paragraph> </Section> <Section position="6" start_page="417" end_page="418" type="metho"> <SectionTitle> 9 Related work </SectionTitle> <Paragraph position="0"> Using a vector space model and similarity measures for ranking is a common approach in IR for query/text and text/text comparisons (Salton and Buckley, 1988; Salton and Yang, 1973; Croft, 1984; Turtle and Croft, 1992; Bookstein, 1983; Korfhage, 1995; Jones, 1979). This approach has also been used by (Dagan and Itai, 1994; Gale et al., 1992; Schütze, 1992; Gale et al., 1993; Yarowsky, 1995; Gale and Church, 1994) for sense disambiguation between multiple usages of the same word. Some of the early statistical terminology translation methods are (Brown et al., 1993; Wu and Xia, 1994; Dagan and Church, 1994; Gale and Church, 1991; Kupiec, 1993; Smadja et al., 1996; Kay and Röscheisen, 1993; Fung and Church, 1994; Fung, 1995b). These algorithms all require parallel, translated texts as input. Attempts at exploring nonparallel corpora for terminology translation are very few (Rapp, 1995; Fung, 1995a; Fung and McKeown, 1997). Among these, (Rapp, 1995) proposes that the association between a word and its close collocate is preserved in any language, and (Fung and McKeown, 1997) suggests that the associations between a word and many seed words are also preserved in another language. 
In this paper, we have demonstrated that the associations between a word and its context seed words are well-preserved in nonparallel, comparable texts of different languages.</Paragraph> </Section> </Paper>