<?xml version="1.0" standalone="yes"?> <Paper uid="W95-0114"> <Title>Compiling Bilingual Lexicon Entries From a Non-Parallel English-Chinese Corpus</Title> <Section position="4" start_page="173" end_page="174" type="metho"> <SectionTitle> 3 Some Linguistic Characteristics of Chinese </SectionTitle> <Paragraph position="0"> We have chosen Chinese and English as the two languages from which we will build a bilingual dictionary.</Paragraph> <Paragraph position="1"> Since these languages are significantly different, we need to develop an algorithm which does not rely on any similarity between the languages, and which can be readily extended to other language pairs.</Paragraph> <Paragraph position="2"> It is useful to point out some significant differences between Chinese and English in order to help explain the output of our experiments: Chinese texts have no word delimiters. It is necessary to tokenize the text with a Chinese tokenizer. Since the tokenizer is not perfect, the word translation extraction process is affected by this preprocessing.</Paragraph> <Paragraph position="3"> Chinese part-of-speech classes are very ambiguous; many words can function as either adjective or noun, or as either noun or verb. Many adjectives can also act as adverbs with no morphological change.</Paragraph> <Paragraph position="4"> Chinese words carry little or no morphological information. There are no inflections on nouns, adjectives or verbs to indicate gender, number, case, tense or person (Xi 1985). There is no capitalization to indicate the beginning of a sentence.</Paragraph> <Paragraph position="5"> There are very few function words in Chinese compared to other languages, especially English. 
Moreover, function words in Chinese are frequently omitted.</Paragraph> </Section> <Section position="5" start_page="174" end_page="174" type="metho"> <SectionTitle> 5 A vast number of acronyms are employed in Chinese, which means many single words in Chinese can </SectionTitle> <Paragraph position="0"> be translated into compound words in English. Hong Kong Chinese use many terms borrowed from classical Chinese which tend to be more concise. The usage of idioms in Chinese is significantly more frequent than in English.</Paragraph> <Paragraph position="1"> Points 3, 4, and 5 contribute to the fact that the Chinese text of our corpus has fewer unique words than the English text.</Paragraph> </Section> <Section position="6" start_page="174" end_page="119421" type="metho"> <SectionTitle> 4 Context Heterogeneity of a Word </SectionTitle> <Paragraph position="0"> In a non-parallel corpus, a domain-specific term and its translation are used in different sentences in the two texts. Take the example of the word air in the English text. Its concordance is shown partly in Table 4. It occurred 176 times. Its translation ~ occurred 37 times in the Chinese text, and part of its concordance is also shown in Table 4. They are used in totally different sentences. Thus, we cannot hope that their occurrence frequencies would correspond to each other in any significant way.</Paragraph> <Paragraph position="1"> On the other hand, air/~ are domain-specific words in the text, meaning something we breathe, as opposed to some kind of ambiance or attitude. They are used mostly in similar contexts, as shown in the concordances. If we look at the content word preceding air in the concordance, and the content word following it, we notice that air is not randomly paired with other words. 
There are a limited number of word bigrams (x, W) and a limited number of word bigrams (W, y) where W is the word air; likewise for ~.</Paragraph> <Paragraph position="2"> The number of such unique bigrams indicates a degree of heterogeneity of this word in a text in terms of its neighbors.</Paragraph> <Paragraph position="3"> We define the context heterogeneity vector of a word W to be an ordered pair (x, y) where: left heterogeneity x = a/c; right heterogeneity y = b/c;</Paragraph> <Paragraph position="5"> a = number of different types of tokens immediately preceding W in the text; b = number of different types of tokens immediately following W in the text; c = number of occurrences of W in the text. The context heterogeneity of any function word, such as the, would have x and y values very close to one, since it can be preceded or followed by many different words. On the other hand, the x value of the word am is small because it always follows the word I.</Paragraph> <Paragraph position="6"> We postulate that the context heterogeneity of a given domain-specific word is more similar to that of its translation in another language than to that of an unrelated word in the other language, and that this is a more salient feature than their occurrence frequencies in the two texts.</Paragraph> <Paragraph position="7"> For example, the context heterogeneity of air is (119/176, 47/176) = (0.676, 0.267) and the context heterogeneity of its translation in Chinese, ~, is (29/37, 17/37) = (0.784, 0.459). The context heterogeneity of the word (~k~/adjournment, on the other hand, is (37/175, 16/175) = (0.211, 0.091). Notice that although air and (~k~ have similar occurrence frequencies, their context heterogeneities have very different</Paragraph> <Paragraph position="9"> values, indicating that air has a much more productive context than (~N. 
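As an illustrative sketch of the definition above (the function name and the toy sentence are ours, not part of the paper's experiment), the context heterogeneity vector of a word can be computed from a token list as follows:

```python
def context_heterogeneity(tokens, w):
    """Return the context heterogeneity vector (x, y) of word w,
    where x = a/c, y = b/c; a (b) is the number of distinct token
    types immediately preceding (following) w, and c is the number
    of occurrences of w in the text."""
    left, right = set(), set()
    occurrences = 0
    for i, tok in enumerate(tokens):
        if tok != w:
            continue
        occurrences += 1
        if i != 0:
            left.add(tokens[i - 1])
        if i != len(tokens) - 1:
            right.add(tokens[i + 1])
    if occurrences == 0:
        raise ValueError("word does not occur in the text")
    return len(left) / occurrences, len(right) / occurrences

# Toy illustration of the 'am' example: it always follows 'I',
# so its left heterogeneity is low.
toy = "I am here and I am there and I am everywhere".split()
print(context_heterogeneity(toy, "am"))  # -> (0.3333333333333333, 1.0)
```

On a real corpus the same function reproduces ratios such as (119/176, 47/176) for air, given the tokenized text.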
On the other hand, gg~ has context heterogeneity values more similar to those of air, even though its occurrence frequency in the Chinese text is much lower.</Paragraph> <Paragraph position="10"> people to enjoy fresh air, exercise, and a complete change of
is it possible for room air - conditioners to be provided
houses and institutions. I believe that air - conditioners
Chicago Expo told people all about air - conditioning and the 1 9 3 9 Expo in
likely to be attracted to visit Expo by air would only aggravate the problem.
overnment needs to come out of its old air - tight armour suit which might serve
the problems of refuse, sewage, polluted air, noise and chemical
ociety marching parallel with decline our air and water and general
It will cover whole spectrum pollution : air, noise, water and wastes.
KMB is now experimenting with air - conditioned double - deckers</Paragraph> </Section> <Section position="7" start_page="119421" end_page="119421" type="metho"> <SectionTitle> 5 Distance Measure between two Context Heterogeneity Vectors </SectionTitle> <Paragraph position="0"> To measure the similarity between two context heterogeneity vectors, we use the simple Euclidean distance: d = sqrt((x1 - x2)^2 + (y1 - y2)^2).</Paragraph> <Paragraph position="2"> The Euclidean distance between air and ~ is 0.2205, whereas the distance between air and (~k~ is 0.497. We use the ordered pair based on the assumption that the word order for nouns in English and Chinese is similar most of the time. For example, air pollution is translated into ~.</Paragraph> <Paragraph position="3"> 6 Filtering out Function Words in English There are many function words in English which do not translate into Chinese. This is because in most Asian languages, there are very few function words compared to Indo-European languages. Function words in Chinese or Japanese are frequently omitted. 
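The Euclidean distance of Section 5, applied to the heterogeneity vectors reported for air and its two Chinese comparison words, can be sketched as follows (the function names and the ASCII placeholder labels for the Chinese words are ours):

```python
import math

def heterogeneity_distance(u, v):
    """Euclidean distance between two context heterogeneity vectors."""
    return math.sqrt((u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2)

def rank_candidates(source_vector, candidates):
    """Sort candidate words, closest heterogeneity vector first."""
    return sorted(candidates,
                  key=lambda w: heterogeneity_distance(source_vector, candidates[w]))

# Vectors reported in the paper for air, its Chinese translation,
# and the unrelated word glossed 'adjournment'.
air = (0.676, 0.267)
candidates = {"air_zh": (0.784, 0.459), "adjournment_zh": (0.211, 0.091)}
print(rank_candidates(air, candidates))  # -> ['air_zh', 'adjournment_zh']
```

The distances come out to about 0.220 and 0.497, matching the figures quoted in the text up to rounding.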
This partly contributes to the fact that there are far fewer Chinese words than English words in two texts of similar lengths.</Paragraph> <Paragraph position="4"> Since function words such as the, a, and of affect the context heterogeneity of most nouns in English while giving very little information, we filter them out from the English text. This heuristic greatly increased the context heterogeneity values of many nouns. The list of function words filtered out is the, a, an, this, that, of, by, for, in, to. This is by no means a complete list of English function words. More rigorous statistical training methods could probably be developed to find out which function words in English have no Chinese correspondences. However, if one uses context heterogeneity on languages having more function words, such as French, it is advisable that filtering be carried out on both texts.</Paragraph> <Paragraph position="5"> 7 Experiment 1: Finding Word Translation Candidates Given the simplicity of our current context heterogeneity measures and the complexity of finding translations from a non-parallel text in which many words will not find their translations, we propose to use context heterogeneity only as a bootstrapping feature in finding a candidate list of translations for a word.</Paragraph> <Paragraph position="6"> In our first experiment, we hand-compiled a list of 58 word pairs in English and Chinese, as in Tables 3 and 4, and then used 58 by 58 context heterogeneity measures to match them against each other. Note that this list contains many single-character words which are ambiguous in Chinese, English words which should have been part of a compound word, multiple translations of a single word in English, etc. The initial results are revealing, as shown by the histograms in Figure 1.</Paragraph> <Paragraph position="7"> In the left figure, we show that 12 words have their translations in the top 5 candidates. 
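The function-word filtering of Section 6 can be sketched as follows (the helper name and sample sentence are ours; the ten-word stop list is the one given in the text):

```python
# The ten English function words the paper filters out before
# computing context heterogeneity.
FUNCTION_WORDS = {"the", "a", "an", "this", "that", "of", "by", "for", "in", "to"}

def filter_function_words(tokens):
    """Drop listed function words so they no longer count as the
    immediate neighbours of content words."""
    return [t for t in tokens if t.lower() not in FUNCTION_WORDS]

sentence = "the quality of the air in this city".split()
print(filter_function_words(sentence))  # -> ['quality', 'air', 'city']
```

After filtering, content words such as quality and air become adjacent, which is why the heterogeneity values of many nouns increase.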
In the right figure, we show the result of filtering out the Chinese genitive ~ from the Chinese texts. In this case, we can see that over 50% of the words found their translation in the top 10 candidates, although it gives fewer words with translations in the top 5.</Paragraph> <Paragraph position="8"> In Sections 7.1 to 7.4, we will discuss the effects of various factors on our results.</Paragraph> <Section position="1" start_page="119421" end_page="119421" type="sub_section"> <SectionTitle> 7.1 Effect of Chinese Tokenization </SectionTitle> <Paragraph position="0"> We used a statistically augmented Chinese tokenizer for finding word boundaries in the Chinese text (Fung & Wu 1994; Wu & Fung 1994). Chinese tokenization is a difficult problem, and tokenizers always make errors. Most single Chinese characters can be joined with other characters to form different words, so the translation of a single Chinese character is ill-defined. Moreover, in some cases, our Chinese tokenizer groups frequently co-occurring characters into a single word that does not have an independent semantic meaning.</Paragraph> <Paragraph position="1"> For example, (~/-th item, number. In the above cases, the context heterogeneity values of the Chinese translation are not reliable. However, translators would recognize this error readily and would not consider it a translation candidate.</Paragraph> </Section> <Section position="2" start_page="119421" end_page="119421" type="sub_section"> <SectionTitle> 7.2 Effect of English Compound Words </SectionTitle> <Paragraph position="0"> As we have mentioned, our Chinese text has many acronyms and idioms which were identified by our tokenizer and grouped into single words. However, the English text did not undergo a collocation extraction process. 
We can use the following heuristic to overcome the problem: For a given word Wi in a trigram (Wi-1, Wi, Wi+1) with context heterogeneity (x, y):</Paragraph> </Section> <Section position="3" start_page="119421" end_page="119421" type="sub_section"> <SectionTitle> 7.3 Effect of Words with Multiple Functions </SectionTitle> <Paragraph position="0"> As mentioned earlier, many Chinese words have multiple part-of-speech tags, such as the Chinese for declaration/declare, development/developing, adjourned/adjournment, or expenditure/spend. Therefore these words have one-to-many mappings with English words.</Paragraph> <Paragraph position="1"> We could use part-of-speech taggers to label these words with different classes, effectively treating them as different words.</Paragraph> <Paragraph position="2"> Another way to reduce one-to-many mappings between Chinese and English words could be to use a morphological analyzer in English to map all English words with the same root but different case, gender, tense, number, or capitalization to a single word type.</Paragraph> </Section> <Section position="4" start_page="119421" end_page="119421" type="sub_section"> <SectionTitle> 7.4 Effect of Word Order </SectionTitle> <Paragraph position="0"> We had assumed that the trigram word order in Chinese and English is similar. Yet in a non-parallel text, nouns can appear either before or after a verb, as a subject or an object, and thus it is conceivable that we should relax the distance measure to be: d = sqrt((x1 - x2)^2 + (y1 - y2)^2 + (x1 - y2)^2 + (y1 - x2)^2). We applied this measure and indeed improved the scores for nouns such as vessels, Government, employers, debate, prosperity. Between some other language pairs, such as French and English, the word order of trigrams containing nouns can be reversed most of the time. For example, air pollution would be translated into pollution d'air. 
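The relaxed, order-insensitive measure of Section 7.4 can be written as a small sketch (the function name is ours; the four terms follow the formula above, with (x1, y1) and (x2, y2) the two heterogeneity pairs):

```python
import math

def relaxed_distance(u, v):
    """Order-relaxed heterogeneity distance: besides the usual
    left-left and right-right terms, also compare each word's left
    heterogeneity with the other word's right heterogeneity, so that
    words whose neighbours swap sides across languages still match."""
    x1, y1 = u
    x2, y2 = v
    return math.sqrt((x1 - x2) ** 2 + (y1 - y2) ** 2
                     + (x1 - y2) ** 2 + (y1 - x2) ** 2)

# Two vectors with swapped left/right values: the cross terms vanish.
print(relaxed_distance((1.0, 0.0), (0.0, 1.0)))  # -> 1.4142135623730951
```

Note that the relaxed distance of a vector with itself is zero only when x = y; the measure deliberately trades some discrimination for robustness to word-order differences.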
For adjective-noun pairs, Chinese, English and even Japanese share similar orders, whereas French has adjective-noun pairs in the reverse order most of the time. So when we apply context heterogeneity measures to word pairs in English and French, we might map the left heterogeneity in English to the right heterogeneity in French, and vice versa.</Paragraph> <Paragraph position="1"> 8 Experiment 2: Finding the Word Translation Among a Cluster of Words The above experiment showed to some extent the clustering ability of context heterogeneity. To test the discriminative ability of this feature, we chose two clusters of words centered on the known English-Chinese word pair debate/~-~. We obtained a cluster of Chinese words centered around ~-~ by applying the Kvec segment co-occurrence score (Fung & Church 1994) on the Chinese text with itself. The Kvec algorithm was previously used to find co-occurring bilingual word pairs among many candidates. In our experiment, the co-occurrence happens within the same text, and therefore we got a candidate list for ~-~ that is a cluster of words similar</Paragraph> <Paragraph position="3"> to it in terms of the occurrence measure. This cluster was proposed as a candidate translation list for debate.</Paragraph> <Paragraph position="4"> We applied context heterogeneity measures between debate and the Chinese word list, with the results shown in Table 5, the best translation at the top.</Paragraph> <Paragraph position="5"> t'~J~il~/Second Reading of the Bill passed The asterisks in Table 5 indicate tokenizer errors. The correct translation is the third candidate. Although we cannot say at this point that this result is significant, it is to some extent encouraging. It is interesting to note that if we applied the same Kvec algorithm to the English part of the text, we would get a cluster of English words which contains individual translations of some of the words in the Chinese cluster. 
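The Kvec score itself is specified in Fung & Church (1994); as a rough stand-in to illustrate the idea of scoring words by shared text segments (this simplified Dice-overlap version is ours, not the published algorithm), one can split a text into k equal segments and compare the segment sets of two words:

```python
def segment_vectors(tokens, k):
    """Map each word type to the set of the k equal-sized segments
    of the text in which it occurs."""
    n = len(tokens)
    vecs = {}
    for i, tok in enumerate(tokens):
        vecs.setdefault(tok, set()).add(i * k // n)
    return vecs

def cooccurrence_score(vecs, w1, w2):
    """Dice-style overlap between the segment sets of two words:
    words that keep appearing in the same segments score high."""
    a, b = vecs.get(w1, set()), vecs.get(w2, set())
    if not a or not b:
        return 0.0
    return 2 * len(a.intersection(b)) / (len(a) + len(b))

# Toy text split into 4 segments: 'debate' and 'motion' co-occur.
tokens = "debate motion debate vote x y debate motion".split()
vecs = segment_vectors(tokens, 4)
print(cooccurrence_score(vecs, "debate", "motion"))  # -> 0.8
```

Scoring one word against all others in the same text and keeping the top scorers yields a monolingual cluster of the kind used as the candidate list above.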
This shows that the co-occurrence measure can give similar clusters of words in different languages from non-parallel texts.</Paragraph> </Section> </Section> <Section position="8" start_page="119421" end_page="119421" type="metho"> <SectionTitle> 9 Non-parallel Corpora Need to be Larger than Parallel Corpora </SectionTitle> <Paragraph position="0"> Among the 58 words we selected, there is one word, service, which occurred 926 times in the English text but failed to appear even once in the Chinese text (presumably the Legco debates focused more on the issue of various public and legal services in Hong Kong during the 1988-90 time frame than later during 1991-92.</Paragraph> <Paragraph position="1"> And in English the speakers frequently accuse each other of paying lip service to various issues). We expect there would be a great number of words which simply do not have their translations in the other text. Words which occur very few times also have unreliable context heterogeneity. A logical way to cope with such a sparse data problem is to use larger non-parallel corpora. Our texts each have about 3 million words, which is much smaller than the parallel Canadian Hansard used for the same purposes. Because our corpus was divided into two parts to form a non-parallel corpus, it is also half the size of the parallel corpus used for word alignment (Wu & Xia 1994). With a larger corpus, there will be more source words in the vocabulary for us to translate, and more target candidates to choose from.</Paragraph> </Section> <Section position="9" start_page="119421" end_page="119421" type="metho"> <SectionTitle> 10 Future Work </SectionTitle> <Paragraph position="0"> We have explained that there are various immediate ways to improve context heterogeneity measures by including more linguistic information about Chinese and English, such as word class correspondence and word order correspondence, as well as by using a larger context window. 
Meanwhile, much larger non-parallel corpora are needed for the compilation of bilingual lexicons. We are currently experimenting with other similarity measures between word pairs from non-parallel corpora. We plan eventually to incorporate context heterogeneity measures and other word-pair similarity measures into bilingual lexicon learning paradigms.</Paragraph> </Section> </Paper>