<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-1001">
  <Title>(c) 2004 Association for Computational Linguistics. Word Translation Disambiguation Using Bilingual Bootstrapping</Title>
  <Section position="4" start_page="6" end_page="13" type="metho">
    <SectionTitle>
3. Bilingual Bootstrapping
</SectionTitle>
    <Paragraph position="0"/>
    <Section position="1" start_page="6" end_page="7" type="sub_section">
      <SectionTitle>
3.1 Basic Algorithm
</SectionTitle>
      <Paragraph position="0"> Bilingual bootstrapping makes use of a small amount of classified data and a large amount of unclassified data in both the source and the target languages in translation.</Paragraph>
      <Paragraph position="1"> It repeatedly constructs classifiers in the two languages in parallel and boosts the performance of the classifiers by classifying data in each of the languages and by exchanging information regarding the classified data between the two languages.</Paragraph>
      <Paragraph position="2"> Figures 3 and 4 illustrate the process of bilingual bootstrapping. Figure 5 shows the translation relationship among the ambiguous words plant, zhiwu, and gongchang.</Paragraph>
      <Paragraph position="3"> There is a classifier for plant in English. There are also two classifiers in Chinese, one for zhiwu and one for gongchang. Sentences containing plant in English and sentences containing zhiwu and gongchang in Chinese are used.</Paragraph>
      <Paragraph position="4"> In the beginning, sentences P1 and P4 on the English side are assigned labels 1 and 2, respectively (Figure 3). On the Chinese side, sentences G1 and G3 are assigned labels 1 and 3, respectively, and sentences Z1 and Z3 are assigned labels 2 and 4, respectively.</Paragraph>
      <Paragraph position="5"> The four labels here correspond to the four links in Figure 5. For example, label 1 represents the sense factory and label 2 represents the sense flora. Other sentences are not labeled. Bilingual bootstrapping uses labeled sentences P1, P4, G1, and Z1 to create a classifier for plant disambiguation (between label 1 and label 2). It also uses labeled sentences Z1, Z3, and P4 to create a classifier for zhiwu and uses labeled sentences G1, G3, and P1 to create a classifier for gongchang. Bilingual bootstrapping next uses the classifier for plant to label sentences P2 and P5 (Figure 4). It uses the classifier for zhiwu to label sentences Z2 and Z4, and uses the classifier for gongchang to label sentences G2 and G4. The process is repeated until we cannot continue.</Paragraph>
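The iteration just illustrated can be sketched as a short program (a hypothetical skeleton, not the authors' implementation; train and classify stand in for whichever classifier is used, and classify returns None when it is not confident enough to label an instance):

```python
# Hypothetical sketch of the bilingual bootstrapping loop described above.
# Each language holds labeled and unlabeled instances; in every round the
# classifiers trained on the pooled labeled data label a few more instances,
# and the loop stops when no new labels can be assigned.

def bootstrap(labeled_en, unlabeled_en, labeled_cn, unlabeled_cn,
              train, classify):
    while True:
        clf_en = train(labeled_en + labeled_cn)   # information exchanged
        clf_cn = train(labeled_cn + labeled_en)   # across the two languages
        newly = []
        for pool, labeled, clf in ((unlabeled_en, labeled_en, clf_en),
                                   (unlabeled_cn, labeled_cn, clf_cn)):
            for inst in list(pool):
                label = classify(clf, inst)
                if label is not None:             # confident enough to label
                    pool.remove(inst)
                    labeled.append((inst, label))
                    newly.append(inst)
        if not newly:                             # "until we cannot continue"
            return labeled_en, labeled_cn
```

The two trained classifiers both see the union of the labeled data, which is the point of the exchange step in the text.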
      <Paragraph position="6"> To describe this process formally, let E denote a set of words in English, C a set of words in Chinese, and T a set of senses (links) in a translation dictionary, as shown in Figure 5. (Any two linked words can be translations of each other.) Mathematically, T is defined as a relation between E and C, that is, T ⊆ E × C. Let e stand for an ambiguous word in E, and g an ambiguous word in C. Also let e stand for a context word in E, c a context word in C, and t a sense in T.</Paragraph>
      <Paragraph position="7"> For an English word e, T_e = {t | t = (e, c), t ∈ T} represents the set of e's possible senses (i.e., its links), and C_e = {c | (e, c) ∈ T} represents the set of its possible translations in Chinese. For the example in Figure 5, when e = plant, we have T_plant = {1, 2} and C_plant = {gongchang, zhiwu}; for the Chinese word zhiwu, the set of possible English translations is E_zhiwu = {plant, vegetable}. Note that gongchang and zhiwu share the senses {1, 2} with plant.</Paragraph>
      <Paragraph position="13"> For each sense t ∈ T_e of an English word e, a binary classifier is defined that decides whether an instance of e belongs to t. Similarly, for a Chinese word g, a binary classifier is defined for each sense t ∈ T_g.</Paragraph>
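The definitions above amount to treating the translation dictionary as a set of numbered links; a minimal sketch using the Figure 5 example (the link numbering follows the text) is:

```python
# The translation dictionary as a relation T of (e, c) links, following
# Figure 5: plant-gongchang (sense 1), plant-zhiwu (sense 2),
# factory-gongchang (sense 3), vegetable-zhiwu (sense 4).
T = {
    1: ('plant', 'gongchang'),
    2: ('plant', 'zhiwu'),
    3: ('factory', 'gongchang'),
    4: ('vegetable', 'zhiwu'),
}

def senses_of_english(e):
    """T_e: the senses (links) of an English word e."""
    return {t for t, (ew, cw) in T.items() if ew == e}

def translations_of_english(e):
    """C_e: the possible Chinese translations of e."""
    return {cw for t, (ew, cw) in T.items() if ew == e}

def english_words_of_chinese(c):
    """E_c: the possible English translations of a Chinese word c."""
    return {ew for t, (ew, cw) in T.items() if cw == c}
```

For example, senses_of_english('plant') gives {1, 2} and english_words_of_chinese('zhiwu') gives {'plant', 'vegetable'}, matching the sets in the text.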
      <Paragraph position="15"> We perform bilingual bootstrapping as described in Figure 6. Note that we can, in principle, employ any kind of classifier here.</Paragraph>
      <Paragraph position="16"> The figure explains the process for English (left-hand side); the process for Chinese (right-hand side) behaves similarly. At step 1, for each ambiguous word e, we create binary classifiers for resolving its ambiguities (cf. lines 1-3). The main point here is that we use classified data from both languages to construct classifiers, as we describe in Section 3.2. For the example in Figure 3, we use the labeled data in both languages related to T_plant = {1, 2}. Note that not only P1 and P4, but also Z1 and G1, are related to {1, 2}.</Paragraph>
      <Paragraph position="19"> At step 2, for each word e, we use its classifiers to select some unclassified instances from U e , classify them, and add them to L e (cf. lines 4-19). We repeat the process until we cannot continue.</Paragraph>
      <Paragraph position="20"> Lines 9-13 show that for each unclassified instance e_e, we use the classifiers to classify it into the class (sense) t if t's posterior odds are the largest among the possible classes and are larger than a threshold th. For each class t, we store the classified instances in S_t. Lines 14-15 show that for each class t, we choose only the top b classified instances (in terms of the posterior odds), which are then stored in Q_t. Lines 16-17 show that we create the classified instances by combining the instances with their classification labels. We note that after line 17 we can also employ the one-sense-per-discourse heuristic.</Paragraph>
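The selection in lines 9-17 — classify into the sense with the largest posterior odds above the threshold th, then keep only the top b per sense — can be sketched as follows (odds_fn is a placeholder for the classifier's posterior-odds computation):

```python
# Sketch of the selection step (lines 9-17 of Figure 6): each unlabeled
# instance goes to the sense with the largest posterior odds, provided the
# odds exceed the threshold th; per sense, only the top b instances are kept.

def select_instances(instances, odds_fn, senses, th, b):
    chosen = {t: [] for t in senses}              # S_t in the text
    for inst in instances:
        scored = [(odds_fn(inst, t), t) for t in senses]
        best_odds, best_t = max(scored, key=lambda p: p[0])
        if best_odds > th:
            chosen[best_t].append((best_odds, inst))
    labeled = []                                  # union of the Q_t
    for t, pairs in chosen.items():
        pairs.sort(key=lambda p: p[0], reverse=True)
        for odds, inst in pairs[:b]:              # keep only the top b
            labeled.append((inst, t))
    return labeled
```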
    </Section>
    <Section position="2" start_page="7" end_page="11" type="sub_section">
      <SectionTitle>
3.2 An Implementation
</SectionTitle>
      <Paragraph position="0"> Although we can in principle employ any kind of classifier in BB, we use here naive Bayes (or the naive Bayesian ensemble). We also use the EM algorithm to transform classified data between the languages. As will be made clear, this implementation of BB can naturally combine the features of naive Bayes (or the naive Bayesian ensemble) and the features of EM. Hereafter, when we refer to BB, we mean this implementation of BB.</Paragraph>
      <Paragraph position="1"> We explain the process for English (left-hand side of Figure 6); the process for Chinese (right-hand side of the figure) behaves similarly. At step 1 in BB, we construct a naive Bayesian classifier as described in Figure 7 (Figure 7: Creating a naive Bayesian classifier). At step 2, for each instance e_e, we use the classifier to calculate the posterior odds of each sense and label the instance accordingly.</Paragraph>
      <Paragraph position="5"> The estimation of the probabilities used by the classifier proceeds as follows. For the sake of readability, we rewrite the conditional probabilities of context words as P(e | t) and P(c | t). We define a finite-mixture model of the form P(c | t) = Σ_e P(c | e, t) P(e | t) and assume that the data are generated independently from the model. We can therefore employ the expectation-maximization (EM) algorithm (Dempster, Laird, and Rubin 1977) to estimate the parameters of the model, including P(e | t). Note that e and c represent context words. Recall that E is a set of words in English, C is a set of words in Chinese, and T is a set of senses; for a specific English word e, C_e is the set of its possible translations in Chinese. We next estimate the parameters by iteratively updating them, as described in Figure 8, until they converge. Here f(c, t) stands for the frequency of c in the instances that have sense t. The context information in Chinese, f(c, t_e), is then "transformed" into the corresponding context information in English.</Paragraph>
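As a rough illustration of the kind of EM estimation involved (not the exact updates of Figure 8, which the XML does not preserve), the mixture weights P(e | t) can be fit from Chinese context frequencies f(c, t) given an assumed-known table of translation probabilities P(c | e); all names here are illustrative:

```python
# Illustrative EM loop: estimate the mixture weights P(e|t) in the model
# P(c|t) = sum over e of P(c|e) * P(e|t), given observed frequencies
# f(c, t) of Chinese context words and a fixed translation table P(c|e).
# This is a simplification of the paper's model, which conditions on t too.

def em_mixture_weights(f_ct, p_c_given_e, english_words, iters=50):
    # start from uniform weights P(e|t)
    w = {e: 1.0 / len(english_words) for e in english_words}
    for _ in range(iters):
        counts = {e: 1e-12 for e in english_words}
        for c, freq in f_ct.items():
            # E-step: responsibility of each English word e for context word c
            denom = sum(w[e] * p_c_given_e.get((c, e), 0.0)
                        for e in english_words)
            if denom == 0.0:
                continue
            for e in english_words:
                counts[e] += freq * w[e] * p_c_given_e.get((c, e), 0.0) / denom
        # M-step: renormalize the expected counts
        total = sum(counts.values())
        w = {e: counts[e] / total for e in english_words}
    return w
```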
    </Section>
    <Section position="3" start_page="11" end_page="13" type="sub_section">
      <SectionTitle>
3.3 Comparison of BB and MB
</SectionTitle>
      <Paragraph position="0"> We note that monolingual bootstrapping is a special case of bilingual bootstrapping (consider the situation in which a = 0 in formula (3)).</Paragraph>
      <Paragraph position="1"> BB can always perform better than MB. The asymmetric relationship between the ambiguous words in the two languages stands out as the key to the higher performance  of BB. By asymmetric relationship we mean the many-to-many mapping relationship between the words in the two languages, as shown in Figure 10.</Paragraph>
      <Paragraph position="2"> Suppose that the classifier with respect to plant has two classes (denoted as A and B in Figure 10). Further suppose that the classifiers with respect to gongchang and zhiwu in Chinese each have two classes (C and D) and (E and F), respectively. A and D are equivalent to one another (i.e., they represent the same sense), and so are B and E.</Paragraph>
      <Paragraph position="3"> Assume that instances are classified after several iterations of BB as depicted in Figure 10. Here, circles denote the instances that are correctly classified and crosses denote the instances that are incorrectly classified.</Paragraph>
      <Paragraph position="4"> Since A and D are equivalent to one another, we can transform the instances in D and use them to boost classification into A. The misclassified instances (crosses) in D are those mistakenly classified from C, and they will not have much negative effect on classification into A, even though the translation from Chinese into English can introduce some noise. Similar explanations can be given for the other classification decisions.</Paragraph>
      <Paragraph position="5"> In contrast, MB uses only the instances in A and B to construct a classifier. When the number of misclassified instances increases (as is inevitable in bootstrapping), its performance will stop improving. This phenomenon has also been observed when MB is applied to other tasks (cf. Banko and Brill 2001; Pierce and Cardie 2001).</Paragraph>
      <Paragraph position="6">  Li and Li Word Translation Disambiguation Using Bilingual Bootstrapping</Paragraph>
    </Section>
    <Section position="4" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
3.4 Relationship between BB and Co-training
</SectionTitle>
      <Paragraph position="0"> We note that there are similarities between BB and co-training. Both BB and co-training execute two bootstrapping processes in parallel and make the two processes collaborate with one another in order to improve their performance. The two processes look at different types of information in data and exchange the information in learning. However, there are also significant differences between BB and co-training. In co-training, the two processes use different features, whereas in BB, the two processes use different classes. In BB, although the features used by the two classifiers are transformed from one language into the other, they belong to the same space. In co-training, on the other hand, the features used by the two classifiers belong to two different spaces.</Paragraph>
    </Section>
  </Section>
  <Section position="5" start_page="13" end_page="21" type="metho">
    <SectionTitle>
4. Experimental Results
</SectionTitle>
    <Paragraph position="0"> We have conducted two experiments on English-Chinese translation disambiguation.</Paragraph>
    <Paragraph position="1"> In this section, we will first describe the experimental settings and then present the results. We will also discuss the results of several follow-on experiments.</Paragraph>
    <Section position="1" start_page="13" end_page="13" type="sub_section">
      <SectionTitle>
4.1 Translation Disambiguation Using BB
</SectionTitle>
      <Paragraph position="0"> Although it is possible to straightforwardly apply the algorithm of BB described in Section 3 to word translation disambiguation, here we use a variant of it that is better adapted to the task and allows a fairer comparison with existing technologies. The variant of BB we use has four modifications:  1. It employs the naive Bayesian ensemble rather than naive Bayes, because the naive Bayesian ensemble generally performs better than naive Bayes (Pedersen 2000).</Paragraph>
      <Paragraph position="1"> 2. It employs the one-sense-per-discourse heuristic. It turns out that in BB with one sense per discourse, there are two layers of bootstrapping. On the top level, bilingual bootstrapping is performed between the two languages, and on the second level, co-training is performed within each language. (Recall that MB with one sense per discourse can be viewed as co-training.) 3. It uses only classified data in English at the beginning. That is to say, it requires exactly the same human labeling efforts as MB does.</Paragraph>
      <Paragraph position="2"> 4. It individually resolves ambiguities on selected English words such as  plant and interest. (Note that the basic algorithm of BB performs disambiguation on all the words in English and Chinese.) As a result, in the case of plant, for example, the classifiers with respect to gongchang and zhiwu make classification decisions only on D and E and not on C and F (in Figure 10), because it is not necessary to make classification decisions on C and F. In particular, it calculates the posterior odds only for those classes and sets th = 0 in the right-hand side of step 2.</Paragraph>
    </Section>
    <Section position="2" start_page="13" end_page="14" type="sub_section">
      <SectionTitle>
4.2 Translation Disambiguation Using MB
</SectionTitle>
      <Paragraph position="0"> We consider here two implementations of MB for word translation disambiguation.</Paragraph>
      <Paragraph position="1"> In the first implementation, in addition to the basic algorithm of MB, we also use the naive Bayesian ensemble and the one-sense-per-discourse heuristic; we denote this implementation as MB-B hereafter. The second implementation differs from the first only in that it employs a decision list as the classifier. This implementation is exactly the one proposed in Yarowsky (1995). (We will denote it as MB-D hereafter.) MB-B and MB-D can be viewed as the state-of-the-art methods for word translation disambiguation using bootstrapping.</Paragraph>
    </Section>
    <Section position="3" start_page="14" end_page="15" type="sub_section">
      <SectionTitle>
4.3 Experiment 1: WSD Benchmark Data
</SectionTitle>
      <Paragraph position="0"> We first applied BB, MB-B, and MB-D to translation disambiguation on the English words line and interest using a benchmark data set.</Paragraph>
      <Paragraph position="1">  The data set consists mainly of articles from the Wall Street Journal and is prepared for conducting word sense disambiguation (WSD) on the two words (e.g., Pedersen 2000). We collected from the HIT dictionary  the Chinese words that can be translations of the two English words; these are listed in Table 1. One sense of an English word links to one group of Chinese words. (For the word interest, we used only its four major senses, because the remaining two minor senses occur in only 3.3% of the data.) For each sense, we selected an English word that is strongly associated with the sense according to our own intuition (cf. Table 1). We refer to this word as a seed word. For example, for the sense of money paid for the use of money, we selected the word rate. We viewed the seed word as a classified &amp;quot;sentence,&amp;quot; following a similar proposal in Yarowsky (1995). In this way, for each sense we had a classified instance in English. As unclassified data in English, we collected sentences in news articles from a Web site (www.news.com), and as unclassified data in Chinese, we collected sentences in news articles from another Web site (news.cn.tom.com). Note that we need to use only the sentences containing the words in Table 1. We observed that the distribution of the senses in the unclassified data was balanced. As test data, we used the entire benchmark data set.</Paragraph>
      <Paragraph position="2"> Table 2 shows the sizes of the data sets. Note that there are in general more unclassified sentences (and texts) in Chinese than in English, because one English word usually can link to several Chinese words (cf. Figure 5).</Paragraph>
      <Paragraph position="3"> As the translation dictionary, we used the HIT dictionary, which contains about 76,000 Chinese words, 60,000 English words, and 118,000 senses (links). We then used the data to conduct translation disambiguation with BB, MB-B, and MB-D, as described in Sections 4.1 and 4.2.</Paragraph>
      <Paragraph position="4"> For the naive Bayesian ensemble, we used five classifiers with window sizes of ±1, ±3, ±5, ±7, and ±9 words, and we set the parameters β, b, and th to 0.2, 15, and 1.5, respectively. The parameters were tuned on the basis of our preliminary experimental results on MB-B; they were not tuned, however, for BB. We set the BB-specific parameter a to 0.4, which meant that we weighted information from English and Chinese equally.</Paragraph>
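A naive Bayesian ensemble over several window sizes can be sketched as follows (a simplified illustration: the member classifiers are assumed given, and the ensemble here averages their per-sense probabilities, one member per window size):

```python
# Sketch of a naive Bayesian ensemble: one member classifier per context
# window size (here 1, 3, 5, 7, 9 words on each side of the target); the
# ensemble score for a sense is the average of the members' probabilities.

WINDOW_SIZES = (1, 3, 5, 7, 9)

def window(tokens, position, size):
    """Context words within `size` tokens of the target position."""
    lo = max(0, position - size)
    return tokens[lo:position] + tokens[position + 1:position + 1 + size]

def ensemble_classify(tokens, position, member_prob, senses):
    """member_prob(context_words, sense) gives one member's probability."""
    scores = {}
    for t in senses:
        probs = [member_prob(window(tokens, position, k), t)
                 for k in WINDOW_SIZES]
        scores[t] = sum(probs) / len(probs)
    return max(scores, key=scores.get)
```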
      <Paragraph position="5"> Table 3 shows the translation disambiguation accuracies of the three methods as well as that of a baseline method in which we always choose the most frequent sense. Figures 11 and 12 show the learning curves of MB-D, MB-B, and BB. Figure 13 shows the accuracies of BB with different a values. From the results, we see that BB consistently and significantly outperforms both MB-D and MB-B. The results from the sign test are statistically significant (p-value &lt; 0.001). (For the sign test method, see, for example, Yang and Liu [1999]).</Paragraph>
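The sign test referred to here can be computed from the binomial distribution; a generic two-sided version (not necessarily the exact procedure the authors used) counts the instances on which each system wins, discards ties, and tests against equal win probability:

```python
from math import comb

# Two-sided sign test: given the number of instances each system wins
# (ties discarded), compute the binomial p-value under the null hypothesis
# that wins for either system are equally likely (p = 0.5).

def sign_test_p(wins_a, wins_b):
    n = wins_a + wins_b
    k = max(wins_a, wins_b)
    tail = sum(comb(n, i) for i in range(k, n + 1)) * 0.5 ** n
    return min(1.0, 2.0 * tail)
```

A small p-value (e.g., below 0.001, as reported in the text) indicates that the accuracy difference is unlikely under the null hypothesis.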
      <Paragraph position="6"> Table 4 shows the results achieved by some existing supervised learning methods on the benchmark data (cf. Pedersen 2000). Although BB is nearly equivalent to an unsupervised learning method, it still performs favorably when compared with the supervised methods (note that since the experimental settings are different, the results cannot be directly compared).</Paragraph>
    </Section>
    <Section position="4" start_page="15" end_page="18" type="sub_section">
      <SectionTitle>
4.4 Experiment 2: Yarowsky's Words
</SectionTitle>
      <Paragraph position="0"> We also conducted translation disambiguation on seven of the twelve English words studied in Yarowsky (1995). Table 5 lists the words we used.</Paragraph>
      <Paragraph position="1"> We extracted sentences containing these words from the Encarta English corpus and hand-labeled those sentences using our own Chinese translations. We used the labeled sentences as test data and the unlabeled sentences as unclassified data in English. Table 6 shows the data set sizes. We also used the sentences in the Great Encyclopedia Chinese corpus as unclassified data in Chinese. We defined, for each sense, a seed word in English as a classified instance in English (cf. Table 5). We did not, however, conduct translation disambiguation on the words crane, sake, poach, axes, and motion, because the first four words do not frequently occur in the Encarta corpus, and the accuracy of choosing the major translation for the last word already exceeds 98%.</Paragraph>
      <Paragraph position="2"> We next applied BB, MB-B, and MB-D to word translation disambiguation. The parameter settings were the same as those in Experiment 1. Table 7 shows the disambiguation accuracies, and Figures 14-20 show the learning curves for the seven words.</Paragraph>
      <Paragraph position="3"> From the results, we see again that BB significantly outperforms MB-D and MB-B. Note that the results of MB-D here cannot be directly compared with those in Yarowsky (1995), because the data used are different. Naive Bayesian ensemble did not perform well on the word duty, causing the accuracies of both MB-B and BB to deteriorate.</Paragraph>
    </Section>
    <Section position="5" start_page="18" end_page="21" type="sub_section">
      <SectionTitle>
4.5 Discussion
</SectionTitle>
      <Paragraph position="0"> We investigated the reason for BB's outperforming MB and found that the explanation in Section 3.3 appears to be valid according to the following observations.</Paragraph>
      <Paragraph position="1"> 1. In a naive Bayesian classifier, words with large values of the likelihood ratio with respect to a sense will have strong influences on classification. We collected the words having the largest likelihood ratio with respect to each sense t in both BB and MB-B and found that BB obviously has more "relevant words" than MB-B. Here words relevant to a particular sense refer to the words that are strongly indicative of that sense according to human judgments.</Paragraph>
      <Paragraph position="4"> Table 8 shows the top 10 words in terms of likelihood ratio with respect to the interest rate sense in both BB and MB-B. The relevant words are italicized. Figure 21 shows the numbers of relevant words with respect to the four senses of interest in BB and MB-B.</Paragraph>
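The ranking of context words by likelihood ratio can be reproduced as follows (the add-one smoothing is an assumption on our part; the paper does not specify its estimator):

```python
# Rank context words by the likelihood ratio P(word | t) / P(word | not t),
# estimated from counts with add-one smoothing, and return the top k words.

def top_words_by_likelihood_ratio(counts_t, counts_not_t, k=10):
    vocab = set(counts_t) | set(counts_not_t)
    total_t = sum(counts_t.values()) + len(vocab)
    total_n = sum(counts_not_t.values()) + len(vocab)
    def ratio(w):
        p_t = (counts_t.get(w, 0) + 1) / total_t
        p_n = (counts_not_t.get(w, 0) + 1) / total_n
        return p_t / p_n
    return sorted(vocab, key=ratio, reverse=True)[:k]
```

Applied to the counts for the interest rate sense versus the other senses, this reproduces the kind of top-10 list shown in Table 8.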
      <Paragraph position="5"> 2. From Figure 13, we see that the performance of BB remains high or gets higher even when a becomes larger than 0.4 (recall that β was fixed at 0.2). This result strongly indicates that the information from Chinese has positive effects.</Paragraph>
      <Paragraph position="6"> 3. One might argue that the higher performance of BB can be attributed to the larger amount of unclassified data it uses, and thus if we increase the amount of unclassified data for MB, it is likely that MB can perform as well as BB. We conducted an additional experiment and found that this is not the case. Figure 22 shows the accuracies achieved by MB-B as the amount of unclassified data increases. The plot shows that the accuracy of MB-B does not improve when the amount of unclassified data increases.</Paragraph>
      <Paragraph position="7"> Figure 22 also plots the results of BB as well as those of a method referred to as MB-C. In MB-C, we linearly combined two MB-B classifiers constructed with two different unclassified data sets, and we found that although the accuracies are improved in MB-C, they are still much lower than those of BB.</Paragraph>
      <Paragraph position="8"> 4. We have noticed that a key to BB's performance is the asymmetric relationship between the classes in the two languages. Therefore, we tested the performance of MB and BB when the classes in the two languages are symmetric (i.e., one-to-one mapping).</Paragraph>
      <Paragraph position="9"> We performed two experiments on text classification in which the categories were finance and industry, and finance and trade, respectively. We collected Chinese texts from the People's Daily in 1998 that had already been assigned class labels. We used half of them as unclassified training data in Chinese and the remaining half as test data in Chinese. We also collected English texts from the Wall Street Journal and used them as unlabeled training data in English. We used the class names (i.e., finance, industry, and trade) as seed data (classified data). Table 9 shows the accuracies of text classification. From the results we see that when the classes are symmetric, BB cannot outperform MB.</Paragraph>
      <Paragraph position="10"> 5. We also investigated the effect of the one-sense-per-discourse heuristic. Table 10 shows the performance of MB and BB on the word interest with and without the heuristic. We see that with the heuristic, the performance of both MB and BB is improved. Even without the heuristic, BB still performs better than MB with the heuristic.</Paragraph>
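The one-sense-per-discourse heuristic examined in point 5 can be sketched as a post-processing pass that forces all occurrences of the target word within one discourse to the majority label (the data layout here is a hypothetical choice for illustration):

```python
from collections import Counter

# One-sense-per-discourse as post-processing: within each discourse,
# relabel every occurrence of the target word with the majority sense
# among the labels the classifier assigned in that discourse.

def one_sense_per_discourse(labels_by_discourse):
    out = {}
    for discourse, labels in labels_by_discourse.items():
        majority, _ = Counter(labels).most_common(1)[0]
        out[discourse] = [majority] * len(labels)
    return out
```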
    </Section>
  </Section>
</Paper>