<?xml version="1.0" standalone="yes"?>
<Paper uid="J04-1001">
<Title>Word Translation Disambiguation Using Bilingual Bootstrapping</Title>
<Section position="3" start_page="2" end_page="6" type="relat">
<SectionTitle> 2. Related Work </SectionTitle>
<Paragraph position="0"/>
<Section position="1" start_page="2" end_page="3" type="sub_section">
<SectionTitle> 2.1 Word Translation Disambiguation </SectionTitle>
<Paragraph position="0"> Word translation disambiguation (more generally, word sense disambiguation) can be viewed as a classification problem and can be addressed by employing various supervised learning methods. With such a learning method, an English sentence containing an ambiguous English word corresponds to an instance, and the Chinese translation of the word in that context (i.e., the word sense) corresponds to a classification decision (a label).</Paragraph>
<Paragraph position="1"> Many methods for word sense disambiguation based on supervised learning techniques have been proposed. They include those using naive Bayes (Gale, Church, and Yarowsky 1992a), decision lists (Yarowsky 1994), nearest neighbor (Ng and Lee 1996), transformation-based learning (Mangu and Brill 1997), neural networks (Towell and Voorhees 1998), Winnow (Golding and Roth 1999), boosting (Escudero, Marquez, and Rigau 2000), and the naive Bayesian ensemble (Pedersen 2000). The assumption behind these methods is that it is nearly always possible to determine the sense of an ambiguous word by referring to its context, and thus all of the methods build a classifier (i.e., a classification program) using features that represent context information (e.g., surrounding context words). For other related work on translation disambiguation, see Brown et al. (1991), Bruce and Wiebe (1994), Dagan and Itai (1994), Lin (1997), Pedersen and Bruce (1997), Schutze (1998), Kikui (1999), Mihalcea and Moldovan (1999), Koehn and Knight (2000), and Zhou, Ding, and Huang (2001).</Paragraph>
<Paragraph position="2"> Let us formulate the problem of word sense (translation) disambiguation as follows. Let E denote a set of words, let ε denote an ambiguous word in E, and let e denote a context word in E. (Throughout this article, we use Greek letters to represent ambiguous words and italic letters to represent context words.) Let T_ε denote the set of senses of ε, and let t_ε denote a sense in T_ε. For the example presented earlier, we have ε = plant and T_ε = {1, 2}, where 1 represents the sense factory and 2 the sense flora. From the phrase &quot;...computer manufacturing plant and adjacent...&quot; we obtain e_ε = (..., computer, manufacturing, (plant), and, adjacent, ...).</Paragraph>
<Paragraph position="3"> For a specific ε, we define a binary classifier for resolving each of its ambiguities in T_ε in a general form as
P(t_ε | e_ε) and P(t̄_ε | e_ε), t_ε ∈ T_ε,
where e_ε denotes an instance representing a context of ε and t̄_ε denotes the senses other than t_ε. All of the supervised learning methods mentioned previously can automatically create such a classifier. To construct classifiers using supervised methods, we need classified data such as those in Figure 1.</Paragraph>
</Section>
<Section position="2" start_page="3" end_page="4" type="sub_section">
<SectionTitle> 2.2 Decision Lists </SectionTitle>
<Paragraph position="0"> Let us first consider the use of decision lists, as proposed in Yarowsky (1994). Let f_ε denote a feature of the context of ε. A feature can be, for example, a word's occurrence immediately to the left of ε. We define many such features. For each feature f_ε, we use the classified data to calculate the posterior probability ratio of each sense t_ε with respect to the feature,
P(t_ε | f_ε) / P(t̄_ε | f_ε),
and regard each pair (f_ε, t_ε) as a rule with this ratio as its score. We sort the rules in descending order with respect to their scores, provided that the scores of the rules are larger than the default
P(t_ε) / P(t̄_ε).
The sorted rules form an if-then-else type of rule sequence, that is, a decision list.</Paragraph>
</Section>
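As an illustration of the formulation above, here is a minimal Python sketch of decision-list construction, assuming that classified instances are (context-word list, sense) pairs, that features are simply bag-of-words context words, and that a small smoothing constant is used; the function names and the smoothing scheme are assumptions made for this sketch, not details taken from Yarowsky (1994).

import math
from collections import defaultdict

# A classified instance is (context_words, sense), e.g.
# (["computer", "manufacturing", "and", "adjacent"], "factory").
# Features here are just the context words themselves (an assumption;
# richer features such as "word immediately to the left" work the same way).

def build_decision_list(labeled, smoothing=0.5):
    """Return rules (feature, sense, score) sorted by descending score."""
    sense_counts = defaultdict(float)
    pair_counts = defaultdict(float)      # (feature, sense) -> count
    feature_counts = defaultdict(float)   # feature -> count
    senses = set()

    for context, sense in labeled:
        senses.add(sense)
        sense_counts[sense] += 1
        for f in set(context):
            pair_counts[(f, sense)] += 1
            feature_counts[f] += 1

    rules = []
    total = sum(sense_counts.values())
    for f in feature_counts:
        for t in senses:
            # Smoothed posterior P(t | f); the log of the ratio keeps the same ordering.
            p_t = (pair_counts[(f, t)] + smoothing) / (feature_counts[f] + smoothing * len(senses))
            score = math.log(p_t / max(1.0 - p_t, 1e-12))
            # Default score: prior log-odds of sense t (the "default" in the text).
            prior_t = sense_counts[t] / total
            default = math.log(prior_t / max(1.0 - prior_t, 1e-12))
            if score > default:
                rules.append((f, t, score))
    rules.sort(key=lambda r: r[2], reverse=True)
    return rules

def classify(rules, context, default_sense):
    """Apply the first matching rule: the if-then-else behaviour of a decision list."""
    words = set(context)
    for f, t, _ in rules:
        if f in words:
            return t
    return default_sense

For example, classify(build_decision_list(data), ["computer", "manufacturing"], "factory") applies the highest-scoring rule whose feature occurs in the context, falling back to the default sense when no rule fires.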
<Section position="3" start_page="4" end_page="4" type="sub_section">
<SectionTitle> 2.3 Naive Bayesian Ensemble </SectionTitle>
<Paragraph position="0"> Let us next consider the use of naive Bayesian classifiers. Given an instance e_ε, we calculate the posterior probability of each sense,
P(t_ε | e_ε) = P(t_ε) P(e_ε | t_ε) / P(e_ε),
according to Bayes' rule and select the sense with the largest posterior probability,
t̂_ε = argmax_{t_ε ∈ T_ε} P(t_ε) P(e_ε | t_ε). (2)
In a naive Bayesian classifier, we assume that the words in e_ε are conditionally independent of one another given the sense, so that
P(e_ε | t_ε) = ∏_{e ∈ e_ε} P(e | t_ε).
The naive Bayesian ensemble method for word sense disambiguation, as proposed in Pedersen (2000), employs a linear combination of several naive Bayesian classifiers constructed on the basis of a number of nested surrounding contexts e_ε^(1), ..., e_ε^(k):
P(t_ε | e_ε^(1), ..., e_ε^(k)) = (1/k) Σ_{i=1}^{k} P(t_ε | e_ε^(i)).
The naive Bayesian ensemble is reported to perform best for word sense disambiguation with respect to a benchmark data set (Pedersen 2000).</Paragraph>
</Section>
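A minimal Python sketch of this ensemble follows, assuming bag-of-words features, add-one smoothing, and equally weighted classifiers trained on nested context windows (e.g., ±1, ±5, ±25 words, in the spirit of Pedersen 2000); the class and function names are ours, not Pedersen's.

import math
from collections import defaultdict

class NaiveBayes:
    """Word sense classifier with bag-of-words features and add-one smoothing."""

    def __init__(self):
        self.prior = defaultdict(float)                        # P(t)
        self.cond = defaultdict(lambda: defaultdict(float))    # P(e | t)
        self.vocab = set()

    def train(self, labeled):
        sense_tokens = defaultdict(float)
        for context, sense in labeled:
            self.prior[sense] += 1
            for w in context:
                self.cond[sense][w] += 1
                sense_tokens[sense] += 1
                self.vocab.add(w)
        n = sum(self.prior.values())
        for t in self.prior:
            self.prior[t] /= n
        v = len(self.vocab)
        for t in self.cond:
            for w in self.vocab:
                self.cond[t][w] = (self.cond[t][w] + 1.0) / (sense_tokens[t] + v)

    def posterior(self, context):
        """Return normalized P(t | context) under the independence assumption."""
        log_scores = {}
        for t in self.prior:
            s = math.log(self.prior[t])
            for w in context:
                if w in self.vocab:
                    s += math.log(self.cond[t][w])
            log_scores[t] = s
        m = max(log_scores.values())
        exp_scores = {t: math.exp(s - m) for t, s in log_scores.items()}
        z = sum(exp_scores.values())
        return {t: p / z for t, p in exp_scores.items()}

def ensemble_posterior(classifiers, contexts):
    """Average the posteriors of classifiers trained on nested context windows."""
    combined = defaultdict(float)
    for clf, ctx in zip(classifiers, contexts):
        for t, p in clf.posterior(ctx).items():
            combined[t] += p / len(classifiers)
    return dict(combined)

Here each classifier in the ensemble is trained on, and applied to, one of the nested windows around the target word, and the resulting posteriors are simply averaged.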
<Section position="4" start_page="4" end_page="6" type="sub_section">
<SectionTitle> 2.4 Monolingual Bootstrapping </SectionTitle>
<Paragraph position="0"> Since data preparation for supervised learning is expensive, it is desirable to develop bootstrapping methods. Yarowsky (1995) proposed such a method for word sense disambiguation, which we refer to as monolingual bootstrapping.</Paragraph>
<Paragraph position="1"> Let L_ε denote a set of classified instances (labeled data) in English, each representing one context of ε together with its sense label, and let U_ε denote a set of unclassified instances (unlabeled data) in English, each representing one context of ε. The monolingual bootstrapping algorithm, described in Figure 2, performs disambiguation for all the words in E. Note that we can employ any kind of classifier here.</Paragraph>
<Paragraph position="2"> At step 1, for each ambiguous word ε we create binary classifiers for resolving its ambiguities (cf. lines 1-3 of Figure 2). At step 2, we use the classifiers for each word ε to select some unclassified instances from U_ε, classify them, and add them to L_ε (cf. lines 4-19). We repeat the process until all the data are classified. Lines 9-13 show that for each unclassified instance e_ε, we classify it as having sense t if t's posterior odds are the largest among the possible senses and are larger than a threshold th. For each class t, we store the classified instances in S_t. The following lines show that for each class t, we choose only the top b classified instances in terms of the posterior odds and store them in Q_t. Lines 16-17 show that we create the classified instances by combining the instances with their classification labels.</Paragraph>
<Paragraph position="3"> After line 17, we can employ the one-sense-per-discourse heuristic to further classify unclassified data, as proposed in Yarowsky (1995). This heuristic is based on the observation that when an ambiguous word appears several times in the same text, its tokens usually refer to the same sense. In the bootstrapping process, for each newly classified instance, we automatically assign its class label to those unclassified instances that contain the same ambiguous word and co-occur with it in the same text.</Paragraph>
<Paragraph position="4"> Hereafter, we will refer to this method as monolingual bootstrapping with one sense per discourse. This method can be viewed as a special case of co-training (Blum and Mitchell 1998).</Paragraph>
</Section>
<Section position="5" start_page="6" end_page="6" type="sub_section">
<SectionTitle> 2.5 Co-training </SectionTitle>
<Paragraph position="0"> Monolingual bootstrapping augmented with the one-sense-per-discourse heuristic can be viewed as a special case of co-training, as proposed by Blum and Mitchell (1998) (see also Collins and Singer 1999; Nigam et al. 2000; and Nigam and Ghani 2000). Co-training conducts two bootstrapping processes in parallel and makes them collaborate with each other. More specifically, co-training begins with a small amount of classified data and a large amount of unclassified data. It trains two classifiers from the classified data, uses each of the two classifiers to classify some unclassified data, makes the two classifiers exchange their newly classified data, and repeats the process.</Paragraph>
</Section>
</Section>
</Paper>
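Below is a minimal Python sketch of the monolingual bootstrapping loop of Section 2.4 and of the labeled-data exchange that characterizes co-training in Section 2.5. It assumes generic train/classify functions, a posterior-odds threshold th, and a per-sense growth size b; it abstracts away the exact steps of Figure 2, and in genuine co-training the two classifiers would be built on two different views of the data.

def posterior_odds(posteriors, t):
    """Odds P(t | e) / P(not t | e) computed from a posterior distribution over senses."""
    p = posteriors[t]
    return p / max(1.0 - p, 1e-12)

def bootstrap(train, classify, labeled, unlabeled, th=2.0, b=5, max_iter=50):
    """Monolingual bootstrapping: grow the labeled set from confident predictions.

    train(labeled) -> classifier; classify(classifier, instance) -> {sense: prob}.
    th is the posterior-odds threshold; b is the number of instances kept per sense.
    """
    labeled = list(labeled)
    unlabeled = list(unlabeled)
    for _ in range(max_iter):
        if not unlabeled:
            break
        clf = train(labeled)
        candidates = {}                      # sense -> list of (odds, instance)
        for x in unlabeled:
            post = classify(clf, x)
            t = max(post, key=post.get)      # sense with the largest posterior odds
            odds = posterior_odds(post, t)
            if odds > th:
                candidates.setdefault(t, []).append((odds, x))
        newly_labeled = []
        for t, items in candidates.items():
            items.sort(reverse=True, key=lambda pair: pair[0])
            newly_labeled.extend((x, t) for _, x in items[:b])   # keep top b per sense
        if not newly_labeled:
            break
        labeled.extend(newly_labeled)
        chosen = {id(x) for x, _ in newly_labeled}
        unlabeled = [x for x in unlabeled if id(x) not in chosen]
    return labeled

def co_train(train, classify, labeled1, labeled2, unlabeled, rounds=10):
    """Co-training: two bootstrapping processes that exchange their new labels.

    For simplicity both processes use the same train/classify functions here;
    in co-training proper they would rely on two different views of the data.
    """
    for _ in range(rounds):
        grown1 = bootstrap(train, classify, labeled1, unlabeled, max_iter=1)
        grown2 = bootstrap(train, classify, labeled2, unlabeled, max_iter=1)
        new1 = grown1[len(labeled1):]
        new2 = grown2[len(labeled2):]
        # Each classifier receives what the other has just labeled.
        labeled1 = labeled1 + new2
        labeled2 = labeled2 + new1
    return labeled1, labeled2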