<?xml version="1.0" standalone="yes"?> <Paper uid="P06-1012"> <Title>Estimating Class Priors in Domain Adaptation for Word Sense Disambiguation</Title> <Section position="4" start_page="89" end_page="91" type="metho"> <SectionTitle> 2 Estimation of Priors </SectionTitle> <Paragraph position="0"> To estimate the sense priors, or a priori probabilities of the different senses in a new dataset, we used a confusion matrix algorithm (Vucetic and Obradovic, 2001) and an EM based algorithm (Saerens et al., 2002) in (Chan and Ng, 2005b).</Paragraph> <Paragraph position="1"> Our results in (Chan and Ng, 2005b) indicate that the EM based algorithm is effective in estimating the sense priors and achieves greater improvements in WSD accuracy than the confusion matrix algorithm. Hence, to estimate the sense priors in our current work, we use the EM based algorithm, which we describe in this section.</Paragraph> <Section position="1" start_page="89" end_page="90" type="sub_section"> <SectionTitle> 2.1 EM Based Algorithm </SectionTitle> <Paragraph position="0"> Most of this section is based on (Saerens et al., 2002). Assume we have a set of labeled data $D_L$ with $n$ classes and a set of $N$ independent instances $(x_1, \ldots, x_N)$ from a new data set. The likelihood of these $N$ instances can be defined as:</Paragraph> <Paragraph position="1"> $L(x_1, \ldots, x_N) = \prod_{k=1}^{N} p(x_k) = \prod_{k=1}^{N} \left[ \sum_{i=1}^{n} p(x_k|\omega_i)\, p(\omega_i) \right]$ (1)</Paragraph> <Paragraph position="2"> Assuming the within-class densities $p(x_k|\omega_i)$, i.e., the probabilities of observing $x_k$ given the class $\omega_i$, do not change from the training set $D_L$ to the new data set, we can define $p(x_k|\omega_i) = p_L(x_k|\omega_i)$. We then wish to determine the a priori probability estimates $\hat{p}(\omega_i)$, $i = 1, \ldots, n$, of the new data set that will maximize the likelihood of (1) with respect to $p(\omega_i)$.</Paragraph> <Paragraph position="4"> To do so, we can apply the iterative procedure of the EM algorithm. In effect, through maximizing the likelihood of (1), we obtain the a priori probability estimates as a by-product.</Paragraph> <Paragraph position="5"> Let us now define some notations.
When we apply a classifier trained on $D_L$ to an instance $x_k$ of the new data set, we get $\hat{p}_L(\omega_i|x_k)$, the posterior probability of class $\omega_i$ assigned to $x_k$ by this classifier.</Paragraph> <Paragraph position="7"> Further, let us define $\hat{p}_L(\omega_i)$ as the a priori probabilities of class $\omega_i$ in $D_L$. This can be estimated by the class frequency of $\omega_i$ in $D_L$.</Paragraph> <Paragraph position="9"> We also define $\hat{p}^{(s)}(\omega_i)$ and $\hat{p}^{(s)}(\omega_i|x_k)$ as estimates of the new a priori and a posteriori probabilities at step $s$ of the iterative EM procedure. Assuming we initialize $\hat{p}^{(0)}(\omega_i) = \hat{p}_L(\omega_i)$, the EM algorithm provides the following iterative steps:</Paragraph> <Paragraph position="11"> $\hat{p}^{(s)}(\omega_i|x_k) = \dfrac{\hat{p}_L(\omega_i|x_k)\, \hat{p}^{(s)}(\omega_i) / \hat{p}_L(\omega_i)}{\sum_{j=1}^{n} \hat{p}_L(\omega_j|x_k)\, \hat{p}^{(s)}(\omega_j) / \hat{p}_L(\omega_j)}$ (2) $\qquad \hat{p}^{(s+1)}(\omega_i) = \frac{1}{N} \sum_{k=1}^{N} \hat{p}^{(s)}(\omega_i|x_k)$ (3)</Paragraph> <Paragraph position="13"> where Equation (2) represents the expectation E-step, Equation (3) represents the maximization M-step, and $N$ represents the number of instances in the new data set.</Paragraph> <Paragraph position="14"> In our earlier work (Chan and Ng, 2005b), the posterior probabilities assigned by a naive Bayes classifier are used by the EM procedure described in the previous section to estimate the sense priors $\hat{p}(\omega_i)$ in a new dataset. However, it is known that the posterior probabilities assigned by naive Bayes are not well calibrated (Domingos and Pazzani, 1996).</Paragraph> <Paragraph position="16"> It is important to use an algorithm which gives well calibrated probabilities, if we are to use the probabilities in estimating the sense priors. In this section, we first describe the notion of being well calibrated, before discussing why having well calibrated probabilities helps in estimating the sense priors. Finally, we introduce a method used to calibrate the probabilities from naive Bayes.</Paragraph> </Section> <Section position="2" start_page="90" end_page="90" type="sub_section"> <SectionTitle> 3.1 Well Calibrated Probabilities </SectionTitle> <Paragraph position="0"> Assume that for each instance $x$, a classifier outputs a probability $S_i(x)$ between 0 and 1 of $x$ belonging to class $\omega_i$. The classifier is well calibrated if the empirical class membership probability $p(\omega_i|S_i(x) = s)$ converges to the probability value $s$ as the number of instances classified goes to infinity (Zadrozny and Elkan, 2002).
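To make the estimation procedure of Section 2.1 concrete, the following is a minimal sketch of the EM iteration of Equations (2) and (3). This is our own illustration with NumPy, not the authors' implementation; the function name and convergence settings are assumptions.

```python
import numpy as np

def em_sense_priors(post_train, priors_train, n_iter=100, tol=1e-8):
    """Estimate class priors on a new dataset via EM (after Saerens et al., 2002).

    post_train[k, i] is p_L(w_i | x_k), the posterior assigned to instance x_k
    by a classifier trained on the labeled set D_L; priors_train[i] is p_L(w_i).
    Returns (estimated_priors, adjusted_posteriors).
    """
    priors = priors_train.copy()    # initialize p^(0)(w_i) = p_L(w_i)
    post = post_train.copy()
    for _ in range(n_iter):
        # E-step, Equation (2): rescale each posterior by the ratio of the
        # current prior estimate to the training prior, then renormalize.
        scaled = post_train * (priors / priors_train)
        post = scaled / scaled.sum(axis=1, keepdims=True)
        # M-step, Equation (3): the new prior is the mean adjusted posterior.
        new_priors = post.mean(axis=0)
        converged = not np.any(np.abs(new_priors - priors) > tol)
        priors = new_priors
        if converged:
            break
    return priors, post
```

With one-hot posteriors, for instance, the procedure simply recovers the observed class proportions of the new dataset.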
Intuitively, if we consider all the instances to which the classifier assigns a probability $S_i(x)$ of say 0.6, then 60% of these instances should be members of class $\omega_i$.</Paragraph> </Section> <Section position="3" start_page="90" end_page="91" type="sub_section"> <SectionTitle> 3.2 Being Well Calibrated Helps Estimation </SectionTitle> <Paragraph position="0"> To see why using an algorithm which gives well calibrated probabilities helps in estimating the sense priors, let us rewrite Equation (3), the M-step of the EM procedure, as the following:</Paragraph> <Paragraph position="2"> $\hat{p}^{(s+1)}(\omega_i) = \frac{1}{N} \sum_{p_j \in S_{\omega_i}} \; \sum_{x_k : \hat{p}^{(s)}(\omega_i|x_k) = p_j} p_j$ (4)</Paragraph> <Paragraph position="4"> where $S_{\omega_i} = \{p_1, \ldots, p_M\}$ denotes the set of posterior probability values for class $\omega_i$, and $\hat{p}^{(s)}(\omega_i|x_k)$ denotes the posterior probability of class $\omega_i$ assigned by the classifier to instance $x_k$.</Paragraph> <Paragraph position="6"> Based on $p_1, \ldots, p_M$, we can imagine that we have $M$ bins, where each bin is associated with a specific $p_j$ value. Now, distribute all the instances in the new dataset into the $M$ bins according to their posterior probabilities $\hat{p}^{(s)}(\omega_i|x_k)$. Let $B_j$, for $j = 1, \ldots, M$, denote the set of instances in bin $j$, and let $A_j$ denote the proportion of instances with true class label $\omega_i$ in $B_j$. Given a well calibrated classifier, $A_j = p_j$ by definition, and Equation (4) can be rewritten as:</Paragraph> <Paragraph position="10"> $\hat{p}^{(s+1)}(\omega_i) = \frac{1}{N} \left( |B_1|\, A_1 + \cdots + |B_M|\, A_M \right) = \frac{n_{\omega_i}}{N}$ (5)</Paragraph> <Paragraph position="12"> where $n_{\omega_i}$ denotes the number of instances in the new dataset with true class label $\omega_i$.</Paragraph> <Paragraph position="13"> [Figure 1: The PAV algorithm. Input: a training set of (probability, label) pairs sorted in ascending order of probability. Initialize each value as its own level set; while two adjacent level sets are out of order, pool them into a single set whose value is the weighted mean of the pooled values.]</Paragraph> <Paragraph position="14"> Thus $\hat{p}^{(s+1)}(\omega_i)$ reflects the proportion of instances in the new dataset with true class label $\omega_i$.
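As a small numeric check of the binning argument above (the numbers are our own toy example), the M-step average of posterior probabilities recovers the true class proportion exactly when those probabilities are perfectly calibrated:

```python
# Two bins of a perfectly calibrated classifier: of the 5 instances assigned
# probability 0.2, exactly 1 truly belongs to the class; of the 5 assigned
# probability 0.6, exactly 3 do (so A_j = p_j in each bin).
posteriors = [0.2] * 5 + [0.6] * 5
labels = [1, 0, 0, 0, 0] + [1, 1, 1, 0, 0]

n = len(posteriors)
m_step_estimate = sum(posteriors) / n   # (1/N) sum_k p(w_i|x_k), Equation (3)
true_proportion = sum(labels) / n       # n_{w_i} / N
print(m_step_estimate, true_proportion) # both 0.4 (up to float rounding)
```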
Hence, using an algorithm which gives well calibrated probabilities helps in the estimation of sense priors.</Paragraph> </Section> <Section position="4" start_page="91" end_page="91" type="sub_section"> <SectionTitle> 3.3 Isotonic Regression </SectionTitle> <Paragraph position="0"> Zadrozny and Elkan (2002) successfully used a method based on isotonic regression (Robertson et al., 1988) to calibrate the probability estimates from naive Bayes. To compute the isotonic regression, they used the pair-adjacent violators (PAV) algorithm (Ayer et al., 1955), which we show in Figure 1. Briefly, PAV initially views each data value as a level set. While there are two adjacent sets that are out of order (i.e., the left level set is above the right one), the sets are combined and the mean of the data values becomes the value of the new level set.</Paragraph> <Paragraph position="1"> PAV works on binary class problems, in which we have a positive class and a negative class. Let $p_k$ denote the probability that instance $x_k$ belongs to the positive class, as predicted by a classifier, and let $y_k$ represent the true label of $x_k$: we let $y_k = 1$ if $x_k$ is a positive example and $y_k = 0$ if $x_k$ is a negative example. The PAV algorithm takes in a set of $(p_k, y_k)$ pairs sorted in ascending order of $p_k$ and, for each $p_k$, assigns the value of its final level set as the calibrated probability estimate. To apply PAV to a multiclass problem, we first reduce the problem into a number of binary class problems. For this reduction, experiments in (Zadrozny and Elkan, 2002) suggest that the one-against-all approach works well. In one-against-all, a separate classifier is trained for each class $\omega_i$, where examples belonging to class $\omega_i$ are treated as positive examples and all other examples are treated as negative examples.
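The PAV procedure just described can be sketched as follows. This is a minimal illustration; the function name and the list-based level-set representation are our own assumptions, and the input is assumed non-empty and pre-sorted.

```python
def pav_calibrate(pairs):
    """Pair-adjacent violators (Ayer et al., 1955), as used for calibration by
    Zadrozny and Elkan (2002). `pairs` holds (p_k, y_k) tuples, sorted in
    ascending order of p_k, with y_k in {0, 1}. Returns one calibrated
    probability per input pair (a non-decreasing sequence).
    """
    # Each level set is [value, weight]; initially one level set per example.
    level_sets = [[float(y), 1] for _, y in pairs]
    i = 0
    while i + 1 != len(level_sets):
        left, right = level_sets[i], level_sets[i + 1]
        if left[0] > right[0]:  # adjacent level sets out of order: pool them
            weight = left[1] + right[1]
            value = (left[0] * left[1] + right[0] * right[1]) / weight
            level_sets[i : i + 2] = [[value, weight]]
            i = max(i - 1, 0)   # pooling may create a new violation on the left
        else:
            i += 1
    # Every example in a level set receives that set's value.
    return [value for value, weight in level_sets for _ in range(weight)]
```

For example, labels (0, 1, 0, 1) at increasing scores calibrate to the non-decreasing sequence (0, 0.5, 0.5, 1).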
A separate classifier is then learnt for each binary class problem, and the probability estimates from each classifier are calibrated. Finally, the calibrated binary-class probability estimates are combined into multiclass probabilities by a simple normalization of the calibrated estimates from the binary classifiers, as suggested by Zadrozny and Elkan (2002).</Paragraph> </Section> </Section> <Section position="5" start_page="91" end_page="92" type="metho"> <SectionTitle> 4 Selection of Dataset </SectionTitle> <Paragraph position="0"> In this section, we discuss our motivations for choosing the particular corpora and the set of words used in our experiments.</Paragraph> <Section position="1" start_page="91" end_page="91" type="sub_section"> <SectionTitle> 4.1 DSO Corpus </SectionTitle> <Paragraph position="0"> The DSO corpus (Ng and Lee, 1996) contains 192,800 annotated examples for 121 nouns and 70 verbs, drawn from BC and WSJ. BC was built as a balanced corpus and contains texts in various categories such as religion, fiction, etc. In contrast, the focus of the WSJ corpus is on financial and business news. Escudero et al. (2000) exploited the difference in coverage between these two corpora to separate the DSO corpus into its BC and WSJ parts for investigating the domain dependence of several WSD algorithms. Following their setup, we also use the DSO corpus in our experiments.</Paragraph> <Paragraph position="1"> The widely used SEMCOR (SC) corpus (Miller et al., 1994) is one of the few manually sense-annotated corpora currently available for WSD.</Paragraph> <Paragraph position="2"> SEMCOR is a subset of BC.
Since BC is a balanced corpus, and since training a classifier on a general corpus before applying it to a more specific corpus is a natural scenario, we use examples from BC as training data and examples from WSJ as evaluation data, i.e., the target dataset.</Paragraph> </Section> <Section position="2" start_page="91" end_page="92" type="sub_section"> <SectionTitle> 4.2 Parallel Texts </SectionTitle> <Paragraph position="0"> Scalability is a problem faced by current supervised WSD systems, as they usually rely on manually annotated data for training. To tackle this problem, in recent work (Ng et al., 2003), we gathered training data from parallel texts and obtained encouraging results in our evaluation on the nouns of the SENSEVAL-2 English lexical sample task (Kilgarriff, 2001). In another recent evaluation, on the nouns of the SENSEVAL-2 English all-words task (Chan and Ng, 2005a), promising results were also achieved using examples gathered from parallel texts. Given the potential of parallel texts in addressing the issue of scalability, we also drew training data for our earlier sense priors estimation experiments (Chan and Ng, 2005b) from parallel texts. In addition, our parallel texts training data represents a natural domain difference from the test data of the SENSEVAL-2 English lexical sample task, of which 91% is drawn from the British National Corpus (BNC).</Paragraph> <Paragraph position="1"> As part of our experiments, we followed the experimental setup of our earlier work (Chan and Ng, 2005b), using the same 6 English-Chinese parallel corpora (Hong Kong Hansards, Hong Kong News, Hong Kong Laws, Sinorama, Xinhua News, and the English translation of Chinese Treebank), available from the Linguistic Data Consortium. To gather training examples from these parallel texts, we used the approach described in (Ng et al., 2003) and (Chan and Ng, 2005b).
We then evaluated our estimation of sense priors on the nouns of the SENSEVAL-2 English lexical sample task, similar to the evaluation we conducted in (Chan and Ng, 2005b). Since the test data for the nouns of the SENSEVAL-3 English lexical sample task (Mihalcea et al., 2004) were also drawn from the BNC and likewise represented a difference in domain from the parallel texts we used, we also expanded our evaluation to these SENSEVAL-3 nouns.</Paragraph> </Section> <Section position="3" start_page="92" end_page="92" type="sub_section"> <SectionTitle> 4.3 Choice of Words </SectionTitle> <Paragraph position="0"> Research by McCarthy et al. (2004) highlighted that the sense priors of a word in a corpus depend on the domain from which the corpus is drawn.</Paragraph> <Paragraph position="1"> A change of predominant sense is often indicative of a change in domain, as different corpora drawn from different domains usually give different predominant senses. For example, the predominant sense of the noun interest in the BC part of the DSO corpus has the meaning &quot;a sense of concern with and curiosity about someone or something&quot;.</Paragraph> <Paragraph position="2"> In the WSJ part of the DSO corpus, the noun interest has a different predominant sense with the meaning &quot;a fixed charge for borrowing money&quot;, reflecting the business and finance focus of the WSJ corpus.</Paragraph> <Paragraph position="3"> Estimation of sense priors is important when there is a significant change in sense priors between the training and target datasets, such as when there is a change in domain between the datasets.</Paragraph> <Paragraph position="4"> Hence, in our experiments involving the DSO corpus, we focused on the set of nouns and verbs which had different predominant senses between the BC and WSJ parts of the corpus. This gave us a set of 37 nouns and 28 verbs.
For experiments involving the nouns of the SENSEVAL-2 and SENSEVAL-3 English lexical sample tasks, we used the approach we described in (Chan and Ng, 2005b) of sampling training examples from the parallel texts using the natural (empirical) distribution of examples in the parallel texts. We then focused on the set of nouns having different predominant senses between the examples gathered from parallel texts and the evaluation data for the two SENSEVAL tasks. This gave a set of 6 nouns for SENSEVAL-2 and 9 nouns for SENSEVAL-3. For each noun, we gathered a maximum of 500 parallel text examples as training data, similar to what we had done in (Chan and Ng, 2005b).</Paragraph> </Section> </Section> <Section position="6" start_page="92" end_page="93" type="metho"> <SectionTitle> 5 Experimental Results </SectionTitle> <Paragraph position="0"> As in our previous work (Chan and Ng, 2005b), we used the supervised WSD approach described in (Lee and Ng, 2002) for our experiments, with the naive Bayes algorithm as our classifier. Knowledge sources used include parts-of-speech, surrounding words, and local collocations. This approach achieves state-of-the-art accuracy. All accuracies reported in our experiments are micro-averages over all test examples.</Paragraph> <Paragraph position="1"> In (Chan and Ng, 2005b), we used a multiclass naive Bayes classifier (denoted by NB) for each word. Following this approach, we noted the WSD accuracies achieved without any adjustment in the column L under NB in Table 1. The predictions of these classifiers are then used by the EM procedure of Section 2.1 to estimate the sense priors $\hat{p}(\omega_i)$, before being adjusted by these estimated sense priors according to the E-step of Equation (2). The resulting WSD accuracies after adjustment are listed in the column EM under NB in Table 1, representing the WSD accuracies achievable by following the approach we described in (Chan and Ng, 2005b).</Paragraph> <Paragraph position="2"> Next, we used the one-against-all approach to reduce each multiclass problem into a set of binary class problems.
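The combination step of Section 3.3, in which calibrated one-against-all estimates are normalized into a multiclass distribution, can be sketched as follows. The function name and the example senses are our own illustration, not the authors' code.

```python
def combine_one_vs_all(calibrated):
    """Combine calibrated one-against-all probability estimates into
    multiclass probabilities by simple normalization, as suggested by
    Zadrozny and Elkan (2002). `calibrated` maps each sense to the
    calibrated positive-class probability of one instance.
    """
    total = sum(calibrated.values())
    return {sense: p / total for sense, p in calibrated.items()}

# Hypothetical calibrated estimates for two senses of the noun "interest":
scores = {"concern": 0.2, "charge": 0.6}
print(combine_one_vs_all(scores))  # probabilities in ratio 1:3, summing to 1
```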
We trained a naive Bayes classifier for each binary class problem and calibrated the probabilities from these binary classifiers. The WSD accuracies of these calibrated naive Bayes classifiers (denoted by NBcal) are given in the column L under NBcal.1 The predictions of these classifiers are then used to estimate the sense priors $\hat{p}(\omega_i)$, before being adjusted by these estimated sense priors; the resulting accuracies are listed in the column EM under NBcal in Table 1.</Paragraph> <Paragraph position="3"> The results show that calibrating the probabilities improves WSD accuracy. In particular, EM under NBcal achieves the highest accuracy among the methods described so far.</Paragraph> <Paragraph position="5"> To provide a basis for comparison, we also adjusted the calibrated probabilities by the true sense priors of the test data. The increase in WSD accuracy thus obtained is given in the column True - L in Table 2. Note that this represents the maximum possible increase in accuracy achievable, provided we know the true sense priors. The relative improvement obtained with our estimated sense priors, compared to using the true sense priors, is 2.6/10.3 = 25.2%, as shown in Table 2.</Paragraph> </Section> </Paper>