<?xml version="1.0" standalone="yes"?> <Paper uid="W06-1649"> <Title>Sydney, July 2006. c(c)2006 Association for Computational Linguistics Partially Supervised Sense Disambiguation by Learning Sense Number from Tagged and Untagged Corpora</Title> <Section position="4" start_page="0" end_page="415" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In this paper, we address the problem of partially supervised word sense disambiguation, which is to disambiguate the senses of occurrences of a target word in untagged texts when given incomplete tagged corpus 1.</Paragraph> <Paragraph position="1"> Word sense disambiguation can be de ned as associating a target word in a text or discourse 1 incomplete tagged corpus means that tagged corpus does not include the instances of some senses for the target word, while these senses may occur in untagged texts. with a de nition or meaning. Many corpus based methods have been proposed to deal with the sense disambiguation problem when given de nition for each possible sense of a target word or a tagged corpus with the instances of each possible sense, e.g., supervised sense disambiguation (Leacock et al., 1998), and semi-supervised sense disambiguation (Yarowsky, 1995).</Paragraph> <Paragraph position="2"> Supervised methods usually rely on the information from previously sense tagged corpora to determine the senses of words in unseen texts.</Paragraph> <Paragraph position="3"> Semi-supervised methods for WSD are characterized in terms of exploiting unlabeled data in the learning procedure with the need of predened sense inventories for target words. The information for semi-supervised sense disambiguation is usually obtained from bilingual corpora (e.g. parallel corpora or untagged monolingual corpora in two languages) (Brown et al., 1991; Dagan and Itai, 1994), or sense-tagged seed examples (Yarowsky, 1995).</Paragraph> <Paragraph position="4"> Some observations can be made on the previous supervised and semi-supervised methods. They always rely on hand-crafted lexicons (e.g., Word-Net) as sense inventories. But these resources may miss domain-speci c senses, which leads to incomplete sense tagged corpus. Therefore, sense taggers trained on the incomplete tagged corpus will misclassify some instances if the senses of these instances are not de ned in sense inventories. For example, one performs WSD in information technology related texts using WordNet 2 as sense inventory. When disambiguating the word boot in the phrase boot sector , the sense tagger will assign this instance with one of the senses of boot listed in WordNet. But the correct sense loading operating system into memory is not included in WordNet. Therefore, this instance will be associated with an incorrect sense.</Paragraph> <Paragraph position="5"> So, in this work, we would like to study the problem of partially supervised sense disambiguation with an incomplete sense tagged corpus. Speci cally, given an incomplete sense-tagged corpus and a large amount of untagged examples for a target word 3, we are interested in (1) labeling the instances in the untagged corpus with sense tags occurring in the tagged corpus; (2) trying to nd unde ned senses (or new senses) of the target word 4 from the untagged corpus, which will be represented by instances from the untagged corpus. 
We propose an automatic method to estimate the number of senses (sense number, or model order) of a target word in mixed data (tagged corpus + untagged corpus) by maximizing a stability criterion defined on the classification result over all possible values of the sense number. At the same time, we obtain a classification of the mixed data with the optimal number of groups. If the estimated sense number in the mixed data is equal to the sense number of the target word in the tagged corpus, then there is no new sense in the untagged corpus. Otherwise, new senses will be represented by groups in which there is no instance from the tagged corpus.</Paragraph> <Paragraph position="6"> This partially supervised sense disambiguation algorithm may help enrich manually compiled lexicons by inducing new senses from untagged corpora.</Paragraph> <Paragraph position="7"> This paper is organized as follows. First, a model order identification algorithm for partially supervised sense disambiguation is presented in section 2. Section 3 provides experimental results of this algorithm for sense disambiguation on SENSEVAL-3 data. Then related work on partially supervised classification is summarized in section 4. Finally, we conclude our work and suggest possible improvements in section 5.</Paragraph> </Section> <Section position="5" start_page="415" end_page="417" type="metho"> <SectionTitle> 2 Partially Supervised Word Sense Disambiguation </SectionTitle> <Paragraph position="0"> The partially supervised sense disambiguation problem can be generalized as a model order identification problem. We try to estimate the sense number of a target word in mixed data (tagged corpus + untagged corpus) by maximizing a stability criterion defined on classification results over all possible values of the sense number. If the estimated sense number in the mixed data is equal to the sense number in the tagged corpus, then there is no new sense in the untagged corpus. Otherwise, new senses will be represented by clusters in which there is no instance from the tagged corpus. The stability criterion assesses the agreement between classification results on the full mixed data and on sampled mixed data. A partially supervised classification algorithm is used to classify the full or sampled mixed data into a given number of classes before the stability assessment; this algorithm is presented in section 2.1. The details of the model order identification procedure are then provided in section 2.2.</Paragraph>
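The new-sense test just described is mechanical once a partition of the mixed data is available: any group that contains no instance from the tagged corpus is a candidate new sense. The following is a minimal sketch in Python (illustrative only; the function and variable names are ours, not the paper's):

    # Identify candidate new senses from a partition of the mixed data.
    # `assignments[i]` is the group index (0..k-1) of the i-th instance;
    # instances with index < n_tagged come from the tagged corpus.
    def find_new_sense_groups(assignments, n_tagged, k):
        groups_with_tagged = {g for i, g in enumerate(assignments) if i < n_tagged}
        # Groups with no tagged instance represent candidate new senses.
        return [g for g in range(k) if g not in groups_with_tagged]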
<Section position="1" start_page="415" end_page="415" type="sub_section"> <SectionTitle> 2.1 An Extended Label Propagation Algorithm </SectionTitle> <Paragraph position="0"> Table 1. The extended label propagation algorithm.
Function: ELP($D_L$, $D_U$, $k$, $Y^0_{D_L+D_U}$)
Input: labeled examples $D_L$, unlabeled examples $D_U$, model order $k$, initial labeling matrix $Y^0_{D_L+D_U}$;
Output: the labeling matrix $Y_{D_U}$ on $D_U$;
1 If $k < k_{X_L}$ then $Y_{D_U}$ = NULL;
2 Else if $k = k_{X_L}$ then run the plain label propagation algorithm on $D_U$, with $Y_{D_U}$ as output;
3 Else
  3.1 Estimate the size of the tagged data set for the new classes;
  3.2 Generate tagged examples from $D_U$ for the $(k_{X_L}+1)$-th to $k$-th new classes;
  3.3 Run the plain label propagation algorithm on $D_U$ with the augmented tagged dataset as labeled data;
  3.4 $Y_{D_U}$ is the output of the plain label propagation algorithm;
End if
4 Return $Y_{D_U}$;</Paragraph> <Paragraph position="1"> Let $X_{L+U} = \{x_i\}_{i=1}^{n}$ be a set of contexts of occurrences of an ambiguous word $w$, where $x_i$ represents the context of the $i$-th occurrence and $n$ is the total number of this word's occurrences. Let $S_L = \{s_j\}_{j=1}^{c}$ denote the sense tag set of $w$ in $X_L$, where $X_L$ denotes the first $l$ examples $x_g$ ($1 \le g \le l$) that are labeled as $y_g$ ($y_g \in S_L$). Let $X_U$ denote the other $u$ ($l + u = n$) examples $x_h$ ($l+1 \le h \le n$) that are unlabeled.</Paragraph> <Paragraph position="2"> Let $Y^0_{X_{L+U}} \in N^{|X_{L+U}| \times |S_L|}$ represent the initial soft labels attached to tagged instances, where $Y^0_{X_{L+U},ij} = 1$ if $y_i$ is $s_j$ and 0 otherwise. Let $Y^0_{X_L}$ be the top $l$ rows of $Y^0_{X_{L+U}}$ and $Y^0_{X_U}$ be the remaining $u$ rows. $Y^0_{X_L}$ is consistent with the labeling in the labeled data, and the initialization of $Y^0_{X_U}$ can be arbitrary.</Paragraph> <Paragraph position="3"> Let $k$ denote a possible value of the number of senses in the mixed data $X_{L+U}$, and $k_{X_L}$ the number of senses in the initial tagged data $X_L$. Note that $k_{X_L} = |S_L|$ and $k \ge k_{X_L}$.</Paragraph> <Paragraph position="4"> The classification algorithm in the order identification process should be able to accept labeled data $D_L$, unlabeled data $D_U$ and model order $k$ as input, and assign a class label or a cluster index to each instance in $D_U$ as output. Previous supervised or semi-supervised algorithms (e.g., SVM, or the label propagation algorithm (Zhu and Ghahramani, 2002)) cannot classify the examples in $D_U$ into $k$ groups if $k > k_{X_L}$. The semi-supervised k-means clustering algorithm (Wagstaff et al., 2001) may be used to perform clustering analysis on mixed data, but its efficiency is a problem on very large datasets, since multiple restarts are usually required to avoid local optima and multiple iterations are run in each clustering process to optimize a clustering solution.</Paragraph> <Paragraph position="5"> In this work, we propose an alternative method, an extended label propagation algorithm (ELP), which can classify the examples in $D_U$ into $k$ groups. If the value of $k$ is equal to $k_{X_L}$, then ELP is identical to the plain label propagation algorithm (LP) (Zhu and Ghahramani, 2002). Otherwise, if the value of $k$ is greater than $k_{X_L}$, we perform classification by the following steps: (1) estimate the dataset size of each new class, $size_{new\_class}$, by identifying the examples of new classes using the Spy technique and assuming that new classes are equally distributed (our re-implementation of this technique consists of three steps: (i) sample a small subset $D^s_L$ of size $15\% \times |D_L|$ from $D_L$; (ii) train a classifier with tagged data $D_L - D^s_L$; (iii) classify $D_U$ and $D^s_L$, and then select as the dataset of new classes those examples from $D_U$ whose classification confidence is less than the average classification confidence in $D^s_L$, where the classification confidence of an example $x_i$ is defined as the absolute value of the difference between the two maximum values in the $i$-th row of the labeling matrix); (2) set $D'_L = D_L$, $D'_U = D_U$; (3) remove the tagged examples of the $m$-th new class ($k_{X_L}+1 \le m \le k$) from $D'_L$ and train a classifier on this labeled dataset without the $m$-th class (initially there are no tagged examples for the $m$-th class in $D'_L$, so we do not need to remove tagged examples for this new class and directly train a classifier with $D'_L$); (4) the classifier is then used to classify the examples in $D'_U$; (5) the least confidently classified unlabeled point $x_{class\,m} \in D'_U$, together with its label $m$, is added to the labeled data: $D'_L = D'_L + \{x_{class\,m}\}$, $D'_U = D'_U - \{x_{class\,m}\}$; (6) steps (3) to (5) are repeated for each new class until the augmented tagged data set is large enough (here we try to select $size_{new\_class}/4$ examples with their sense tags as tagged data for each new class); (7) use the plain LP algorithm to classify the remaining unlabeled data $D'_U$ with $D'_L$ as labeled data. Table 1 shows this extended label propagation algorithm.</Paragraph> <Paragraph position="6"> Next we provide the details of the plain label propagation algorithm. Define $W_{ij} = \exp(-\frac{d_{ij}^2}{\sigma^2})$ if $i \ne j$ and $W_{ii} = 0$ ($1 \le i, j \le |D_L + D_U|$), where $d_{ij}$ is the distance (e.g., Euclidean distance) between the examples $x_i$ and $x_j$, and $\sigma$ is used to control the weight $W_{ij}$. Define a $|D_L+D_U| \times |D_L+D_U|$ probability transition matrix $T$ by $T_{ij} = W_{ij} / \sum_{k=1}^{|D_L+D_U|} W_{kj}$, where $T_{ij}$ is the probability of jumping from example $x_j$ to example $x_i$. Compute the row-normalized matrix $\bar{T}$ by $\bar{T}_{ij} = T_{ij} / \sum_{k=1}^{|D_L+D_U|} T_{ik}$. The labeling matrix on $D_U$ is then given by $Y_{D_U} = (I - \bar{T}_{uu})^{-1} \bar{T}_{ul} Y^0_{D_L}$, where $I$ is a $|D_U| \times |D_U|$ identity matrix, and $\bar{T}_{uu}$ and $\bar{T}_{ul}$ are acquired by splitting the matrix $\bar{T}$ after the $|D_L|$-th row and the $|D_L|$-th column into four sub-matrices.</Paragraph>
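As a concrete illustration, the closed-form solution above can be computed directly with matrix operations. The following is a minimal numpy sketch of the plain LP step, assuming Euclidean distances and dense matrix inversion; it is our reading of the equations above, not the authors' implementation:

    import numpy as np

    def label_propagation(X_l, Y_l, X_u, sigma=1.0):
        # Plain label propagation (Zhu and Ghahramani, 2002), closed form.
        # X_l: (l, d) labeled vectors; Y_l: (l, c) one-hot labels;
        # X_u: (u, d) unlabeled vectors. Returns (u, c) soft labels.
        X = np.vstack([X_l, X_u])
        n, l = X.shape[0], X_l.shape[0]
        # W_ij = exp(-d_ij^2 / sigma^2), with a zero diagonal.
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        W = np.exp(-d2 / sigma ** 2)
        np.fill_diagonal(W, 0.0)
        # Column-normalize W to get T (probability of jumping from x_j to x_i),
        # then row-normalize T to get T_bar.
        T = W / W.sum(axis=0, keepdims=True)
        T_bar = T / T.sum(axis=1, keepdims=True)
        # Split T_bar after the l-th row/column and solve
        # Y_U = (I - T_bar_uu)^{-1} T_bar_ul Y_L.
        T_uu, T_ul = T_bar[l:, l:], T_bar[l:, :l]
        return np.linalg.solve(np.eye(n - l) - T_uu, T_ul @ Y_l)

The returned soft labels can be hardened by taking the arg max over the columns (senses) of each row.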
</Section> <Section position="2" start_page="416" end_page="417" type="sub_section"> <SectionTitle> 2.2 Model Order Identification Procedure </SectionTitle> <Paragraph position="0"> To achieve the model order identification (or sense number estimation) ability, we use a cluster validation based criterion (Levine and Domany, 2001) to infer the optimal number of senses of $w$ in $X_{L+U}$.</Paragraph> <Paragraph position="1"> Table 2. The cluster validation based evaluation function.
Function: CV($X_{L+U}$, $k$, $q$, $Y^0_{X_{L+U}}$)
Input: data set $X_{L+U}$, model order $k$, sampling frequency $q$, and initial labeling matrix $Y^0_{X_{L+U}}$;
Output: the score of the merit of $k$;
1 Run the extended label propagation algorithm with $X_L$, $X_U$, $k$ and $Y^0_{X_{L+U}}$;
2 Construct the connectivity matrix $C_k$ based on the above classification solution on $X_U$;
3 Use a random predictor $r_k$ to assign uniformly drawn labels to each vector in $X_U$;
4 Construct the connectivity matrix $C_{r_k}$ using the above classification solution on $X_U$;
5 For $u = 1$ to $q$ do
  5.1 Randomly sample a subset $X^u_{L+U}$ of size $\alpha|X_{L+U}|$ from $X_{L+U}$, $0 < \alpha < 1$;
  5.2 Run the extended label propagation algorithm with $X^u_L$, $X^u_U$, $k$ and $Y^{0,u}$;
  5.3 Construct the connectivity matrix $C^u_k$ using the above classification solution on $X^u_U$;
  5.4 Use $r_k$ to assign uniformly drawn labels to each vector in $X^u_U$;
  5.5 Construct the connectivity matrix $C^u_{r_k}$ using the above classification solution on $X^u_U$;
Endfor
6 Evaluate the merit of $k$ using the following formula: $M_k = \frac{1}{q} \sum_{u=1}^{q} \left( M(C^u_k, C_k) - M(C^u_{r_k}, C_{r_k}) \right)$, where $M(C^u, C)$ is given by equation (2);
7 Return $M_k$;</Paragraph> <Paragraph position="2"> This model order identification procedure can then be formulated as
$$\hat{k}_{X_{L+U}} = \mathop{\arg\max}_{K_{min} \le k \le K_{max}} CV(X_{L+U}, k, q, Y^0_{X_{L+U}}), \quad (1)$$
where $\hat{k}_{X_{L+U}}$ is the estimated sense number in $X_{L+U}$, $K_{min}$ (or $K_{max}$) is the minimum (or maximum) value of the sense number, and $k$ is a possible value of the sense number in $X_{L+U}$. Note that $k \ge k_{X_L}$; we therefore set $K_{min} = k_{X_L}$. $K_{max}$ may be set to a value greater than the possible ground-truth value. CV is a cluster validation based evaluation function; Table 2 shows the details of this function. We set $q$, the resampling frequency for the estimation of the stability score, to 20, and $\alpha$ to 0.90. The random predictor assigns uniformly distributed class labels to each instance in a given dataset. We run this CV procedure for each value of $k$. The value of $k$ that maximizes this function is selected as the estimate of the sense number. At the same time, we obtain a partition of $X_{L+U}$ with $\hat{k}_{X_{L+U}}$ groups.</Paragraph> <Paragraph position="3"> The agreement measure $M$ is defined as
$$M(C^u, C) = \frac{\sum_{i,j} 1\{C^u_{i,j} = 1,\ C_{i,j} = 1,\ x_i \in X^u_U,\ x_j \in X^u_U\}}{\sum_{i,j} 1\{C_{i,j} = 1,\ x_i \in X^u_U,\ x_j \in X^u_U\}}, \quad (2)$$
where $X^u_U$ is the untagged data in $X^u_{L+U}$, $X^u_{L+U}$ is a subset of size $\alpha|X_{L+U}|$ ($0 < \alpha < 1$) sampled from $X_{L+U}$, and $C$ (or $C^u$) is the $|X_U| \times |X_U|$ (or $|X^u_U| \times |X^u_U|$) connectivity matrix based on the classification solution computed on $X_U$ (or $X^u_U$). The connectivity matrix $C$ is defined as $C_{i,j} = 1$ if $x_i$ and $x_j$ belong to the same cluster, and $C_{i,j} = 0$ otherwise; $C^u$ is calculated in the same way.</Paragraph> <Paragraph position="4"> $M(C^u, C)$ measures the proportion of example pairs placed in the same group by the classification solution on $X_U$ that are also assigned to the same group by the classification solution on $X^u_U$. Clearly, $0 \le M \le 1$. Intuitively, if the value of $k$ is identical to the true sense number, then the classification results on the different subsets generated by sampling should be similar to the result on the full dataset. In other words, the classification solution with the true model order as parameter is robust against resampling, which gives rise to a local optimum of $M(C^u, C)$.</Paragraph> <Paragraph position="5"> In this algorithm, we normalize $M(C^u_k, C_k)$ by the equation in step 6 of Table 2, which makes our objective function different from the figure of merit (equation (2)) proposed in (Levine and Domany, 2001). The reason for normalizing $M(C^u_k, C_k)$ is that $M(C^u_k, C_k)$ tends to decrease as the value of $k$ increases (Lange et al., 2002). Therefore, to avoid a bias toward selecting smaller values of $k$ as the model order, we use the cluster validity of a random predictor to normalize $M(C^u_k, C_k)$.</Paragraph>
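Equation (2) reduces to simple operations on cluster assignments. The following is a small sketch under the definitions above (the function and variable names are ours):

    import numpy as np

    def connectivity(labels):
        # C_ij = 1 iff x_i and x_j are in the same cluster.
        labels = np.asarray(labels)
        return (labels[:, None] == labels[None, :]).astype(int)

    def figure_of_merit(labels_full, labels_sample, sample_idx):
        # M(C^u, C): proportion of same-cluster pairs in the full solution,
        # restricted to the sampled untagged points, that are preserved
        # in the solution computed on the sample.
        C = connectivity(np.asarray(labels_full)[sample_idx])
        C_u = connectivity(np.asarray(labels_sample))
        pairs = C == 1
        np.fill_diagonal(pairs, False)  # ignore trivial i == j pairs
        if not pairs.any():
            return 0.0
        return float((C_u[pairs] == 1).mean())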
<Paragraph position="6"> If $\hat{k}_{X_{L+U}}$ is equal to $k_{X_L}$, then there is no new sense in $X_U$. Otherwise ($\hat{k}_{X_{L+U}} > k_{X_L}$), new senses of $w$ may be represented by the groups in which there is no instance from $X_L$.</Paragraph>
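Putting the pieces together, Table 2 and equation (1) can be outlined as follows. This is a schematic sketch, not the authors' code: `classify` stands in for the ELP algorithm of Table 1, `figure_of_merit` is reused from the sketch above, only the untagged data is resampled (for brevity), and the step-6 normalization is written as the difference between the classifier's and the random predictor's agreement scores, following the reconstruction above:

    import numpy as np

    rng = np.random.default_rng(0)

    def cv_score(classify, X_L, Y_L, X_U, k, q=20, alpha=0.90):
        # Stability score M_k for a candidate sense number k (Table 2).
        # `classify(X_L, Y_L, X, k)` returns hard labels for X, e.g. via ELP.
        full = classify(X_L, Y_L, X_U, k)
        total = 0.0
        for _ in range(q):
            idx = rng.choice(len(X_U), size=int(alpha * len(X_U)), replace=False)
            sub = classify(X_L, Y_L, X_U[idx], k)
            # Random predictor r_k: uniformly drawn labels on X_U and X^u_U.
            rand_full = rng.integers(0, k, size=len(X_U))
            rand_sub = rng.integers(0, k, size=len(idx))
            total += (figure_of_merit(full, sub, idx)
                      - figure_of_merit(rand_full, rand_sub, idx))
        return total / q

    def estimate_sense_number(classify, X_L, Y_L, X_U, k_min, k_max):
        # Equation (1): choose the k that maximizes the CV score.
        return max(range(k_min, k_max + 1),
                   key=lambda k: cv_score(classify, X_L, Y_L, X_U, k))

</Section> </Section> </Paper>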