<?xml version="1.0" standalone="yes"?> <Paper uid="P05-1049"> <Title>Word Sense Disambiguation Using Label Propagation Based Semi-Supervised Learning</Title> <Section position="3" start_page="0" end_page="395" type="metho"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> In this paper, we address the problem of word sense disambiguation (WSD), which is to assign an appropriate sense to an occurrence of a word in a given context. Many methods have been proposed to deal with this problem, including supervised learning algorithms (Leacock et al., 1998), semi-supervised learning algorithms (Yarowsky, 1995), and unsupervised learning algorithms (Schütze, 1998).</Paragraph> <Paragraph position="1"> Supervised sense disambiguation has been very successful, but it requires a large amount of manually sense-tagged data and cannot utilize raw unannotated data that can be cheaply acquired. Fully unsupervised methods require neither sense definitions nor manually sense-tagged data, but their sense clustering results cannot be used directly in many NLP tasks, since there is no sense tag for the instances in the clusters. Considering both the availability of a large amount of unlabeled data and the direct use of word senses, semi-supervised learning methods have received great attention recently.</Paragraph> <Paragraph position="2"> Semi-supervised methods for WSD exploit unlabeled data in the learning procedure while requiring a predefined sense inventory for the target words. They roughly fall into three categories according to what is used for supervision in the learning process: (1) using external resources, e.g., thesauri or lexicons, to disambiguate word senses or to automatically generate a sense-tagged corpus (Lesk, 1986; Lin, 1997; McCarthy et al., 2004; Seo et al., 2004; Yarowsky, 1992); (2) exploiting the differences in the mapping of words to senses across languages through bilingual corpora (e.g., parallel corpora or untagged monolingual corpora in two languages) (Brown et al., 1991; Dagan and Itai, 1994; Diab and Resnik, 2002; Li and Li, 2004; Ng et al., 2003); (3) bootstrapping from sense-tagged seed examples to overcome the bottleneck of acquiring large amounts of sense-tagged data (Hearst, 1991; Karov and Edelman, 1998; Mihalcea, 2004; Park et al., 2000; Yarowsky, 1995).</Paragraph> <Paragraph position="3"> As a commonly used semi-supervised learning method for WSD, the bootstrapping algorithm works by iteratively classifying unlabeled examples and adding confidently classified examples to the labeled dataset, using a model learned from the augmented labeled dataset in the previous iteration. Notably, the affinity information among unlabeled examples is not fully exploited in this bootstrapping process.
Bootstrapping is based on a local consistency assumption: examples close to labeled examples within the same class will have the same labels. This is also the assumption underlying many supervised learning algorithms, such as kNN.</Paragraph> <Paragraph position="4"> Recently, a promising family of semi-supervised learning algorithms has been introduced that can effectively combine unlabeled data with labeled data in the learning process by exploiting the cluster structure in the data (Belkin and Niyogi, 2002; Blum et al., 2004; Chapelle et al., 1991; Szummer and Jaakkola, 2001; Zhu and Ghahramani, 2002; Zhu et al., 2003).</Paragraph> <Paragraph position="5"> Here we investigate a label propagation based semi-supervised learning algorithm (the LP algorithm) (Zhu and Ghahramani, 2002) for WSD. It works by representing labeled and unlabeled examples as vertices in a connected graph, iteratively propagating label information from any vertex to nearby vertices through weighted edges, and finally inferring the labels of the unlabeled examples after the propagation process converges.</Paragraph> <Paragraph position="6"> Compared with bootstrapping, the LP algorithm is based on a global consistency assumption. Intuitively, if there is at least one labeled example in each cluster of similar examples, then the unlabeled examples will receive the same labels as the labeled examples in the same cluster by propagating the label information of any example to nearby examples according to their proximity.</Paragraph> <Paragraph position="7"> This paper is organized as follows. First, we formulate the WSD problem in the context of semi-supervised learning in Section 2. Then, in Section 3, we describe the LP algorithm and discuss the differences between a supervised learning algorithm (SVM), the bootstrapping algorithm, and the LP algorithm.</Paragraph> <Paragraph position="8"> Section 4 provides experimental results of the LP algorithm on widely used benchmark corpora. Finally, we conclude our work and suggest possible improvements in Section 5.</Paragraph> </Section> <Section position="4" start_page="395" end_page="395" type="metho"> <SectionTitle> 2 Problem Setup </SectionTitle> <Paragraph position="0"> Let X = {x_i} (i = 1, ..., n) be the set of contexts of occurrences of an ambiguous word w, where x_i represents the context of the i-th occurrence and n is the total number of occurrences of w. Let S = {s_j} (j = 1, ..., c) denote the sense tag set of w. The first l examples x_g (1 ≤ g ≤ l) are labeled as y_g (y_g ∈ S), and the other u (l + u = n) examples x_h (l + 1 ≤ h ≤ n) are unlabeled. The goal is to predict the sense of w in context x_h by using the label information of the x_g and the similarity information among the examples in X.</Paragraph> <Paragraph position="1"> The cluster structure in X can be represented as a connected graph, where each vertex corresponds to an example, and the edge between any two examples x_i and x_j is weighted so that the closer the vertices are in some distance measure, the larger the weight associated with the edge. The weights are defined as W_ij = exp(−d_ij² / σ²) if i ≠ j and W_ii = 0 (1 ≤ i, j ≤ n), where d_ij is the distance (e.g., Euclidean distance) between x_i and x_j, and σ is used to control the weight W_ij.</Paragraph> </Section>
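To make the graph construction concrete, here is a minimal NumPy sketch of the affinity matrix W defined above; the function name, the feature-matrix layout, and the use of squared Euclidean distance for d_ij are illustrative assumptions rather than details fixed by the paper.

```python
import numpy as np

def affinity_matrix(X, sigma):
    """Sketch of W_ij = exp(-d_ij^2 / sigma^2) with W_ii = 0, as defined above.

    X     : (n, d) array, one feature vector per context occurrence (assumed layout)
    sigma : bandwidth controlling how fast the weight decays with distance
    """
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T  # pairwise squared distances
    d2 = np.maximum(d2, 0.0)        # guard against tiny negative values from rounding
    W = np.exp(-d2 / sigma ** 2)
    np.fill_diagonal(W, 0.0)        # W_ii = 0 by definition
    return W
```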
<Section position="5" start_page="395" end_page="397" type="metho"> <SectionTitle> 3 Semi-supervised Learning Method </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="395" end_page="396" type="sub_section"> <SectionTitle> 3.1 Label Propagation Algorithm </SectionTitle> <Paragraph position="0"> In the LP algorithm (Zhu and Ghahramani, 2002), the label information of every vertex in the graph is propagated to nearby vertices through weighted edges until a globally stable state is reached. Larger edge weights allow labels to travel through more easily. Thus, the closer the examples are, the more likely they are to have similar labels (the global consistency assumption).</Paragraph> <Paragraph position="1"> During label propagation, the soft label of each initially labeled example is clamped in every iteration, replenishing the label sources from the labeled data.</Paragraph> <Paragraph position="2"> The labeled data thus act as sources that push labels out through the unlabeled data. With this push from the labeled examples, the class boundaries are pushed through edges with large weights and settle in gaps along edges with small weights. If the data structure fits the classification goal, the LP algorithm can use the unlabeled data to help learn the classification boundary.</Paragraph> <Paragraph position="3"> Let Y^0 ∈ N^(n×c) represent the initial soft labels attached to the vertices, where Y^0_ij = 1 if y_i is s_j and 0 otherwise. Let Y^0_L be the top l rows of Y^0 and Y^0_U the remaining u rows. Y^0_L is consistent with the labeling of the labeled data, and the initialization of Y^0_U can be arbitrary.</Paragraph> <Paragraph position="4"> Ideally, we expect the value of W_ij across different classes to be as small as possible and the value of W_ij within the same class to be as large as possible, so that label propagation tends to stay within the same class. In the later experiments, we set σ as the average distance between labeled examples from different classes.</Paragraph> <Paragraph position="5"> Define the n × n probability transition matrix T by T_ij = W_ij / Σ_{k=1}^{n} W_kj, where T_ij is the probability of jumping from example x_j to example x_i. Compute the row-normalized matrix T̄ by T̄_ij = T_ij / Σ_{k=1}^{n} T_ik. This normalization maintains the class probability interpretation of Y.</Paragraph> <Paragraph position="6"> [Figure 1: (a) two-moon pattern dataset with two labeled points; (b) classification result of SVM; (c) labeling procedure of the bootstrapping algorithm; (d) ideal classification.] The LP algorithm is then defined as follows: 1. Initially set t = 0, where t is the iteration index; 2. Propagate the labels by Y^(t+1) = T̄ Y^t; 3. Clamp the labeled data by replacing the top l rows of Y^(t+1) with Y^0_L; repeat from step 2 until Y^t converges; 4. Assign each x_h (l + 1 ≤ h ≤ n) the label s_ĵ, where ĵ = argmax_j Y_hj.</Paragraph> <Paragraph position="7"> This algorithm has been shown to converge to a unique solution, Ŷ_U = lim_{t→∞} Y^t_U = (I − T̄_uu)^(−1) T̄_ul Y^0_L. We can see that this solution can be obtained without iteration and that the initialization of Y^0_U is unimportant, since Y^0_U does not affect the estimation of Ŷ_U. Here I is the u × u identity matrix, and T̄_uu and T̄_ul are obtained by splitting the matrix T̄ after the l-th row and the l-th column into four sub-matrices.</Paragraph> </Section>
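As a concrete reading of steps 1-4 and of the closed-form limit above, here is a small NumPy sketch; the function names, the convergence tolerance, and the iteration cap are assumptions made for illustration, not details from the paper.

```python
import numpy as np

def label_propagation(W, Y0, l, max_iter=1000, tol=1e-6):
    """Iterative LP sketch: Y^(t+1) = T_bar @ Y^t with the top l rows clamped to Y^0_L."""
    T = W / W.sum(axis=0, keepdims=True)       # T_ij = W_ij / sum_k W_kj (jump prob. from x_j to x_i)
    T_bar = T / T.sum(axis=1, keepdims=True)   # row-normalize to keep a class-probability reading of Y
    Y = Y0.astype(float)
    for _ in range(max_iter):
        Y_next = T_bar @ Y                     # step 2: propagate labels
        Y_next[:l] = Y0[:l]                    # step 3: clamp the labeled rows
        if np.abs(Y_next - Y).max() < tol:     # stop once Y has converged
            Y = Y_next
            break
        Y = Y_next
    return Y[l:].argmax(axis=1)                # step 4: pick the highest-scoring sense per unlabeled example

def label_propagation_closed_form(W, Y0, l):
    """Closed-form limit: Y_U = (I - T_bar_uu)^(-1) T_bar_ul Y^0_L (no iteration needed)."""
    T = W / W.sum(axis=0, keepdims=True)
    T_bar = T / T.sum(axis=1, keepdims=True)
    T_uu, T_ul = T_bar[l:, l:], T_bar[l:, :l]
    Y_U = np.linalg.solve(np.eye(T_uu.shape[0]) - T_uu, T_ul @ Y0[:l].astype(float))
    return Y_U.argmax(axis=1)
```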
<Section position="2" start_page="396" end_page="397" type="sub_section"> <SectionTitle> 3.2 Comparison between SVM, Bootstrapping and LP </SectionTitle> <Paragraph position="0"> For WSD, SVM is one of the state-of-the-art supervised learning algorithms (Mihalcea et al., 2004), while bootstrapping is one of the state-of-the-art semi-supervised learning algorithms (Li and Li, 2004; Yarowsky, 1995). To compare LP with SVM and bootstrapping, let us consider a dataset with the two-moon pattern shown in Figure 1(a). The upper moon consists of 9 points and the lower moon of 13 points. There is only one labeled point in each moon, and the other 20 points are unlabeled. The distance metric is Euclidean distance. We can see that the points within one moon should be more similar to each other than points across the two moons.</Paragraph> <Paragraph position="1"> Figure 1(b) shows the classification result of SVM. The vertical line denotes the classification hyperplane, which has the maximum separating margin with respect to the labeled points of the two classes. We can see that SVM does not work well when the labeled data cannot reveal the structure (the two-moon pattern) of each class. The reason is that the classification hyperplane is learned only from the labeled data; in other words, the coherent structure (the two-moon pattern) in the unlabeled data is not exploited when inferring the class boundary.</Paragraph> <Paragraph position="2"> Figure 1(c) shows the bootstrapping procedure using kNN (k = 1) as the base classifier, with the user-specified parameter b = 1 (the number of examples added from the unlabeled data to the classified data for each class in each iteration). The termination condition is that the distance between labeled and unlabeled points exceeds the inter-class distance (the distance between A0 and B0). Each arrow in Figure 1(c) represents one classification operation per class in one iteration. After eight iterations, A1-A8 were tagged as +1 and B1-B8 as −1, while A9-A10 and B9-B10 were still untagged. Then, at the ninth iteration, A9 was tagged as +1, since the label of A9 was determined only by the labeled points in the kNN model: A9 is closer to points in {A0-A8} than to any point in {B0-B8}, regardless of the intrinsic structure in the data, namely that A9-A10 and B9-B10 are closer to points in the lower moon than to points in the upper moon. In other words, the bootstrapping method uses the unlabeled data under a local-consistency-based strategy. This is why the two points A9 and A10 are misclassified (shown in Figure 1(c)).</Paragraph> <Paragraph position="3"> From the above analysis, we see that both SVM and bootstrapping are based on a local consistency assumption.</Paragraph>
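For contrast with LP, the local, nearest-neighbour behaviour of bootstrapping described above can be sketched as a simple self-training loop with a 1-NN base classifier; the function name, the fixed iteration cap, and the simplified stopping rule (the inter-class-distance termination condition used in the figure is omitted) are assumptions made only for illustration.

```python
import numpy as np

def bootstrap_1nn(X, seed_labels, b=1, n_iter=10):
    """Self-training sketch with a 1-NN base classifier: classify every unlabeled
    point by its nearest labeled neighbour, then move the b most confident
    (closest) points of each class into the labeled set, as in Figure 1(c)."""
    labels = dict(seed_labels)                      # example index -> sense tag
    unlabeled = set(range(len(X))) - set(labels)
    for _ in range(n_iter):
        if not unlabeled:
            break
        # 1-NN prediction and confidence (distance to the nearest labeled point)
        predictions = {}
        for j in unlabeled:
            dists = {i: np.linalg.norm(X[j] - X[i]) for i in labels}
            nearest = min(dists, key=dists.get)
            predictions[j] = (labels[nearest], dists[nearest])
        # per class, absorb the b unlabeled points closest to the current labeled set
        for c in set(labels.values()):
            ranked = sorted((j for j in unlabeled if predictions[j][0] == c),
                            key=lambda j: predictions[j][1])
            for j in ranked[:b]:
                labels[j] = c
                unlabeled.discard(j)
    return labels
```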
[Figure 2: the convergence process of the LP algorithm with t varying from 1 to 100 is shown in panels (b) to (f).] <Paragraph position="4"> Finally, we ran LP on a connected graph, a minimum spanning tree generated for this dataset, shown in Figure 2(a). A, B, and C denote three points, and the edge A-B connects the two moons. Figures 2(b)-2(f) show the convergence process of LP as t increases from 1 to 100. When t = 1, the label information of the labeled data was pushed only to nearby points. After seven iterations (t = 7), point B in the upper moon was misclassified as −1, since it first received label information from point A through the edge connecting the two moons. After another three iterations (t = 10), this misclassified point was re-tagged as +1. The reason for this self-correcting behavior is that, with the push of label information from nearby points, the value of Y_B,+1 became higher than Y_B,−1. In other words, the weight of edge B-C is larger than that of edge B-A, which makes it easier for the +1 label of point C to travel to point B. Finally, when t ≥ 12, LP converged to a fixed point, which achieved the ideal classification result.</Paragraph> </Section> </Section> </Paper>