<?xml version="1.0" standalone="yes"?> <Paper uid="P04-1080"> <Title>Learning Word Senses With Feature Selection and Order Identification Capabilities</Title> <Section position="3" start_page="0" end_page="0" type="metho"> <SectionTitle> 1 Introduction </SectionTitle>
<Paragraph position="0"> Sense disambiguation is essential for many language applications such as machine translation, information retrieval, and speech processing (Ide and Véronis, 1998). Almost all sense disambiguation methods are heavily dependent on manually compiled lexical resources. However, these lexical resources often miss domain-specific word senses, and many new words are not included at all.</Paragraph>
<Paragraph position="1"> Learning word senses from free text allows us to dispense with outside knowledge sources and to define senses solely by discriminating among the senses of words. Another application of word sense learning is to help enrich or even construct semantic lexicons (Widdows, 2003).</Paragraph>
<Paragraph position="2"> The solution to word sense learning is closely related to the interpretation of word senses. Different interpretations of word senses result in different solutions to word sense learning.</Paragraph>
<Paragraph position="3"> One interpretation strategy is to treat a word sense as a set of synonyms, like a synset in WordNet. The committee based word sense discovery algorithm (Pantel and Lin, 2002) followed this strategy, treating senses as clusters of words occurring in similar contexts. Their algorithm initially discovered tight clusters called committees by grouping the top n words most similar to the target word using average-link clustering. The target word was then assigned to a committee if the similarity between them was above a given threshold. Each committee that the target word belonged to was interpreted as one of its senses.</Paragraph>
<Paragraph position="4"> There are two difficulties with this committee based sense learning. The first concerns the derivation of feature vectors. A feature for the target word here consists of a contextual content word and its grammatical relationship with the target word. Acquiring grammatical relationships depends on the output of a syntactic parser, but for some languages, e.g., Chinese, the performance of syntactic parsing is still a problem. The second difficulty is that two parameters must be provided, controlling the number of committees and the number of senses of the target word.</Paragraph>
<Paragraph position="5"> Another interpretation strategy is to treat a word sense as a group of similar contexts of the target word.</Paragraph>
<Paragraph position="6"> The context group discrimination (CGD) algorithm presented in (Schütze, 1998) adopted this strategy. First, their algorithm selected important contextual words using the χ² or the local frequency criterion. With the χ² based criterion, contextual words whose occurrence depended on whether the ambiguous word occurred were chosen as features. With the local frequency criterion, the top n most frequent contextual words were selected as features. Then each context of an occurrence of the target word was represented by a second-order co-occurrence based context vector. Singular value decomposition (SVD) was conducted to reduce the dimensionality of the context vectors. Finally, the reduced context vectors were grouped into a pre-defined number of clusters whose centroids corresponded to the senses of the target word.</Paragraph>
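<Paragraph> The second-order representation just described, and adopted later in this paper, can be sketched in a few lines of code. The following is a minimal illustration rather than any published implementation: it assumes the corpus is already tokenized into sentences, builds a word-by-word co-occurrence matrix within a fixed window, and represents each occurrence of the target word by the sum of the co-occurrence vectors of its contextual words; the window size and variable names are illustrative assumptions.

# Minimal sketch of second-order context vectors (illustrative, not the paper's code).
import numpy as np

def cooccurrence_matrix(sentences, window=5):
    """Build an M x M co-occurrence matrix V over all words in `sentences`."""
    vocab = {w: i for i, w in enumerate(sorted({w for s in sentences for w in s}))}
    V = np.zeros((len(vocab), len(vocab)))
    for s in sentences:
        for i, w in enumerate(s):
            for j in range(max(0, i - window), min(len(s), i + window + 1)):
                if j != i:
                    V[vocab[w], vocab[s[j]]] += 1
    return V, vocab

def second_order_context_vectors(sentences, target, V, vocab, window=5):
    """Represent each occurrence of `target` by the sum of its neighbours' word vectors."""
    contexts = []
    for s in sentences:
        for i, w in enumerate(s):
            if w == target:
                neighbours = s[max(0, i - window):i] + s[i + 1:i + window + 1]
                vec = np.zeros(V.shape[1])
                for n in neighbours:
                    vec += V[vocab[n]]  # second order: add the neighbour's co-occurrence row
                contexts.append(vec)
    return np.array(contexts)

# In the CGD algorithm the resulting vectors are further reduced with SVD, e.g.
# sklearn.decomposition.TruncatedSVD(n_components=100).fit_transform(contexts).
</Paragraph>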
<Paragraph position="8"> Some observations can be made about their feature selection and clustering procedure. One observation is that their feature selection uses only first-order information, although the second-order co-occurrence data is available. The other observation concerns their clustering procedure. Similar to the committee based sense discovery algorithm, their clustering procedure also requires the cluster number to be predefined. Their method can capture both coarse-grained and fine-grained sense distinctions as the predefined cluster number varies. But from a statistical point of view, there should exist a partitioning of the data at which the most reliable, "natural" sense clusters appear.</Paragraph>
<Paragraph position="9"> In this paper, we follow the second-order representation method for the contexts of the target word, since it is supposed to be less sparse and more robust than first-order information (Schütze, 1998). We introduce a cluster validation based unsupervised feature wrapper to remove noisy contextual words, which works by measuring the consistency between cluster structures estimated from disjoint data subsets in the selected feature space. It is based on the assumption that if the selected feature subset is important and complete, the cluster structure estimated from a data subset in this feature space should be stable and robust against random sampling. After determining the important contextual words, we use a Gaussian mixture model (GMM) based clustering algorithm (Bouman et al., 1998) to estimate the cluster structure and cluster number by minimizing the Minimum Description Length (MDL) criterion (Rissanen, 1978). We construct several subsets from a widely used benchmark corpus as test data. Experimental results show that our algorithm (FSGMM) can find important feature subsets, estimate the cluster number, and achieve better performance compared with the CGD algorithm.</Paragraph>
<Paragraph position="10"> This paper is organized as follows. In section 2 we introduce our word sense learning algorithm, which incorporates unsupervised feature selection and a model order identification technique.</Paragraph>
<Paragraph position="11"> Section 3 presents the experimental results of our algorithm and discusses some findings from these results. Section 4 is devoted to a brief review of related efforts on word sense discrimination. In section 5 we conclude our work and suggest some possible improvements.</Paragraph> </Section> <Section position="4" start_page="0" end_page="0" type="metho"> <SectionTitle> 2 Learning Procedure 2.1 Feature selection </SectionTitle>
<Paragraph position="0"> Feature selection for word sense learning aims to find important contextual words that help to discriminate the senses of the target word without using class labels in the data set. This problem can be generalized as selecting an important feature subset in an unsupervised manner. Many unsupervised feature selection algorithms have been presented, which can be categorized as feature filters (Dash et al., 2002; Talavera, 1999) and feature wrappers (Dy and Brodley, 2000; Law et al., 2002; Mitra et al., 2002; Modha and Spangler, 2003).</Paragraph>
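<Paragraph> For concreteness, the filter style of feature selection mentioned above can be as simple as ranking contextual words by local frequency, or by a χ² test of association between a contextual word and the ambiguous target word, the two criteria used by the CGD algorithm and used below to pre-sort the candidate word list. The sketch is purely illustrative; the inputs and counts are hypothetical and this is not the paper's implementation.

# Illustrative feature-filter baselines (hypothetical inputs, not the paper's code).
from collections import Counter
from scipy.stats import chi2_contingency

def top_n_by_frequency(contexts, n=20):
    """contexts: list of token lists drawn from the local contexts of the target word."""
    counts = Counter(w for ctx in contexts for w in ctx)
    return [w for w, _ in counts.most_common(n)]

def chi_square_score(n_both, n_word_only, n_target_only, n_neither):
    """Score a contextual word from a 2x2 contingency table of corpus counts:
    co-occurrence with the target word vs. occurrence of each word alone or neither."""
    stat, _, _, _ = chi2_contingency([[n_both, n_word_only],
                                      [n_target_only, n_neither]])
    return stat

A wrapper, in contrast, scores a whole feature subset by how well it supports the subsequent clustering, which is the route taken in the remainder of this section.
</Paragraph>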
<Paragraph position="1"> In this paper we propose a cluster validation based unsupervised feature subset evaluation method. Cluster validation has been used to solve the model order identification problem (Lange et al., 2002; Levine and Domany, 2001). Table 1 gives our feature subset evaluation algorithm. If some features in the feature subset are noisy, the cluster structure estimated on a data subset in the selected feature space will not be stable and is more likely to be an artifact of the random splitting. The consistency between cluster structures estimated from disjoint data subsets will then be low. Otherwise, the estimated cluster structures should be more consistent. Here we assume that splitting does not eliminate any of the underlying modes in the data set.</Paragraph>
<Paragraph position="2"> To compare different cluster structures, predictors are constructed from these clustering solutions, and these predictors are then used to classify the same data subset. The agreement between the class memberships computed by the different predictors can be used as the measure of consistency between cluster structures. We use the stability measure of (Lange et al., 2002) (given in Table 1) to assess this agreement.</Paragraph>
<Paragraph position="3"> For each occurrence, one strategy is to construct its second-order context vector by summing the vectors of all its contextual words and then let the feature selection procedure work on these second-order context vectors. However, since the sense associated with a word's occurrence is usually determined by very few feature words in its context, there are typically more noisy words than real features in the contexts.</Paragraph>
<Paragraph position="4"> Simply summing the contextual words' vectors may therefore result in noise-dominated second-order context vectors.</Paragraph>
<Paragraph position="5"> To deal with this problem, we extend feature selection to the construction of the second-order context vectors themselves: selecting better feature words in the contexts yields better second-order context vectors, which in turn enables better feature selection.</Paragraph>
<Paragraph position="6"> Since the sense associated with a word's occurrence is determined by feature words in its context, it is reasonable to require that the selected features cover most of the occurrences.</Paragraph>
<Paragraph position="7"> Formally, let coverage(D, T) be the coverage rate of the feature set T with respect to a set of contexts D, i.e., the ratio of the number of occurrences with at least one feature in their local contexts to the total number of occurrences. We then require that coverage(D, T) ≥ τ. In practice, we set τ = 0.9.</Paragraph>
<Paragraph position="8"> This constraint also helps to avoid a bias toward selecting too few features: with fewer features, more occurrences have no feature in their contexts, their context vectors become zero valued, and this tends to produce an artificially stable cluster structure.</Paragraph>
<Paragraph position="9"> Let D be the set of local contexts of occurrences of the target word, D = {d_1, ..., d_N}, where d_i represents the local context of the i-th occurrence and N is the total number of the word's occurrences.</Paragraph>
<Paragraph position="10"> Let W denote the bag of words occurring in the context set D, W = {w_1, ..., w_M}, where w_i denotes a word occurring in D and M is the total number of different contextual words.</Paragraph>
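<Paragraph> The coverage constraint above is straightforward to compute. A small illustrative sketch (hypothetical inputs, not the paper's code) counts how many contexts in D contain at least one word from a candidate feature set T:

# coverage(D, T): fraction of contexts (token lists) containing at least one feature in T.
def coverage(D, T, tau=0.9):
    T = set(T)
    covered = sum(1 for context in D if any(w in T for w in context))
    rate = covered / len(D)
    return rate, rate >= tau  # the constraint used here is coverage(D, T) >= tau, tau = 0.9
</Paragraph>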
<Paragraph position="11"> Let V denote an M × M second-order co-occurrence symmetric matrix. Suppose that the i-th row (1 ≤ i ≤ M) corresponds to word w_i and the j-th column (1 ≤ j ≤ M) corresponds to word w_j; then the entry in the i-th row and j-th column records the number of times that word w_i occurs close to w_j in the corpus.</Paragraph>
<Paragraph position="12"> We use v(w_i) to represent the word vector of contextual word w_i, which is the i-th row of matrix V. H^T is a weight matrix over the contextual word subset T, T ⊆ W. Each entry h_{i,j} represents the weight of word w_j in d_i, with w_j ∈ T and 1 ≤ i ≤ N. We use binary term weighting to derive the context vectors: h_{i,j} = 1 if word w_j occurs in d_i, and 0 otherwise.</Paragraph>
<Paragraph position="13"> Let C^T = {c^T_1, ..., c^T_N} be the set of context vectors in feature space T, where c^T_i is the context vector of the i-th occurrence. c^T_i is defined as:</Paragraph>
<Paragraph position="14"> c^T_i = Σ_{w_j ∈ T} h_{i,j} v(w_j).</Paragraph>
<Paragraph position="15"> The feature subset selection in word set W can then be formulated as:</Paragraph>
<Paragraph position="16"> T̂ = argmax_T criterion(T, H^T, V, q), subject to coverage(D, T) ≥ τ,</Paragraph>
<Paragraph position="17"> where T̂ is the optimal feature subset, criterion is the cluster validation based evaluation function (the function in Table 1), q is the resampling frequency used to estimate stability, and coverage(D, T) is the proportion of contexts containing occurrences of features in T. This constrained optimization yields a solution that maximizes the criterion while meeting the given constraint. In this paper we use sequential greedy forward floating search (Pudil et al., 1994) over the word list sorted by the χ² or local frequency criterion. We set l = 1 and m = 1, where l is the plus step and m is the take-away step.</Paragraph> <Section position="1" start_page="0" end_page="0" type="sub_section"> <SectionTitle> 2.2 Clustering with order identification </SectionTitle>
<Paragraph position="0"> After feature selection, we employ a Gaussian mixture modelling algorithm, Cluster (Bouman et al., 1998), to estimate the cluster structure and the cluster number.</Paragraph>
<Paragraph position="1"> Table 1: Feature subset evaluation function. Intuitively, for a given feature subset T, we iteratively split the data set into disjoint halves and compute the agreement of the clustering solutions estimated from these halves using the stability measure. The average stability over q resamplings is the estimated score of T.
Function criterion(T, H, V, q)
Input: feature subset T, weight matrix H, second-order co-occurrence matrix V, resampling frequency q;
(1) S_T = 0;
(2) For i = 1 to q do
(2.1) Randomly split C^T into disjoint halves, denoted C^T_A and C^T_B;
(2.2) Estimate the GMM parameters and cluster number on C^T_A using Cluster, giving the parameter set θ̂_A, and likewise θ̂_B on C^T_B;
(2.3) Classify C^T_B with the mixture models θ̂_A and θ̂_B, obtaining labellings L_A and L_B of C^T_B;
(2.4) S_T = S_T + max_ρ (1 / |C^T_B|) Σ_i 1{ρ(L_A(c^T_{B,i})) = L_B(c^T_{B,i})}, where ρ denotes a possible permutation relating the indices of L_A and L_B, and c^T_{B,i} ∈ C^T_B;
(3) S_T = S_T / q;
(4) Return S_T;</Paragraph>
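<Paragraph> The evaluation function in Table 1 can be sketched compactly. The code below is an illustration under stated substitutions, not the authors' implementation: scikit-learn's GaussianMixture, with the component number chosen by BIC, stands in for the Cluster algorithm's MDL-based order identification (described in Section 2.2), and the maximization over label permutations is computed with the Hungarian algorithm; q, k_max, and the other names are assumed parameters.

# Sketch of criterion() from Table 1 (illustrative; GaussianMixture + BIC replaces Cluster/MDL).
import numpy as np
from sklearn.mixture import GaussianMixture
from scipy.optimize import linear_sum_assignment

def fit_gmm(X, k_max=5):
    """Fit GMMs for K = 1..k_max and keep the one with the lowest BIC (MDL stand-in)."""
    models = [GaussianMixture(n_components=k, covariance_type='full',
                              random_state=0).fit(X) for k in range(1, k_max + 1)]
    return min(models, key=lambda m: m.bic(X))

def stability(labels_a, labels_b):
    """Maximum agreement between two labelings of the same points over label permutations."""
    k = int(max(labels_a.max(), labels_b.max())) + 1
    agree = np.zeros((k, k))
    for a, b in zip(labels_a, labels_b):
        agree[a, b] += 1
    rows, cols = linear_sum_assignment(-agree)  # Hungarian algorithm maximizes matched agreement
    return agree[rows, cols].sum() / len(labels_a)

def criterion(C_T, q=10, k_max=5, seed=0):
    """C_T: N x d array of second-order context vectors in the selected feature space T."""
    rng = np.random.default_rng(seed)
    score = 0.0
    for _ in range(q):
        idx = rng.permutation(len(C_T))
        half = len(C_T) // 2
        A, B = C_T[idx[:half]], C_T[idx[half:]]
        labels_a = fit_gmm(A, k_max).predict(B)  # predictor estimated on A classifies B
        labels_b = fit_gmm(B, k_max).predict(B)  # B classified by its own clustering solution
        score += stability(labels_a, labels_b)
    return score / q

Solving the permutation maximization with the Hungarian algorithm on the label agreement matrix is equivalent to searching over all index permutations, but runs in polynomial time even when the two halves yield different cluster numbers.
</Paragraph>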
<Paragraph position="2"> Let Y = {y_1, ..., y_N} be the set of M-dimensional vectors to be modelled by the GMM. Assuming that this model has K subclasses, let π_k denote the prior probability of subclass k, μ_k the M-dimensional mean vector of subclass k, and R_k the M × M covariance matrix of subclass k, 1 ≤ k ≤ K. The subclass label of sample y_n is represented by x_n. The MDL criterion is used for GMM parameter estimation and order identification. Its log likelihood term measures the goodness of fit of the model to the data sample, while its second term penalizes model complexity.</Paragraph>
<Paragraph position="3"> This estimator works by attempting to find the model order with the minimum code length needed to describe the data sample Y and the parameter set θ.</Paragraph>
<Paragraph position="4"> If the cluster number is fixed, the GMM parameters can be estimated with the EM algorithm for this type of incomplete data problem (Dempster et al., 1977). The mixture parameters θ^(1) are initialized with K_0 subclasses, where K_0 is a given initial subclass number.</Paragraph>
<Paragraph position="5"> The EM algorithm is then used to estimate the model parameters by minimizing the MDL criterion. E-step: re-estimate the expected subclass memberships based on the parameters of the previous iteration. M-step: estimate the model parameters θ^(i) so as to maximize the log likelihood term in the MDL criterion.</Paragraph>
<Paragraph position="6"> To infer the cluster number, the EM algorithm is applied for each value of K, 1 ≤ K ≤ K_0, and the value K̂ that minimizes the MDL criterion is chosen as the cluster number. To make this process more efficient, the two clusters l and m that minimize the change in the MDL criterion when reducing K to K − 1 are selected and merged, and the resulting parameter set is used as the initial condition for the EM iterations with K − 1 subclasses. This avoids a complete minimization with respect to π, μ, and R for each value of K.</Paragraph> </Section> <Section position="2" start_page="0" end_page="0" type="sub_section"> <SectionTitle> Distribution of each sense </SectionTitle>
<Paragraph position="0">
Word         Sense                                   Percentage
hard         not easy (difficult)                    82.8%
(adjective)  not soft (metaphoric)                   9.6%
             not soft (physical)                     7.6%
interest     money paid for the use of money         52.4%
             a share in a company or business        20.4%
             readiness to give attention             14%
             advantage, advancement or favor         9.4%
             activity that one gives attention to    3.6%
             causing attention to be given to        0.2%
</Paragraph> </Section> </Section> </Paper>