<?xml version="1.0" standalone="yes"?> <Paper uid="W04-0846"> <Title>Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities</Title> <Section position="3" start_page="0" end_page="2" type="metho"> <SectionTitle> 2 Task Definition and Algorithm Design </SectionTitle> <Paragraph position="0"> Given n mentions of a key word, we first introduce the following symbols.</Paragraph> <Paragraph position="2"> refers to the context similarity between the i -th context and the j -th context, which is a subset of the predefined context similarity features.</Paragraph> <Paragraph position="3"> a f refers to the a -th predefined context similarity feature. So</Paragraph> <Paragraph position="5"> The WSD task is defined as the hard clustering of multiple contexts of the key word. Its final solution is represented as {}MK, where K refers to the number of distinct senses, and M represents the many-to-one mapping (from contexts to a cluster) such that () K]. [1,j n],[1,i j,iM [?][?]= For any given context pair, a set of context similarity features are defined. With n mentions of the same key word,</Paragraph> <Paragraph position="7"> [?][?] are computed. The WSD task is formulated as searching for {}MK, which maximizes the following conditional probability: maximizing the joint probability in Eq. (1), which contains a prior probability distribution of WSD, Because there is no prior knowledge available about what solution is preferred, it is reasonable to take an equal distribution as the prior probability distribution. So WSD is equivalent to searching for in Eq. (3), a maximum entropy model is trained. There are two major advantages of this maximum entropy model: i) the model is independent of individual words; ii) the model takes no information independence assumption about the data, and hence is powerful enough to utilize heterogeneous features. With the learned conditional probabilities in Eq. (3), for a given {}MK, candidate, we can compute the conditional probability of Expression (2). In the final step, optimization is performed to search for {}MK, that maximizes the value of Expression (2).</Paragraph> </Section> <Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Maximum Entropy Modeling </SectionTitle> <Paragraph position="0"> This section presents the definition of context similarity features, and how to estimate the generative probabilities of context similarity</Paragraph> <Paragraph position="2"> of-speech (POS) tag. Corpus I is constructed using context pairs involving the same sense of a word.</Paragraph> <Paragraph position="3"> Corpus II is constructed using context pairs that refer to different senses of a word. Each corpus contains about 18,000 context pairs. The instances in the corpora are represented as pairwise context similarities, taking the form of {} generative probabilities by maximum entropy for Corpus I and Corpus II.</Paragraph> <Paragraph position="4"> We now present how to compute the context similarities. Each context contains the following two categories of features: i) Trigger words centering around the key word within a predefined window size equal to 50 tokens to both sides of the key word. Trigger words are learned using the same technique as in [Niu et al. 
<Section position="4" start_page="2" end_page="2" type="metho"> <SectionTitle> 3 Maximum Entropy Modeling </SectionTitle>
<Paragraph position="0"> This section presents the definition of context similarity features, and how to estimate the generative probabilities of context similarities, $\Pr(CS \mid \text{same sense})$ and $\Pr(CS \mid \text{different senses})$, by maximum entropy modeling.</Paragraph>
<Paragraph position="1"> Two training corpora of context pairs are constructed, where a key word is identified by its lemma together with its part-of-speech (POS) tag. Corpus I is constructed using context pairs involving the same sense of a word. Corpus II is constructed using context pairs that refer to different senses of a word. Each corpus contains about 18,000 context pairs. Note that the words that appear in the Senseval-3 lexical sample evaluation are removed in the corpus construction process. The instances in the corpora are represented as pairwise context similarities, taking the form of a feature set $\{f_{a_1}, f_{a_2}, \dots, f_{a_m}\}$; the generative probabilities of such feature sets are then estimated by maximum entropy modeling, separately for Corpus I and Corpus II.</Paragraph>
<Paragraph position="2"> We now present how to compute the context similarities. Each context contains the following two categories of features: i) trigger words centering around the key word within a predefined window of 50 tokens on each side of the key word; trigger words are learned using the same technique as in [Niu et al. 2003]; ii) parsing relationships associated with the key word, automatically decoded by our parser InfoXtract [Srihari et al. 2003]. The relationships being utilized are listed below. Noun: subject-of, object-of, complement-of, has-adjective-modifier, has-noun-modifier, modifier-of, possess, possessed-by, appositive-of. Verb: has-subject, has-object, has-complement, has-adverb-modifier, has-prepositional-modifier. Adjective: modifier-of, has-adverb-modifier.</Paragraph>
<Paragraph position="3"> Based on the above context features, the following three categories of context similarity features are defined: (1) Context similarity based on a vector space model using co-occurring trigger words: the trigger words centering around the key word are represented as a vector, and the tf*idf scheme is used to weigh each trigger word. The cosine of the angle between two resulting vectors is used as a context similarity measure. (2) Context similarity based on latent semantic analysis (LSA) using trigger words: LSA [Deerwester et al. 1990] is a technique used to uncover the underlying semantics based on co-occurrence data. Using LSA, each word is represented as a vector in the semantic space. The trigger words are represented as a vector summation. Then the cosine of the angle between the two resulting vector summations is computed and used as a context similarity measure. (3) Context similarity based on LSA using parsing relationships: for each type of parsing relationship listed above, the words related to the key word are represented in the LSA semantic space, and the cosine of the angle between the corresponding vectors of the two contexts is used as a context similarity measure.</Paragraph>
<Paragraph position="4"> To facilitate the maximum entropy modeling in the later stage, each resulting similarity measure is discretized into 10 integer values. The pairwise context similarity is thus a set of similarity features, e.g. {VSM-Similarity-equal-to-2, LSA-Trigger-Words-Similarity-equal-to-1, LSA-Subject-Similarity-equal-to-2}.</Paragraph>
<Paragraph position="5"> In addition to the three categories of basic context similarity features defined above, we also define induced context similarity features by combining basic context similarity features using the logical AND operator. With induced features, the context similarity in the previous example is represented as {VSM-Similarity-equal-to-2, LSA-Trigger-Words-Similarity-equal-to-1, LSA-Subject-Similarity-equal-to-2, [VSM-Similarity-equal-to-2 AND LSA-Trigger-Words-Similarity-equal-to-1], [VSM-Similarity-equal-to-2 AND LSA-Subject-Similarity-equal-to-2], [LSA-Trigger-Words-Similarity-equal-to-1 AND LSA-Subject-Similarity-equal-to-2], ...}. The induced features provide direct and fine-grained information, but suffer from a smaller sampling space (data sparseness). To keep the computation feasible, we limit the number of logical AND operations in an induced feature to at most 3. Combining basic features and induced features under a smoothing scheme, maximum entropy modeling may achieve optimal performance.</Paragraph>
<Paragraph position="6"> The maximum entropy modeling can now be formulated as follows: given a pairwise context similarity $CS$, its generative probability is $$\Pr(CS) = \frac{1}{Z}\exp\Big(\sum_{f \in CS} w_f\Big)$$ where $Z$ is the normalization factor and $w_f$ is the weight associated with feature $f$. The Iterative Scaling algorithm combined with Monte Carlo simulation [Pietra, Pietra, and Lafferty 1995] is used to train the weights of this generative model. Unlike the commonly used conditional maximum entropy modeling, which approximates the feature configuration space by the training corpus [Ratnaparkhi 1998], Monte Carlo techniques are required in generative modeling to simulate the possible feature configurations. The exponential prior smoothing scheme [Goodman 2003] is adopted. The same training procedure is performed on Corpus I and Corpus II to estimate $\Pr(CS \mid \text{same sense})$ and $\Pr(CS \mid \text{different senses})$, respectively.</Paragraph>
</Section>
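The sketch below illustrates the flavour of this generative scoring; it is not the authors' implementation. The feature names and weights are invented, the induced features are expanded under one reading of the limit of 3 on logical AND combinations, and the normalization factor Z, which the paper estimates with Monte Carlo simulation, is omitted (only the unnormalized score is returned).

import math
from itertools import combinations

def induced_features(basic, max_and=3):
    """Expand a set of basic similarity features with logical-AND
    combinations; max_and is one reading of the paper's limit of 3."""
    feats = set(basic)
    for r in range(2, min(max_and, len(basic)) + 1):
        for combo in combinations(sorted(basic), r):
            feats.add(" AND ".join(combo))
    return feats

def unnormalized_score(features, weights):
    """exp(sum of feature weights); dividing by the normalization factor Z
    (estimated via Monte Carlo in the paper) would give Pr(CS).
    Features without a learned weight default to 0."""
    return math.exp(sum(weights.get(f, 0.0) for f in features))

# Toy usage with invented feature names and weights (same-sense model).
basic = {"VSM-Similarity-equal-to-2",
         "LSA-Trigger-Words-Similarity-equal-to-1",
         "LSA-Subject-Similarity-equal-to-2"}
weights_same = {"VSM-Similarity-equal-to-2": 0.8,
                "LSA-Trigger-Words-Similarity-equal-to-1": -0.3}
print(unnormalized_score(induced_features(basic), weights_same))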
<Section position="5" start_page="2" end_page="2" type="metho"> <SectionTitle> 4 Statistical Annealing </SectionTitle>
<Paragraph position="0"> With the maximum entropy modeling presented above, the WSD task is performed as follows: i) for a given set of contexts of a key word, the pairwise context similarities are computed; ii) for each pairwise context similarity, the generative probabilities under the same-sense and different-sense conditions are computed; iii) for a given WSD candidate solution $\{K, M\}$, the conditional probability of Expression (2) can be computed. Optimization based on statistical annealing [Neal 1993] is used to search for the $\{K, M\}$ that maximizes Expression (2).</Paragraph>
<Paragraph position="1"> The optimization process consists of two steps. First, a locally optimal solution is computed. Then, taking this solution as the initial state, statistical annealing is applied to search for the globally optimal solution. To reduce the search time, we set the maximum value of K to 5.</Paragraph>
</Section> </Paper>
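The two-step search of Section 4 can be pictured with the following simplified sketch. It is an illustration, not the authors' implementation: the move proposal (randomly reassigning one context), the cooling schedule, and the step count are assumptions, and the initial locally optimal solution is replaced here by an arbitrary starting assignment; only the cap of K at 5 comes from the paper. K is implied by the number of distinct cluster labels in the returned mapping.

import math
import random
from itertools import combinations

def log_objective(mapping, p_same, p_diff):
    """Log of Expression (2): sum of log pairwise generative probabilities,
    choosing the same-sense or different-sense model per Eq. (3)."""
    return sum(
        math.log(p_same[(i, j)] if mapping[i] == mapping[j] else p_diff[(i, j)])
        for i, j in combinations(range(len(mapping)), 2)
    )

def anneal(mapping, p_same, p_diff, max_k=5, t0=1.0, cooling=0.95, steps=2000):
    """Annealing-style search over {K, M}: propose moving one context to
    another cluster (K capped at max_k, as in the paper) and accept worse
    solutions with a temperature-dependent probability."""
    current, current_score = list(mapping), log_objective(mapping, p_same, p_diff)
    best, best_score = list(current), current_score
    t = t0
    for _ in range(steps):
        candidate = list(current)
        i = random.randrange(len(candidate))
        candidate[i] = random.randrange(1, max_k + 1)  # random reassignment move
        score = log_objective(candidate, p_same, p_diff)
        if score >= current_score or random.random() < math.exp((score - current_score) / t):
            current, current_score = candidate, score
            if score > best_score:
                best, best_score = list(candidate), score
        t *= cooling  # assumed geometric cooling schedule
    return best, best_score

# Toy usage: three contexts with invented pairwise probabilities.
p_same = {(0, 1): 0.6, (0, 2): 0.2, (1, 2): 0.3}
p_diff = {(0, 1): 0.4, (0, 2): 0.8, (1, 2): 0.7}
print(anneal([1, 1, 1], p_same, p_diff))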