<?xml version="1.0" standalone="yes"?>
<Paper uid="W04-0846">
<Title>Context Clustering for Word Sense Disambiguation Based on Modeling Pairwise Context Similarities</Title>
<Section position="2" start_page="0" end_page="0" type="intro">
<SectionTitle>1 Introduction</SectionTitle>
<Paragraph position="0"> Word Sense Disambiguation (WSD) is one of the central problems in Natural Language Processing.</Paragraph>
<Paragraph position="1"> The difficulty of this task lies in the fact that the context features and their statistical distributions differ for each individual word.</Paragraph>
<Paragraph position="2"> Traditionally, WSD involves modeling the contexts of each word. [Gale et al. 1992] uses the Naive Bayes method for context modeling, which requires a manually sense-annotated corpus for each ambiguous word. This causes a serious Knowledge Bottleneck. The situation is worse when the domain dependency of word senses is taken into account. To avoid the Knowledge Bottleneck, unsupervised or weakly supervised learning approaches have been proposed, including the bootstrapping approach [Yarowsky 1995] and the context clustering approach [Schutze 1998].</Paragraph>
<Paragraph position="3"> Although these unsupervised or weakly supervised learning approaches are less subject to the Knowledge Bottleneck, several weaknesses remain: i) for each individual keyword, the number of senses has to be provided, and in the bootstrapping case, seeds for each sense are also required; ii) the modeling usually assumes some form of evidence independence, e.g. the vector space model used in [Schutze 1998] and [Niu et al. 2003], which limits the performance and its potential enhancement; iii) most WSD systems use either selectional restrictions from parsing relations or trigger words co-occurring within a window around the ambiguous word, or both. We previously attempted to combine both types of evidence but achieved only limited improvement, due to the lack of a proper model of the overlap between the two information sources [Niu et al. 2003].</Paragraph>
<Paragraph position="4"> This paper presents a new algorithm that addresses these problems. A novel context clustering scheme, based on modeling the similarities between pairwise contexts at category level, is presented in the Bayesian framework. A generative maximum entropy model is then trained to represent the generative probability distribution of pairwise context similarities, based on heterogeneous features that cover both co-occurring words and parsing structures. Statistical annealing is used to derive the final context clusters by globally fitting the pairwise context similarities.</Paragraph>
<Paragraph position="5"> This new algorithm requires only a limited amount of existing annotated corpus to train the generative maximum entropy model for the entire vocabulary. This capability rests on the observation that a system does not necessarily require training data for word A in order to disambiguate A. The insight is that the correlation regularity between the sense distinction and the context distinction can be captured at category level, independent of individual words.</Paragraph>
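As a concrete illustration of this two-stage scheme, the sketch below is a minimal rendering under stated assumptions, not the paper's implementation: it scores each context pair with a log-linear (maximum entropy style) model over two assumed features, co-occurring-word overlap and parsing-relation overlap, then derives clusters by simulated annealing against those pairwise scores. The feature set, weights, cluster count, and annealing schedule are all illustrative assumptions.

import math
import random

def similarity_features(ctx_a, ctx_b):
    """Heterogeneous features for one context pair: co-occurring-word
    overlap and parsing-relation overlap, both as Jaccard ratios.
    (Illustrative features; the paper's feature set is richer.)"""
    def jaccard(x, y):
        union = x | y
        return len(x & y) / len(union) if union else 0.0
    return {
        "word_overlap": jaccard(ctx_a["words"], ctx_b["words"]),
        "parse_overlap": jaccard(ctx_a["relations"], ctx_b["relations"]),
    }

def maxent_similarity(ctx_a, ctx_b, weights, bias=0.0):
    """P(same sense | context pair) under a log-linear model. The weights
    would be trained once, at category level, on existing annotated data,
    so no per-word training corpus is needed."""
    feats = similarity_features(ctx_a, ctx_b)
    score = bias + sum(weights[name] * value for name, value in feats.items())
    return 1.0 / (1.0 + math.exp(-score))

def annealed_clustering(contexts, weights, k, steps=5000, temp=1.0, cooling=0.999):
    """Derive context clusters by simulated annealing: propose single-context
    reassignments and accept or reject them so that the final assignment
    globally fits the pairwise similarity scores."""
    n = len(contexts)
    sim = [[maxent_similarity(contexts[i], contexts[j], weights)
            for j in range(n)] for i in range(n)]
    assign = [random.randrange(k) for _ in range(n)]

    def log_fit(a):
        # Log-likelihood of an assignment: similar pairs should share a
        # cluster, dissimilar pairs should not.
        total = 0.0
        for i in range(n):
            for j in range(i + 1, n):
                p = sim[i][j]
                total += math.log(p if a[i] == a[j] else 1.0 - p)
        return total

    current = log_fit(assign)
    for _ in range(steps):
        i = random.randrange(n)
        old_label = assign[i]
        assign[i] = random.randrange(k)
        proposed = log_fit(assign)
        # Always accept improvements; accept setbacks with a probability
        # that shrinks as the temperature cools.
        if proposed >= current or random.random() < math.exp((proposed - current) / temp):
            current = proposed
        else:
            assign[i] = old_label
        temp *= cooling
    return assign

In a quick test, each context would be a dict such as {"words": {"bank", "money", "loan"}, "relations": {("deposit", "obj")}} with hand-set weights like {"word_overlap": 4.0, "parse_overlap": 2.0}; in the paper's setting the weights come from category-level maximum entropy training rather than by hand, and the sense number k is not needed per word in the same way as in the bootstrapping approaches criticized above.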
<Paragraph position="6"> In what follows, Section 2 formulates WSD as a context clustering task based on the pairwise context similarity model. The context clustering algorithm is described in Sections 3 and 4, corresponding to the two key aspects of the algorithm, i.e. the generative maximum entropy modeling and the annealing-based optimization.</Paragraph>
<Paragraph position="7"> Section 5 presents the benchmarks and conclusions.</Paragraph>
</Section>
</Paper>