Similarity-Based Methods For Word Sense Disambiguation

1 Introduction

The problem of data sparseness affects all statistical methods for natural language processing. Even large training sets tend to misrepresent low-probability events, since rare events may not appear in the training corpus at all.

We concentrate here on the problem of estimating the probability of unseen word pairs, that is, pairs that do not occur in the training set. Katz's back-off scheme (Katz, 1987), widely used in bigram language modeling, estimates the probability of an unseen bigram from unigram estimates. This has the undesirable result of assigning the same probability to all unseen bigrams whose component unigrams have the same frequencies.

Class-based methods (Brown et al., 1992; Pereira, Tishby, and Lee, 1993; Resnik, 1992) cluster words into classes of similar words, so that the estimate of a word pair's probability can be based on the averaged cooccurrence probability of the classes to which the two words belong. A word is then modeled by the average behavior of many words, however, which may cause the given word's idiosyncrasies to be ignored. For instance, the word "red" might well act like a generic color word in most cases, but it has distinctive cooccurrence patterns with respect to words like "apple," "banana," and so on.

We therefore consider similarity-based estimation schemes that do not require building general word classes. Instead, estimates for the words most similar to a word w are combined; the evidence provided by a word w' is weighted by a function of its similarity to w. Dagan, Marcus, and Markovitch (1993) propose such a scheme for predicting which unseen cooccurrences are more likely than others. However, their scheme does not assign probabilities. In what follows, we focus on probabilistic similarity-based estimation methods.

We compared several such methods, including that of Dagan, Pereira, and Lee (1994) and the cooccurrence smoothing method of Essen and Steinbiss (1992), against classical estimation methods, including Katz's back-off, in a decision task involving unseen pairs of direct objects and verbs, in which unigram frequency was eliminated as a factor. We found that all the similarity-based schemes performed almost 40% better than back-off, which is expected to yield about 50% accuracy in our experimental setting. Furthermore, a scheme based on the total divergence of empirical distributions to their average yielded a statistically significant improvement in error rate over cooccurrence smoothing.

We also investigated the effect of removing extremely low-frequency events from the training set. We found that, in contrast to back-off smoothing, where such events are often discarded from training with little discernible effect, similarity-based smoothing methods suffer noticeable performance degradation when singletons (events that occur exactly once) are omitted.
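To make the back-off behavior criticized above concrete, the following is a minimal Python sketch of a back-off bigram estimator. It is not Katz's exact formulation: Katz (1987) uses Good-Turing discounting, whereas this sketch substitutes simple absolute discounting, and all names and data structures are illustrative. It does exhibit the property at issue: the estimate for an unseen bigram depends only on the unigram frequency of the second word.

def backoff_prob(w1, w2, bigrams, unigrams, d=0.5):
    """Sketch of P(w2 | w1) with back-off to unigram estimates.
    `bigrams` maps (w1, w2) pairs to counts; `unigrams` maps words to counts."""
    c1 = unigrams.get(w1, 0)
    c12 = bigrams.get((w1, w2), 0)
    total = sum(unigrams.values())
    if c1 == 0:
        return unigrams.get(w2, 0) / total   # no information about w1 at all
    if c12 > 0:
        return (c12 - d) / c1                # discounted ML estimate for seen pairs
    # Probability mass freed by discounting the seen successors of w1 ...
    seen = {w for (u, w) in bigrams if u == w1}
    alpha = d * len(seen) / c1
    # ... is redistributed over unseen successors in proportion to unigram counts,
    # so two unseen bigrams with equally frequent second words get equal probability.
    unseen_mass = sum(c for w, c in unigrams.items() if w not in seen)
    return alpha * unigrams.get(w2, 0) / unseen_mass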
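The similarity-based schemes considered here share the generic form sketched below in Python, directly formalizing the description in the introduction. The components neighbors (the set S(w1) of words most similar to w1), weight (the similarity-derived weight W(w1, w1')), and cond_prob (the base estimate P(w2 | w1')) are hypothetical placeholders; each concrete method instantiates them differently.

def similarity_estimate(w1, w2, neighbors, weight, cond_prob):
    """Generic similarity-based estimate:
        P(w2 | w1) ~= sum over w1' in S(w1) of  W(w1, w1')/Z * P(w2 | w1'),
    where Z normalizes the weights to sum to one."""
    similar = neighbors(w1)                      # most similar words to w1
    z = sum(weight(w1, w1p) for w1p in similar)  # normalizing constant Z
    return sum(weight(w1, w1p) * cond_prob(w2, w1p) for w1p in similar) / z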
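Finally, a note on the "total divergence of empirical distributions to their average" mentioned above. The formula below is our gloss under the standard definitions, not a quotation from the paper: for two empirical distributions q_1 and q_2, the measure is

A(q_1, q_2) = D\left(q_1 \,\middle\|\, \frac{q_1 + q_2}{2}\right) + D\left(q_2 \,\middle\|\, \frac{q_1 + q_2}{2}\right),
\qquad
D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)},

where D is the Kullback-Leibler divergence. Unlike D itself, A is symmetric and always finite, since each q_i assigns positive probability only where their average does.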