<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1008"> <Title>Similarity-Based Methods For Word Sense Disambiguation</Title> <Section position="4" start_page="56" end_page="59" type="metho"> <SectionTitle> 2 Distributional Similarity Models </SectionTitle> <Paragraph position="0"> We wish to model conditional probability distributions arising from the coocurrence of linguistic objects, typically words, in certain configurations. We thus consider pairs (wl, w2) E Vi x V2 for appropriate sets 1/1 and V2, not necessarily disjoint. In what follows, we use subscript i for the i th element of a pair; thus P(w21wi) is the conditional probability (or rather, some empirical estimate, the true probability being unknown) that a pair has second element w2 given that its first element is wl; and P(wllw2) denotes the probability estimate, according to the base language model, that wl is the first word of a pair given that the second word is w2.</Paragraph> <Paragraph position="1"> P(w) denotes the base estimate for the unigram probability of word w.</Paragraph> <Paragraph position="2"> A similarity-based language model consists of three parts: a scheme for deciding which word pairs require a similarity-based estimate, a method for combining information from similar words, and, of course, a function measuring the similarity between words. We give the details of each of these three parts in the following three sections. We will only be concerned with similarity between words in V1.</Paragraph> <Paragraph position="3"> 1To the best of our &quot;knowledge, this is the first use of this particular distribution dissimilarity function in statistical language processing. The function itself is implicit in earlier work on distributional clustering (Pereira, Tishby, and Lee, 1993}, has been used by Tishby (p.e.) in other distributional similarity work, and, as suggested by Yoav Freund (p.c.), it is related to results of Hoeffding (1965) on the probability that a given sample was drawn from a given joint distribution.</Paragraph> <Section position="1" start_page="56" end_page="57" type="sub_section"> <SectionTitle> 2.1 Discounting and Redistribution </SectionTitle> <Paragraph position="0"> Data sparseness makes the maximum likelihood estimate (MLE) for word pair probabilities unreliable. The MLE for the probability of a word pair (Wl, w2), conditional on the appearance of</Paragraph> <Paragraph position="2"> where c(wl, w2) is the frequency of (wl, w2) in the training corpus and c(wl) is the frequency of wt. However, PML is zero for any unseen word pair, which leads to extremely inaccurate estimates for word pair probabilities.</Paragraph> <Paragraph position="3"> Previous proposals for remedying the above problem (Good, 1953; Jelinek, Mercer, and Roukos, 1992; Katz, 1987; Church and Gale, 1991) adjust the MLE in so that the total probability of seen word pairs is less than one, leaving some probability mass to be redistributed among the unseen pairs. 
In general, the adjustment involves either interpolation, in which the MLE is used in linear combination with an estimator guaranteed to be nonzero for unseen word pairs, or discounting, in which a reduced MLE is used for seen word pairs, with the probability mass left over from this reduction used to model unseen pairs.</Paragraph> <Paragraph position="4"> The discounting approach is the one adopted by Katz (1987):
P̂(w2|w1) = Pd(w2|w1) if c(w1, w2) > 0, and α(w1) Pr(w2|w1) otherwise,    (2)
where Pd represents the Good-Turing discounted estimate (Katz, 1987) for seen word pairs, Pr denotes the model for probability redistribution among the unseen word pairs, and α(w1) is a normalization factor.</Paragraph> <Paragraph position="7"> Following Dagan, Pereira, and Lee (1994), we modify Katz's formulation by writing Pr(w2|w1) instead of P(w2), enabling us to use similarity-based estimates for unseen word pairs instead of basing the estimate for the pair on the unigram frequency P(w2). Observe that similarity estimates are used for unseen word pairs only. We next investigate estimates for Pr(w2|w1) derived by averaging information from words that are distributionally similar to w1.</Paragraph> </Section> <Section position="2" start_page="57" end_page="57" type="sub_section"> <SectionTitle> 2.2 Combining Evidence </SectionTitle> <Paragraph position="0"> Similarity-based models assume that if word w1' is "similar" to word w1, then w1' can yield information about the probability of unseen word pairs involving w1. We use a weighted average of the evidence provided by similar words, where the weight given to a particular word w1' depends on its similarity to w1.</Paragraph> <Paragraph position="1"> More precisely, let W(w1, w1') denote an increasing function of the similarity between w1 and w1', and let S(w1) denote the set of words most similar to w1. Then the general form of similarity model we consider is a W-weighted linear combination of the predictions of similar words:
P_SIM(w2|w1) = (1 / N(w1)) Σ_{w1' ∈ S(w1)} W(w1, w1') P(w2|w1'),    (3)
where N(w1) = Σ_{w1' ∈ S(w1)} W(w1, w1') is a normalization factor. According to this formula, w2 is more likely to occur with w1 if it tends to occur with the words that are most similar to w1.</Paragraph> <Paragraph position="4"> Considerable latitude is allowed in defining the set S(w1), as is evidenced by previous work that can be put in the above form. Essen and Steinbiss (1992) and Karov and Edelman (1996) (implicitly) set S(w1) = V1. However, it may be desirable to restrict S(w1) in some fashion, especially if V1 is large. For instance, Dagan, Pereira, and Lee (1994) use the closest k or fewer words w1' such that the dissimilarity between w1 and w1' is less than a threshold value t; k and t are tuned experimentally.</Paragraph> <Paragraph position="5"> Now, we could directly replace Pr(w2|w1) in the back-off equation (2) with P_SIM(w2|w1).</Paragraph> <Paragraph position="6"> However, other variations are possible, such as interpolating with the unigram probability:
Pr(w2|w1) = γ P(w2) + (1 − γ) P_SIM(w2|w1),
where γ is determined experimentally (Dagan, Pereira, and Lee, 1994).
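The following Python sketch (an illustration, not the paper's implementation) shows one way equation (3) could be realized; the arguments cond_prob, weight, and similar_words are stand-ins for the base language model P(·|·), the similarity weight W, and the neighbor set S(w1).

```python
from typing import Callable, Iterable

def p_sim(w2: str,
          w1: str,
          cond_prob: Callable[[str, str], float],        # P(w2|w1') from the base model
          weight: Callable[[str, str], float],           # W(w1, w1')
          similar_words: Callable[[str], Iterable[str]]  # S(w1)
          ) -> float:
    """Similarity-weighted estimate of equation (3):
    P_SIM(w2|w1) = (1/N(w1)) * sum over w1' in S(w1) of W(w1, w1') * P(w2|w1')."""
    neighbors = list(similar_words(w1))
    norm = sum(weight(w1, w1p) for w1p in neighbors)      # N(w1)
    if norm == 0.0:
        return 0.0
    return sum(weight(w1, w1p) * cond_prob(w2, w1p) for w1p in neighbors) / norm
```

In the experiments reported in Section 3, S(w1) is simply all of V1, so similar_words would return the entire first-word vocabulary.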
This represents, in effect, a linear combination of the similarity estimate and the back-off estimate: if γ = 1, then we have exactly Katz's back-off scheme.</Paragraph> <Paragraph position="9"> As we focus in this paper on alternatives for P_SIM, we will not consider this approach here; that is, for the rest of this paper, Pr(w2|w1) = P_SIM(w2|w1).</Paragraph> </Section> <Section position="3" start_page="57" end_page="59" type="sub_section"> <SectionTitle> 2.3 Measures of Similarity </SectionTitle> <Paragraph position="0"> We now consider several word similarity functions that can be derived automatically from the statistics of a training corpus, as opposed to functions derived from manually-constructed word classes (Resnik, 1992). All the similarity functions we describe below depend just on the base language model P(·|·), not the discounted model P̂(·|·) from Section 2.1 above.</Paragraph> <Paragraph position="1"> Kullback-Leibler (KL) divergence is a standard information-theoretic measure of the dissimilarity between two probability mass functions (Cover and Thomas, 1991). We can apply it to the conditional distribution P(·|w1) induced by w1 on words in V2:
D(w1 || w1') = Σ_{w2} P(w2|w1) log [ P(w2|w1) / P(w2|w1') ].
For D(w1 || w1') to be defined it must be the case that P(w2|w1') > 0 whenever P(w2|w1) > 0. Unfortunately, this will not in general be the case for MLEs based on samples, so we would need smoothed estimates of P(w2|w1') that redistribute some probability mass to zero-frequency events. However, using smoothed estimates for P(w2|w1) as well requires a sum over all w2 ∈ V2, which is expensive for the large vocabularies under consideration. Given the smoothed denominator distribution, we set W(w1, w1') = 10^(−β D(w1||w1')), where β is a free parameter.</Paragraph> <Paragraph position="4"> 2.3.2 Total divergence to the average A related measure is based on the total KL divergence to the average of the two distributions:
A(w1, w1') = D(w1 || avg(w1, w1')) + D(w1' || avg(w1, w1')),
where avg(w1, w1') denotes the averaged distribution (P(·|w1) + P(·|w1'))/2. It is straightforward to show by grouping terms appropriately that
A(w1, w1') = 2 log 2 + Σ_{w2} [ P(w2|w1) log ( P(w2|w1) / (P(w2|w1) + P(w2|w1')) ) + P(w2|w1') log ( P(w2|w1') / (P(w2|w1) + P(w2|w1')) ) ].
Hence A(w1, w1') is bounded, ranging between 0 and 2 log 2, and smoothed estimates are not required because probability ratios are not involved. In addition, the calculation of A(w1, w1') requires summing only over those w2 for which P(w2|w1) and P(w2|w1') are both non-zero, which, for sparse data, makes the computation quite fast.</Paragraph> <Paragraph position="9"> As in the KL divergence case, we set W(w1, w1') to be 10^(−β A(w1, w1')).</Paragraph> <Paragraph position="10"> The L1 norm is defined as
L(w1, w1') = Σ_{w2} |P(w2|w1) − P(w2|w1')| = 2 − 2 Σ_{w2} min( P(w2|w1), P(w2|w1') ).
This last form makes it clear that 0 ≤ L(w1, w1') ≤ 2, with equality (L = 2) if and only if there are no words w2 such that both P(w2|w1) and P(w2|w1') are strictly positive.</Paragraph> <Paragraph position="13"> Since we require a weighting scheme that is decreasing in L, we set W(w1, w1') = (2 − L(w1, w1'))^β, with β again free.</Paragraph> <Paragraph position="14"> Essen and Steinbiss (1992) introduced the confusion probability,2 which estimates the probability that word w1' can be substituted for word w1:
Pc(w1'|w1) = Σ_{w2} P(w1'|w2) P(w2|w1).
Unlike with the measures described above, w1 may not necessarily be the "closest" word to itself; that is, there may exist a word w1' such that Pc(w1'|w1) > Pc(w1|w1).</Paragraph> <Paragraph position="17"> The confusion probability can be computed from empirical estimates provided all unigram estimates are nonzero (as we assume throughout).
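As a concrete, unofficial rendering of these definitions, the sketch below computes the total divergence to the average, the L1 distance, and the confusion probability from sparse conditional distributions stored as Python dictionaries, together with the weighting schemes just described. The dictionary-based representation and the helper names are choices made here, not part of the paper.

```python
import math

def total_divergence_to_average(p, q):
    """A(w1, w1'): total KL divergence of p = P(.|w1) and q = P(.|w1') to their
    average, computed over sparse distributions given as dicts {w2: prob}."""
    total = 0.0
    for w2 in set(p) | set(q):
        pi, qi = p.get(w2, 0.0), q.get(w2, 0.0)
        avg = 0.5 * (pi + qi)
        if pi > 0.0:
            total += pi * math.log(pi / avg)
        if qi > 0.0:
            total += qi * math.log(qi / avg)
    return total  # lies between 0 and 2*log(2)

def l1_distance(p, q):
    """L(w1, w1') = sum over w2 of |P(w2|w1) - P(w2|w1')|; lies between 0 and 2."""
    return sum(abs(p.get(w2, 0.0) - q.get(w2, 0.0)) for w2 in set(p) | set(q))

def confusion_probability(w1_prime, p_w1, reverse_cond):
    """Pc(w1'|w1) = sum over w2 of P(w1'|w2) * P(w2|w1), where p_w1 is the sparse
    dict P(.|w1) and reverse_cond(w1', w2) is a caller-supplied lookup of P(w1'|w2)."""
    return sum(reverse_cond(w1_prime, w2) * prob for w2, prob in p_w1.items())

def weight_from_A(p, q, beta):
    """W(w1, w1') = 10 ** (-beta * A(w1, w1'))."""
    return 10.0 ** (-beta * total_divergence_to_average(p, q))

def weight_from_L(p, q, beta):
    """W(w1, w1') = (2 - L(w1, w1')) ** beta."""
    return (2.0 - l1_distance(p, q)) ** beta
```

Whether the logarithm is natural or base 2 only rescales A by a constant, and that rescaling can be absorbed into the free parameter β.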
In fact, the use of smoothed estimates like those of Katz's back-off scheme is problematic, because those estimates typically do not preserve consistency with respect to marginal estimates and Bayes's rule. However, using consistent estimates (such as the MLE), we can rewrite Pc as follows:
Pc(w1'|w1) = Σ_{w2} P(w1', w2) P(w2|w1) / P(w2).</Paragraph> <Paragraph position="19"> This form reveals another important difference between the confusion probability and the functions D, A, and L described in the previous sections. Those functions rate w1' as similar to w1 if, roughly, P(w2|w1') is high when P(w2|w1) is.</Paragraph> <Paragraph position="20"> Pc(w1'|w1), however, is greater for those w1' for which P(w1', w2) is large when P(w2|w1)/P(w2) is. When the ratio P(w2|w1)/P(w2) is large, we may think of w2 as being exceptional, since if w2 is infrequent, we do not expect P(w2|w1) to be large.</Paragraph> <Paragraph position="21"> Several features of the measures of similarity listed above are summarized in Table 1. "Base LM constraints" are conditions that must be satisfied by the probability estimates of the base language model. The last column indicates whether the weight W(w1, w1') associated with each similarity function depends on a parameter that needs to be tuned experimentally.</Paragraph> <Paragraph position="22"> 2 Actually, they present two alternative definitions. We use their model 2-B, which they found yielded the best experimental results.</Paragraph> </Section> </Section> <Section position="5" start_page="59" end_page="61" type="metho"> <SectionTitle> 3 Experimental Results </SectionTitle> <Paragraph position="0"> We evaluated the similarity measures listed above on a word sense disambiguation task, in which each method is presented with a noun and two verbs, and decides which verb is more likely to have the noun as a direct object. Thus, we do not measure the absolute quality of the assignment of probabilities, as would be the case in a perplexity evaluation, but rather the relative quality. We are therefore able to ignore constant factors, and so we neither normalize the similarity measures nor calculate the denominator in equation (3).</Paragraph> <Section position="1" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 3.1 Task: Pseudo-word Sense Disambiguation </SectionTitle> <Paragraph position="0"> In the usual word sense disambiguation problem, the method to be tested is presented with an ambiguous word in some context, and is asked to identify the correct sense of the word from the context. For example, a test instance might be the sentence fragment "robbed the bank"; the disambiguation method must decide whether "bank" refers to a river bank, a savings bank, or perhaps some other alternative.</Paragraph> <Paragraph position="1"> While sense disambiguation is clearly an important task, it presents numerous experimental difficulties. First, the very notion of "sense" is not clearly defined; for instance, dictionaries may provide sense distinctions that are too fine or too coarse for the data at hand. Also, one needs to have training data for which the correct senses have been assigned, which can require considerable human effort.</Paragraph> <Paragraph position="2"> To circumvent these and other difficulties, we set up a pseudo-word disambiguation experiment (Schütze, 1992; Gale, Church, and Yarowsky, 1992), the general format of which is as follows. We first construct a list of pseudo-words, each of which is the combination of two different words in V2. Each word in V2 contributes to exactly one pseudo-word. Then, we replace each w2 in the test set with its corresponding pseudo-word. For example, if we choose to create a pseudo-word out of the words "make" and "take", we would change the test data like this:
make plans ⇒ {make, take} plans
take action ⇒ {make, take} action
The method being tested must choose between the two words that make up the pseudo-word.</Paragraph> </Section>
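A toy Python sketch of the pseudo-word construction just described follows; it is illustrative only, and the random pairing and invented vocabulary are simplifications (the paper pairs verbs of similar frequency, as noted in Section 3.2).

```python
import random

def make_pseudo_words(vocabulary_v2, seed=0):
    """Pair up the words of V2 so that each word belongs to exactly one
    pseudo-word. The random pairing here is a simplification: the paper pairs
    verbs of similar frequency to control for frequency effects."""
    words = list(vocabulary_v2)
    random.Random(seed).shuffle(words)
    return {w: frozenset(pair)
            for pair in zip(words[::2], words[1::2]) for w in pair}

def disguise(test_pairs, pseudo):
    """Replace each w2 in (w1, w2) test pairs with its pseudo-word, keeping the
    original w2 around as the answer key."""
    return [(w1, pseudo[w2], w2) for w1, w2 in test_pairs]

# Toy illustration with invented data: (w1, w2) = (object-noun, verb).
pseudo = make_pseudo_words(["make", "take", "buy", "sell"])
print(disguise([("plans", "make"), ("action", "take")], pseudo))
```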
<Section position="2" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 3.2 Data </SectionTitle> <Paragraph position="0"> We used a statistical part-of-speech tagger (Church, 1988) and pattern matching and concordancing tools (due to David Yarowsky) to identify transitive main verbs and the head nouns of the corresponding direct objects in 44 million words of 1988 Associated Press newswire.</Paragraph> <Paragraph position="1"> We selected the noun-verb pairs for the 1000 most frequent nouns in the corpus. These pairs are undoubtedly somewhat noisy given the errors inherent in the part-of-speech tagging and pattern matching.</Paragraph> <Paragraph position="2"> We used 80%, or 587833, of the pairs so derived for building base bigram language models, reserving 20% for testing purposes. As some, but not all, of the similarity measures require smoothed language models, we calculated both a Katz back-off language model (P = P̂, equation (2), with Pr(w2|w1) = P(w2)) and a maximum-likelihood model (P = P_ML). Furthermore, we wished to investigate Katz's claim that one can delete singletons, word pairs that occur only once, from the training set without affecting model performance (Katz, 1987); our training set contained 82407 singletons. We therefore built four base language models, summarized in Table 2.</Paragraph> <Paragraph position="3"> Since we wished to test the effectiveness of using similarity for unseen word cooccurrences, we removed from the test set any verb-object pairs that occurred in the training set; this resulted in 17152 unseen pairs (some occurred multiple times). The unseen pairs were further divided into five equal-sized parts, T1 through T5, which formed the basis for fivefold cross-validation: in each of five runs, one of the Ti was used as a performance test set, with the other four sets combined into one set used for tuning parameters (if necessary) via a simple grid search. Finally, test pseudo-words were created from pairs of verbs with similar frequencies, so as to control for word frequency in the decision task. We use error rate as our performance metric, defined as (# of incorrect choices + (# of ties)/2) / N, where N is the size of the test corpus. A tie occurs when the two words making up a pseudo-word are deemed equally likely.</Paragraph> </Section> <Section position="3" start_page="59" end_page="59" type="sub_section"> <SectionTitle> 3.3 Baseline Experiments </SectionTitle> <Paragraph position="0"> The performances of the four base language models are shown in Table 3. MLE-1 and MLE-o1 both have error rates of exactly .5 because the test sets consist of unseen bigrams, which are all assigned a probability of 0 by maximum-likelihood estimates, and thus are all ties for this method. The back-off models BO-1 and BO-o1 also perform similarly.</Paragraph> <Paragraph position="1"> Since the back-off models consistently performed worse than the MLE models, we chose to use only the MLE models in our subsequent experiments. Therefore, we only ran comparisons between the measures that could utilize unsmoothed data, namely, the L1 norm, L(w1, w1'); the total divergence to the average, A(w1, w1'); and the confusion probability, Pc(w1'|w1).3 In the full paper, we give detailed examples showing the different neighborhoods induced by the different measures, which we omit here for reasons of space.</Paragraph> </Section>
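A small sketch of the decision rule and the tie-aware error rate described in Sections 3.1 and 3.2 is given below. It is not the paper's code: the estimate argument stands in for whichever model (MLE, back-off, or similarity-based) is being evaluated, the grid search helper is schematic, and the test items are invented. With an estimator that gives every unseen pair probability zero, every item is a tie, reproducing the baseline error rate of exactly .5.

```python
def choose(noun, verb_a, verb_b, estimate):
    """Pick the verb with the higher estimated P(verb|noun); None signals a tie."""
    pa, pb = estimate(verb_a, noun), estimate(verb_b, noun)
    if pa == pb:
        return None
    return verb_a if pa > pb else verb_b

def error_rate(test_items, estimate):
    """(# of incorrect choices + (# of ties)/2) / N over items of the form
    (noun, frozenset({v1, v2}), correct_verb)."""
    errors = 0.0
    for noun, pseudo_word, correct in test_items:
        v1, v2 = sorted(pseudo_word)
        decision = choose(noun, v1, v2, estimate)
        if decision is None:
            errors += 0.5          # ties count as half an error
        elif decision != correct:
            errors += 1.0
    return errors / len(test_items)

def tune_beta(tuning_error, candidates):
    """Simple grid search: tuning_error(beta) is assumed to return the error
    rate of a similarity-based method on the tuning portion for that beta."""
    return min(candidates, key=tuning_error)

# Invented test items; an all-zero estimator makes every item a tie (error .5).
items = [("plans", frozenset({"make", "take"}), "make"),
         ("action", frozenset({"make", "take"}), "take")]
print(error_rate(items, lambda verb, noun: 0.0))  # 0.5
```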
<Section position="4" start_page="59" end_page="61" type="sub_section"> <SectionTitle> 3.4 Performance of Similarity-Based Methods </SectionTitle> <Paragraph position="0"> Figure 1 shows the results on the five test sets, using MLE-1 as the base language model. The parameter β was always set to the optimal value for the corresponding training set. RAND, which is shown for comparison purposes, simply chooses the weights W(w1, w1') randomly. S(w1) was set equal to V1 in all cases.</Paragraph> <Paragraph position="2"> The similarity-based methods consistently outperform the MLE method (which, recall, always has an error rate of .5) and Katz's back-off method (which always had an error rate of about .51) by a huge margin; we therefore conclude that information from other word pairs is very useful for unseen pairs, where unigram frequency is not informative. The similarity-based methods also do much better than RAND, which indicates that it is not enough to simply combine information from other words arbitrarily: it is quite important to take word similarity into account. In all cases, A edged out the other methods. The average improvement from using A instead of Pc is .0082; this difference is significant at the .1 level (p < .085), according to the paired t-test.</Paragraph> <Paragraph position="3"> 3 It should be noted, however, that on BO-1 data, KL divergence performed slightly better than the L1 norm.
[Figure 1 caption (fragment): the base language model was MLE-1. The methods, going from left to right, are RAND, Pc, L, and A. The performances shown are for settings of β that were optimal for the corresponding training set. β ranged from 4.0 to 4.5 for L and from 10 to 13 for A.]</Paragraph> <Paragraph position="4"> The results for the MLE-o1 case are depicted in Figure 2. Again, we see the similarity-based methods achieving far lower error rates than the MLE, back-off, and RAND methods, and again, A always performed the best. However, with singletons omitted the difference between A and Pc is even greater, the average difference being .024, which is significant at the .01 level (paired t-test).</Paragraph> <Paragraph position="5"> An important observation is that all methods, including RAND, were much more effective if singletons were included in the base language model; thus, in the case of unseen word pairs, Katz's claim that singletons can be safely ignored in the back-off model does not hold for similarity-based models.</Paragraph> </Section> </Section> </Paper>