<?xml version="1.0" standalone="yes"?> <Paper uid="W96-0104"> <Title>Learning similarity-based word sense disambiguation from sparse data</Title> <Section position="5" start_page="49" end_page="53" type="relat"> <SectionTitle> 3 Related work </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="49" end_page="51" type="sub_section"> <SectionTitle> 3.1 The knowledge acquisition bottleneck </SectionTitle> <Paragraph position="0"> Brown et al. (1991) and Gale et al. (1992) used the translations of the ambiguous word in a bilingual corpus as sense tags. This does not obviate the need for manual work, as producing bilingual corpora requires manual translation work. (Footnote 5: MRDs are, of course, also constructed manually, but, unlike bilingual corpora, they are existing resources, made for general use.) Dagan and Itai (1991) used a bilingual lexicon and a monolingual corpus, to save the need for translating the corpus. The problem remains, however, that the word translations do not necessarily overlap with the desired sense distinctions.</Paragraph> <Paragraph position="1"> Schütze (1992) clustered the examples in the training set and manually assigned each cluster a sense by observing 10-20 members of the cluster. Each sense was usually represented by several clusters. Although this approach significantly decreased the need for manual intervention, about a hundred examples still had to be tagged manually for each word. Moreover, the resulting clusters did not necessarily correspond to the desired sense distinctions.</Paragraph> <Paragraph position="2"> Yarowsky (1992) learned discriminators for each Roget's category, saving the need to separate the training set into senses. However, using such hand-crafted categories usually leads to a coverage problem for specific domains, or for domains other than the one for which the list of categories has been prepared.</Paragraph> <Paragraph position="3"> [Figure caption: similarity of the example sentence to the medicine and narcotic senses across iterations; the asterisk marks the plot for the narcotic sense. The sentence was "The American people and their government also woke up too late to the menace drugs posed to the moral structure of their country." The word menace, which is a hint for the narcotic sense in this sentence, did not help in the first iteration, because it did not appear in the narcotic feedback set at all. Thus, in iteration 1, the similarity of this sentence to the medicine sense was 0.15, vs. a similarity of 0.1 to the narcotic sense. In iteration 2, menace was learned to be similar to other narcotic-related words, yielding a small advantage for the narcotic sense. In iteration 3, further similarity values were updated, and there was a clear advantage to the narcotic sense (0.93, vs. 0.89 for medicine).]</Paragraph> <Paragraph position="4"> Using MRDs for WSD was suggested in (Lesk, 1986); several researchers subsequently continued and improved this line of work (Krovetz and Croft, 1989; Guthrie et al., 1991; Veronis and Ide, 1990). Unlike the information in a corpus, the information in the MRD definitions is presorted into senses. However, as noted above, the MRD definitions alone do not contain enough information to allow reliable disambiguation. Recently, Yarowsky (1995) combined an MRD and a corpus in a bootstrapping process. In that work, the definition words were used as initial sense indicators, automatically tagging the target word examples that contain them. These tagged examples were then used as seed examples in the bootstrapping process. In comparison, we suggest combining the corpus and the MRD further, by using all the corpus examples of the MRD definition words, instead of those words alone. This yields much more sense-presorted training information.</Paragraph> </Section>
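To make the proposed use of the MRD concrete, the following is a minimal sketch of how sense-presorted training material ("feedback sets") could be collected from the corpus examples of the MRD definition words. The function name build_feedback_sets, the toy definitions, the stop-word list, and the whitespace tokenization are illustrative assumptions of this sketch, not details of the original system.

```python
from typing import Dict, List, Set

def build_feedback_sets(
    sense_definitions: Dict[str, List[str]],
    corpus_sentences: List[List[str]],
    stop_words: Set[str],
) -> Dict[str, List[List[str]]]:
    """For each sense, collect every corpus sentence containing at least one
    content word of that sense's MRD definition.  The collected sentences form
    a sense-presorted "feedback set" of additional training examples."""
    feedback: Dict[str, List[List[str]]] = {s: [] for s in sense_definitions}
    for sense, definition in sense_definitions.items():
        # Content words of the MRD definition act as sense indicators.
        indicators = {w.lower() for w in definition} - stop_words
        for sentence in corpus_sentences:
            if indicators & {w.lower() for w in sentence}:
                feedback[sense].append(sentence)
    return feedback

if __name__ == "__main__":
    # Toy definitions for the two senses of "drug" discussed in the paper.
    senses = {
        "medicine": "a substance used in the treatment of disease".split(),
        "narcotic": "an addictive substance such as a narcotic".split(),
    }
    corpus = [
        "the new medicine proved effective in the treatment of ulcers".split(),
        "police seized a large shipment of addictive substances".split(),
    ]
    stops = {"a", "an", "as", "in", "of", "such", "the", "used"}
    for sense, sents in build_feedback_sets(senses, corpus, stops).items():
        print(sense, len(sents))
```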
<Section position="2" start_page="51" end_page="53" type="sub_section"> <SectionTitle> 3.2 The problem of sparse data </SectionTitle> <Paragraph position="0"> Most previous works define word similarity based on cooccurrence information, and hence face a severe problem of sparse data. Many of the possible cooccurrences are not observed even in a very large corpus (Church and Mercer, 1993). Our algorithm addresses this problem in two ways. First, we replace the all-or-none indicator of cooccurrence with a graded measure of contextual similarity.</Paragraph> <Paragraph position="1"> Our measure of similarity is transitive, allowing two words to be considered similar even if they neither appear in the same sentence nor share neighboring words. Second, we extend the training set by adding examples of related words. The performance of our system compares favorably with that of systems trained on sets larger by a factor of 100 (the results described in section 2 were obtained after learning from several dozen examples, compared with the thousands of examples used by other automatic methods).</Paragraph> <Paragraph position="2"> Traditionally, the problem of sparse data is approached by estimating the probability of unobserved cooccurrences from the actual cooccurrences in the training set. This can be done by smoothing the observed frequencies (Church and Mercer, 1993), or by class-based methods (Brown et al., 1991; Pereira and Tishby, 1992; Pereira et al., 1993; Hirschman, 1986; Resnik, 1992; Brill et al., 1990; Dagan et al., 1993). In comparison to these approaches, we use similarity information throughout training, and not merely for estimating cooccurrence statistics. This allows the system to learn successfully from very sparse data.</Paragraph> <Paragraph position="3"> A Appendix. A.1 Stopping conditions of the iterative algorithm. Let f_i be the increase in the similarity value in iteration i: f_i(X, Y) = sim_i(X, Y) - sim_{i-1}(X, Y) (9), where X, Y can be either words or sentences. For each item X, the algorithm stops updating its similarity values to other items (that is, updating its row in the similarity matrix) in the first iteration that satisfies max_Y f_i(X, Y) < e, where e > 0 is a preset threshold.</Paragraph> <Paragraph position="4"> According to this stopping condition, the algorithm terminates after at most 1/e iterations (otherwise, after 1/e iterations with each f_i > e, we would obtain sim(X, Y) > e * (1/e) = 1, in contradiction to the upper bound of 1 on the similarity values). (Footnote 6: The similarity sim_n(X, Y) is a non-decreasing function of the iteration number n, and the similarity values are bounded by 1; proofs are given in (Karov and Edelman, 1996).) We found that the best results are obtained within three iterations. After that, the disambiguation results tend not to change significantly, although the similarity values may continue to increase. Intuitively, the transitive exploration of similarities is exhausted after three iterations.</Paragraph>
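The following is a minimal sketch of the iteration driver implied by the stopping condition above: each row of the similarity matrix stops being updated once its largest increase f_i falls below the threshold e, and the whole process is capped at three iterations. The concrete row-update rule used here is only a bounded, non-decreasing stand-in, not the paper's actual word/sentence similarity update; the names iterate_similarity and smoothing_update are assumptions of the sketch.

```python
import numpy as np

def iterate_similarity(sim0: np.ndarray, update_row, eps: float = 0.05,
                       max_iters: int = 3) -> np.ndarray:
    """Run the iterative similarity computation with the A.1 stopping rule.

    sim0       -- initial similarity matrix, values in [0, 1]
    update_row -- callable(sim, i) returning a new row i; assumed to be
                  non-decreasing and bounded by 1 (cf. footnote 6)
    eps        -- a row stops being updated once max_Y f_i(X, Y) < eps
    max_iters  -- the paper reports the best results within three iterations
    """
    sim = sim0.copy()
    active = list(range(sim.shape[0]))       # rows that are still updated
    for _ in range(max_iters):
        if not active:
            break
        still_active = []
        for i in active:
            new_row = np.clip(update_row(sim, i), 0.0, 1.0)
            f_i = float(np.max(new_row - sim[i]))   # f_i(X, Y), equation (9)
            sim[i] = new_row
            if f_i >= eps:                   # increase still large: keep going
                still_active.append(i)
        active = still_active
    return sim

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim0 = rng.uniform(0.0, 0.3, size=(6, 6))

    def smoothing_update(sim, i):
        # Stand-in update: move row i toward a similarity-weighted average of
        # all rows, so that similarity can propagate transitively.
        weights = sim[i] / (sim[i].sum() + 1e-9)
        return np.maximum(sim[i], weights @ sim)

    print(iterate_similarity(sim0, smoothing_update).round(2))
```

In the demo, the stand-in update lets an item inherit similarity through its neighbors, which is what gives the transitive behavior its graded character.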
<Paragraph position="5"> A.2 Word weights. In our algorithm, the weight of a word estimates its expected contribution to the disambiguation task, and the extent to which the word is indicative of sentence similarity. The weights do not change with iterations. They are used to reduce the number of features to a manageable size, and to exclude words that are expected to be given unreliable similarity values. The weight of a word is a product of several factors: frequency in the corpus, the bias inherent in the training set, distance from the target word, and part-of-speech label (a sketch of the full computation is given after the list):</Paragraph> <Paragraph position="6"> 1. Global frequency. Frequent words are less informative of the sense and of the sentence similarity (e.g., the appearance of "this" in two different sentences does not indicate similarity between them, and does not indicate the sense of any target word). The contribution of frequency is max{0, 1 - freq(W)/max5freq}, where max5freq is a function of the five highest word frequencies in the corpus. This factor excludes only the most frequent words from further consideration. As long as the frequencies are not very high, it does not label W1, whose frequency is twice that of W2, as less informative.</Paragraph> <Paragraph position="7"> 2. Log likelihood factor. Words that are indicative of the sense usually appear in the training set more often than would be expected from their frequency in the general corpus. The log likelihood factor captures this tendency. It is computed as log(Pr(W_i | W) / Pr(W_i)), where Pr(W_i) is estimated from the frequency of W_i in the entire corpus, and Pr(W_i | W) from the frequency of W_i in the training set, given the examples of the current ambiguous word W (cf. (Gale et al., 1992)). To avoid poor estimation for words with a low count in the training set, we multiply the log likelihood by min{1, count(W)/10}, where count(W) is the number of occurrences of W in the training set.</Paragraph> <Paragraph position="8"> 3. Part of speech. Each part of speech is assigned an initial weight (1.0 for nouns and 0.6 for verbs).</Paragraph> <Paragraph position="9"> 4. Distance from the target word. Context words that are far from the target word are less indicative than nearby ones. The contribution of this factor is reciprocally related to the normalized distance.</Paragraph> <Paragraph position="10"> The total weight of a word is the product of the above factors, each normalized by the sum of the factors of the words in the sentence: weight(W_i, S) = factor(W_i, S) / sum_{W_j in S} factor(W_j, S), where factor(., .) is the weight before normalization.</Paragraph>
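As a worked illustration of the weighting scheme, the following sketch multiplies the four factors and normalizes them over the sentence, as in weight(W_i, S) above. The ContextWord structure, the clamping of the log-likelihood at zero, the reading of "reciprocally related to the normalized distance" as max_distance/distance, and the invented counts in the demo are assumptions of this sketch rather than details given in the paper.

```python
import math
from dataclasses import dataclass
from typing import Dict, List

POS_WEIGHT = {"noun": 1.0, "verb": 0.6}   # initial part-of-speech weights (A.2)

@dataclass
class ContextWord:
    text: str
    pos: str        # coarse part-of-speech tag, e.g. "noun" or "verb"
    distance: int   # distance in words from the target word (>= 1)

def raw_factor(w: ContextWord,
               corpus_freq: Dict[str, int], max5_freq: float,
               p_corpus: Dict[str, float], p_training: Dict[str, float],
               training_count: Dict[str, int], max_distance: int) -> float:
    """Unnormalized factor(W, S): the product of the four factors of A.2."""
    # 1. Global frequency: only the most frequent words are excluded.
    freq = max(0.0, 1.0 - corpus_freq.get(w.text, 0) / max5_freq)
    # 2. Log-likelihood ratio, damped for words rarely seen in training;
    #    clamping negative values to zero is an assumption of this sketch.
    loglik = math.log(max(p_training.get(w.text, 1e-9), 1e-9) /
                      max(p_corpus.get(w.text, 1e-9), 1e-9))
    loglik = max(0.0, loglik) * min(1.0, training_count.get(w.text, 0) / 10.0)
    # 3. Part of speech.
    pos = POS_WEIGHT.get(w.pos, 0.6)
    # 4. Distance: reciprocal of the normalized distance to the target word.
    dist = max_distance / w.distance
    return freq * loglik * pos * dist

def sentence_weights(sentence: List[ContextWord], **kw) -> List[float]:
    """weight(W_i, S) = factor(W_i, S) / sum over W_j in S of factor(W_j, S)."""
    factors = [raw_factor(w, **kw) for w in sentence]
    total = sum(factors) or 1.0
    return [f / total for f in factors]

if __name__ == "__main__":
    # Invented counts and probabilities, loosely inspired by the "menace" example.
    sentence = [ContextWord("menace", "noun", 3), ContextWord("woke", "verb", 6)]
    print(sentence_weights(
        sentence,
        corpus_freq={"menace": 40, "woke": 200}, max5_freq=50_000.0,
        p_corpus={"menace": 4e-5, "woke": 2e-4},
        p_training={"menace": 3e-3, "woke": 3e-4},
        training_count={"menace": 12, "woke": 4}, max_distance=10))
```

</Section> </Section> </Paper>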