<?xml version="1.0" standalone="yes"?> <Paper uid="J03-4004"> <Title>Disambiguating Nouns, Verbs, and Adjectives Using Automatically Acquired Selectional Preferences</Title> <Section position="5" start_page="644" end_page="648" type="metho"> <SectionTitle> 4. Disambiguation </SectionTitle> <Paragraph position="0"> Nouns, adjectives, and verbs are disambiguated by finding the sense (nc, vc, or ac) with the maximum probability estimate in the given context. The method disambiguates nouns and verbs to the WordNet synset level and adjectives to a coarse-grained level of WordNet synsets linked by the similar-to relation, as described previously.</Paragraph> <Section position="1" start_page="644" end_page="647" type="sub_section"> <SectionTitle> 4.1 Disambiguating Nouns </SectionTitle> <Paragraph position="0"> Nouns are disambiguated when they occur as subjects or direct objects and when modified by adjectives. We obtain a probability estimate for each nc to which the target noun belongs, using the distribution of the TCM associated with the co-occurring verb or adjective and the grammatical relationship.</Paragraph> <Paragraph position="1"> Li and Abe used TCMs for the task of structural disambiguation. To obtain probability estimates for noun senses occurring at classes beneath hypernyms on the cut, Li and Abe used the probability estimate at the nc′ on the cut divided by the number of ns descendants, as we do when finding G during training, so the probability estimate is shared equally among all nouns in the nc′ subtree: p(ns | vc, gr) = p(nc′ | vc, gr) / |descendants of nc′|.</Paragraph> <Paragraph position="3"> One problem with doing this is that in cases in which the TCM is quite high in the hierarchy, for example, at the entity class, the probability of any ns occurring under this nc′ on the TCM will be the same, which does not allow us to discriminate among senses beneath this level.</Paragraph> <Paragraph position="4"> For the WSD task, we compare the probability estimates at each nc ∈ C_n.</Paragraph> <Paragraph position="6"> Where a noun belongs to several synsets, we compare the probability estimates, given the context, of these synsets. We obtain estimates for each nc by using the probability of the hypernym class nc′ above it on the cut; the probability at nc′ will necessarily total the probability at all hyponyms, since the frequency credit of hyponyms is propagated to hypernyms.</Paragraph> <Paragraph position="7"> Thus, to disambiguate a noun occurring in a given relationship with a given verb, the nc ∈ C_n that gives the largest estimate for p(nc|vc, gr) is taken, where the verb class (vc) is that which maximizes this estimate from C_v. The TCM acquired for each vc of the verb in the given gr provides an estimate for p(nc′|vc, gr), and the estimate for nc is obtained as in equation (16).</Paragraph> <Paragraph position="8"> For example, one target noun was letter, which occurred as the direct object of sign in our parses of the SENSEVAL-2 data. The TCM that maximized the probability estimate for p(nc|vc, direct object) is shown in Figure 5. The noun letter is disambiguated by comparing the probability estimates on the TCM above the five senses of letter, each multiplied by the proportion of that probability mass attributed to that synset. Although entity has a higher probability on the TCM than matter, matter is above the sense of letter that obtains the largest estimate given the verb class (with maximum probability) and grammatical context. This is the highest probability for any of the synsets of letter, and so in this case the correct sense is selected.</Paragraph> </Section> <Section position="2" start_page="647" end_page="647" type="sub_section"> <SectionTitle> 4.2 Disambiguating Verbs and Adjectives </SectionTitle> <Paragraph position="0"> Verbs and adjectives are disambiguated using TCMs to give estimates for p(nc|vc, gr) and p(nc|ac, gr), respectively.
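The noun-sense selection of Section 4.1 can be sketched in code. The following is an illustrative sketch, not the authors' implementation: the class names, the uniform split of a cut class's mass among its descendant senses, and the toy probabilities are all assumptions, loosely modelled on the letter/sign example.

```python
# Sketch (not the authors' code): choosing a noun sense from a TCM.
# A TCM assigns probabilities to classes on a cut through WordNet; the
# estimate for a sense nc below the cut is the probability of the cut
# class above it, shared among that cut class's descendant senses.

def estimate(nc, tcm, cut_class_of, n_descendants):
    """Estimate p(nc | vc, gr) from the TCM for the chosen verb class."""
    nc_prime = cut_class_of[nc]                  # hypernym of nc on the cut
    return tcm[nc_prime] / n_descendants[nc_prime]

def disambiguate_noun(senses, tcm, cut_class_of, n_descendants):
    """Pick the synset of the target noun with the largest estimate."""
    return max(senses, key=lambda nc: estimate(nc, tcm, cut_class_of, n_descendants))

# Toy numbers loosely modelled on the letter/sign example: entity has more
# mass than matter on the TCM, but entity's mass is spread over far more
# descendant senses, so the sense under matter wins.
tcm = {"entity": 0.4, "matter": 0.2}
cut_class_of = {"letter_message": "matter", "letter_symbol": "entity"}
n_descendants = {"entity": 1000, "matter": 50}
best = disambiguate_noun(["letter_message", "letter_symbol"],
                         tcm, cut_class_of, n_descendants)
print(best)  # letter_message
```

The dominance of a high cut class such as entity is thus offset by dividing its mass among its many descendants, which is exactly why letter can still resolve to the sense under matter.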
These are combined with prior estimates for p(nc|gr) and p(vc|gr) (or p(ac|gr)) using Bayes' rule to give: p(vc|nc, gr) = p(nc|vc, gr) × p(vc|gr) / p(nc|gr), and analogously for p(ac|nc, gr).</Paragraph> <Paragraph position="2"> The prior distributions for p(nc|gr), p(vc|gr), and p(ac|adjnoun) are obtained during the training phase. For the prior distribution over NC, the frequency credit of each noun in the specified gr in the training data is divided by |C_n|. The frequency credit attached to a hyponym is propagated to the superordinate hypernyms, and the frequency of a hypernym (nc′) totals the frequency at its hyponyms: freq(nc′, gr) = Σ_{nc ∈ hyponyms(nc′)} freq(nc, gr).</Paragraph> <Paragraph position="4"> The distribution over VC is obtained similarly using the troponym relation. For the distribution over AC, the frequency credit for each adjective is divided by the number of synsets to which the adjective belongs, and the credit for an ac is the sum over all the synsets that are members by virtue of the similar-to WordNet link.</Paragraph> <Paragraph position="5"> To disambiguate a verb occurring with a given noun, the vc from C_v that gives the largest estimate for p(vc|nc, gr) is taken. The nc for the co-occurring noun is the nc from C_n that maximizes this estimate. The estimate for p(nc|vc, gr) is taken as in equation (16), but selecting the vc to maximize the estimate for p(vc|nc, gr) rather than p(nc|vc, gr). An adjective is likewise disambiguated to the ac from all those to which the adjective belongs, using the estimate for p(nc|ac, gr) and selecting the nc that maximizes the p(ac|nc, gr) estimate.</Paragraph> </Section> <Section position="3" start_page="647" end_page="648" type="sub_section"> <SectionTitle> 4.3 Increasing Coverage: One Sense per Discourse </SectionTitle> <Paragraph position="0"> There is a significant limitation on the word tokens that can be disambiguated using selectional preferences, in that they are restricted to those that occur in the specified grammatical relations and in argument head position.
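The Bayes-rule inversion of Section 4.2 can be sketched as follows. This is an illustrative sketch, not the authors' code: the class names and toy probabilities are invented, and the TCM estimates and priors are assumed to have been precomputed as described above.

```python
# Sketch (illustrative): disambiguating a verb given a co-occurring noun via
#   p(vc | nc, gr) = p(nc | vc, gr) * p(vc | gr) / p(nc | gr)
# where p(nc | vc, gr) comes from the TCM for vc, and the priors come from
# frequency credit propagated up the hyponym/troponym hierarchies.

def p_vc_given_nc(vc, nc, gr, tcm_estimate, prior_vc, prior_nc):
    """Invert the TCM estimate with Bayes' rule."""
    return tcm_estimate[(nc, vc, gr)] * prior_vc[(vc, gr)] / prior_nc[(nc, gr)]

def disambiguate_verb(verb_classes, nc, gr, tcm_estimate, prior_vc, prior_nc):
    """Pick the verb class with the largest posterior estimate."""
    return max(verb_classes,
               key=lambda vc: p_vc_given_nc(vc, nc, gr, tcm_estimate,
                                            prior_vc, prior_nc))

# Toy distributions: two candidate verb classes for one noun class.
tcm_estimate = {("letter", "write_vc", "dobj"): 0.03,
                ("letter", "compose_music_vc", "dobj"): 0.01}
prior_vc = {("write_vc", "dobj"): 0.02, ("compose_music_vc", "dobj"): 0.005}
prior_nc = {("letter", "dobj"): 0.004}
chosen = disambiguate_verb(["write_vc", "compose_music_vc"], "letter", "dobj",
                           tcm_estimate, prior_vc, prior_nc)
print(chosen)  # write_vc
```

Note that p(nc|gr) is constant across the candidate verb classes, so it does not affect the argmax; it is retained here only to keep the estimate a proper posterior.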
Moreover, we have TCMs only for adjective and verb classes in which at least one adjective or verb member met our criteria for training (having no more than a threshold of 10 senses in WordNet and a frequency of 20 or more occurrences in the BNC data in the specified grammatical relationship). We chose not to apply TCMs for disambiguation where TCMs were missing for one or more classes of the verb or adjective. To increase coverage, we experimented with applying the one-sense-per-discourse (OSPD) heuristic (Gale, Church, and Yarowsky 1992). With this heuristic, a sense tag for a given word is propagated to other occurrences of the same word within the current document. When applying the OSPD heuristic, we simply applied a tag for a noun, verb, or adjective to all the other instances of the same word type with the same part of speech in the discourse, provided that only one possible tag for that word was supplied by the selectional preferences for that discourse.</Paragraph> </Section> </Section> <Section position="6" start_page="648" end_page="649" type="metho"> <SectionTitle> 5. Evaluation </SectionTitle> <Paragraph position="0"> We evaluated our system using the SENSEVAL-2 test corpus on the English all-words task (Cotton et al. 2001). We entered a previous version of this system for the SENSEVAL-2 exercise, in three variants, under the names &quot;sussex-sel&quot; (selectional preferences), &quot;sussex-sel-ospd&quot; (with the OSPD heuristic), and &quot;sussex-sel-ospd-ana&quot; (with anaphora resolution).</Paragraph> <Paragraph position="1"> For SENSEVAL-2 we used only the direct object and subject slots, since we had not yet dealt with adjectives. In Figure 6 we show how our system fared at the time of SENSEVAL-2 compared to other unsupervised systems.
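The OSPD propagation described in Section 4.3 can be sketched as below. This is an assumed implementation with an invented token representation and sense labels; it shows the constraint that a tag is propagated only when the preferences supply a single sense for a word type in the discourse.

```python
# Sketch (assumed implementation) of one-sense-per-discourse propagation:
# a tag assigned by the selectional preferences is copied to every other
# token of the same lemma/part-of-speech pair in the document, but only
# when the preferences supplied exactly one sense for that pair.
from collections import defaultdict

def apply_ospd(tokens):
    """tokens: list of dicts with 'lemma', 'pos', and optional 'sense'."""
    senses = defaultdict(set)
    for tok in tokens:
        if tok.get("sense"):
            senses[(tok["lemma"], tok["pos"])].add(tok["sense"])
    for tok in tokens:
        key = (tok["lemma"], tok["pos"])
        # propagate only when evidence in the discourse is unambiguous
        if not tok.get("sense") and len(senses[key]) == 1:
            tok["sense"] = next(iter(senses[key]))
    return tokens

doc = [{"lemma": "letter", "pos": "n", "sense": "letter%message"},
       {"lemma": "letter", "pos": "n"},      # gets the tag propagated
       {"lemma": "sign", "pos": "v", "sense": "sign%write"},
       {"lemma": "sign", "pos": "v", "sense": "sign%gesture"},
       {"lemma": "sign", "pos": "v"}]        # conflicting evidence: untouched
apply_ospd(doc)
print(doc[1]["sense"], doc[4].get("sense"))  # letter%message None
```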
We have also plotted the results of the supervised systems and the precision and recall achieved by using the most frequent sense (as listed in WordNet).</Paragraph> <Paragraph position="2"> In the work reported here, we attempted disambiguation for head nouns and verbs in subject and direct object relationships, and for adjectives and nouns in adjective-noun relationships. For each test instance, we applied subject preferences before direct object preferences, and direct object preferences before adjective-noun preferences. We also propagated sense tags to test instances not in these relationships by applying the one-sense-per-discourse heuristic.</Paragraph> <Paragraph position="3"> We did not use the SENSEVAL-2 coarse-grained classification, as this was not available at the time when we were acquiring the selectional preferences. We therefore do not include the coarse-grained results in the following; they are just slightly better than the fine-grained results, which seems to be typical of other systems. Our latest overall results are shown in Table 1. In this table we show the results both with and without the OSPD heuristic. [Table 1, partially recovered: nouns, verbs, and adjectives 51.1, 44.9; polysemous nouns, verbs, and adjectives 36.8, 27.3.] The results for the English SENSEVAL-2 tasks were generally much lower than those for the original SENSEVAL competition. At the time of the SENSEVAL-2 workshop, this was assumed to be due largely to the use of WordNet as the inventory, as opposed to HECTOR (Atkins 1993), but Palmer, Trang Dang, and Fellbaum (forthcoming) have subsequently shown that, at least for the lexical sample tasks, this was due to a harder selection of words, with a higher average level of polysemy. For three of the most polysemous verbs that overlapped between the English lexical sample for SENSEVAL and SENSEVAL-2, the performance was comparable. Table 2 shows our precision results including use of the OSPD heuristic, broken down by part of speech.
Although the precision for nouns is greater than that for verbs, the difference is much less when we remove the trivial monosemous cases. Nouns, verbs, and adjectives all outperform their random baseline for precision, and the difference is more marked when monosemous instances are dropped.</Paragraph> <Paragraph position="4"> Table 3 shows the precision results for polysemous words given the slot and the disambiguation source. [Table 3 row, partially recovered: polysemous nouns, verbs, and adjectives 33.4, 36.0, 31.6, 44.8.] Overall, once at least one word token has been disambiguated by the preferences, the OSPD heuristic seems to perform better than the selectional preferences. We can see, however, that although this is certainly true for the nouns, the difference for the adjectives (1.3%) is less marked, and the preferences outperform OSPD for the verbs. It seems that verbs obey the OSPD principle much less than nouns. Also, verbs are best disambiguated by their direct objects, whereas nouns appear to be better disambiguated as subjects and when modified by adjectives.</Paragraph> </Section> <Section position="7" start_page="649" end_page="652" type="metho"> <SectionTitle> 6. Discussion </SectionTitle> <Paragraph position="0"/> <Section position="1" start_page="649" end_page="652" type="sub_section"> <SectionTitle> 6.1 Selectional Preferences </SectionTitle> <Paragraph position="0"> The precision of our system compares well with that of other unsupervised systems on the SENSEVAL-2 English all-words task, despite the fact that these other systems use a number of different sources of information for disambiguation, rather than selectional preferences in isolation. Light and Greiff (2002) summarize some earlier WSD results for automatically acquired selectional preferences.
These results were obtained for three systems (Resnik 1997; Abney and Light 1999; Ciaramita and Johnson 2000) on a training and test data set constructed by Resnik containing nouns occurring as direct objects of 100 verbs that select strongly for their objects.</Paragraph> <Paragraph position="1"> Both the test and training sets were extracted from the section of the Brown corpus within the Penn Treebank and used the treebank parses. The test set comprised the portion of this data within SemCor containing these 100 verbs, and the training set comprised 800,000 words from the Penn Treebank parses of the Brown corpus not within SemCor. All three systems obtained higher precision than the results we report here, with Ciaramita and Johnson's Bayesian belief networks achieving the best accuracy at 51.4%. These results are not comparable with ours, however, for three reasons. First, our results for the direct-object slot are for all verbs in the English all-words task, as opposed to just those selecting strongly for their direct objects. We would expect that WSD results using selectional preferences would be better for the latter class of verbs. Second, we do not use manually produced parses, but the output from our fully automatic shallow parser. Third and finally, the baselines reported for Resnik's test set were higher than those for the all-words task. For Resnik's test data, the random baseline was 28.5%, whereas for the polysemous nouns in the direct-object relation on the all-words task, it was 23.9%. The distribution of senses was also perhaps more skewed for Resnik's test set, since the first-sense heuristic was 82.8% (Abney and Light 1999), whereas it was 53.6% for the polysemous direct objects in the all-words task.
Although our results do show that the precision for the TCMs compares favorably with that of other unsupervised systems on the English all-words task, it would be worthwhile to compare other selectional preference models on the same data.</Paragraph> <Paragraph position="2"> Although the accuracy of our system is encouraging given that it does not use hand-tagged data, the results are below the level of state-of-the-art supervised systems. Indeed, a system just assigning to each word its most frequent sense as listed in WordNet (the &quot;first-sense heuristic&quot;) would do better than our preference models (and in fact better than the majority of the SENSEVAL-2 English all-words supervised systems). The first-sense heuristic, however, assumes the existence of sense-tagged data that are able to give a definitive first sense. We do not use any first-sense information.</Paragraph> <Paragraph position="3"> Although a modest amount of sense-tagged data is available for English (Miller et al. 1993; Ng and Lee 1996), for other languages with minimal sense-tagged resources, the heuristic is not applicable. Moreover, for some words the predominant sense varies depending on the domain and text type.</Paragraph> <Paragraph position="4"> To quantify this, we carried out an analysis of the polysemous nouns, verbs, and adjectives in SemCor occurring in more than one SemCor file and found that a large proportion of words have a different first sense in different files and also in different genres (see Table 4). For adjectives there seems to be much less ambiguity (this has also been noted by Krovetz [1998]); the data in SENSEVAL-2 bear this out, with many adjectives occurring only in their first sense.
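The SemCor analysis described above can be sketched as follows. This is an assumed counting scheme, not the authors' code; the file identifiers and sense labels are invented.

```python
# Sketch (assumed counting scheme): finding lemmas whose predominant sense
# differs across files, in the spirit of the SemCor analysis behind Table 4.
# Input: (file_id, lemma, sense) tuples for sense-tagged tokens.
from collections import Counter, defaultdict

def varying_lemmas(tagged_tokens):
    """Return lemmas whose most frequent sense differs between files."""
    by_file = defaultdict(Counter)
    for file_id, lemma, sense in tagged_tokens:
        by_file[(lemma, file_id)][sense] += 1
    first_senses = defaultdict(set)
    for (lemma, _file_id), counts in by_file.items():
        first_senses[lemma].add(counts.most_common(1)[0][0])
    # a lemma "varies" if different files yield different predominant senses
    return {lemma for lemma, s in first_senses.items() if len(s) > 1}

tokens = [("br-a01", "bank", "bank%finance"), ("br-a01", "bank", "bank%finance"),
          ("br-j03", "bank", "bank%river"),
          ("br-a01", "team", "team%group"), ("br-j03", "team", "team%group")]
print(varying_lemmas(tokens))  # {'bank'}
```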
For nouns and verbs, for which the predominant sense is more likely to vary among texts, it would be worthwhile to try to detect words for which using the predominant sense is not a reliable strategy, for example, because the word shows &quot;bursty&quot; topic-related behavior.</Paragraph> <Paragraph position="5"> We therefore examined our disambiguation results to see if there was any pattern in the predicates or arguments that were easily disambiguated themselves or were good disambiguators of the co-occurring word. No particular patterns were evident in this respect, perhaps because of the small size of the test data. There were nouns such as team (precision = ) that did better than average, but whether or not they did better than the first-sense heuristic depends, of course, on the sense in which they are used. For example, all 10 occurrences of cancer are in the first sense, so the first-sense heuristic is impossible to beat in this case. For the test items that are not in their first sense, we beat the first-sense heuristic, but on the other hand, we failed to beat the random baseline. (The random baseline is 21.8%, and we obtained 21.4% for these items overall.) Our performance on these items is low, probably because they are lower-frequency senses for which there is less evidence in the untagged training corpus (the BNC). We believe that selectional preferences would perform best if they were acquired from training data similar to that for which disambiguation is required. In the future, we plan to investigate our models for WSD in specific domains, such as sport and finance.
The senses and frequency distribution of senses for a given domain will in general be quite different from those in a balanced corpus.</Paragraph> <Paragraph position="6"> There are individual words that are not used in the first sense on which our TCM preferences do well, for example, sound (precision = ), but there are not enough data to isolate predicates or arguments that are good disambiguators from those that are not. We intend to investigate this issue further with the SENSEVAL-2 lexical sample data, which contains more instances of a smaller number of words.</Paragraph> <Paragraph position="7"> Performance of selectional preferences depends not just on the actual word being disambiguated, but also on the cohesiveness of the tuple <pred, arg, gr>. We have therefore investigated applying a threshold on the probability of the class (nc, vc, or ac) before disambiguation. Figure 7 presents a graph of precision against the threshold applied to the probability estimate for the highest-scoring class. We show alongside this the random baseline and the first-sense heuristic for these items. Selectional preferences appear to do better on items for which the probability predicted by our model is higher, but the first-sense heuristic does even better on these. The first-sense heuristic, with respect to SemCor, outperforms the selectional preferences when it is averaged over a given text. That seems to be the case overall, but there will be some words and texts for which the first sense from SemCor is not relevant; use of a threshold on probability, and perhaps of a differential between the probabilities of the top-ranked senses suggested by the model, should increase precision.</Paragraph> </Section> <Section position="2" start_page="652" end_page="652" type="sub_section"> <SectionTitle> 6.2 The OSPD Heuristic </SectionTitle> <Paragraph position="0"> In these experiments we applied the OSPD heuristic to increase coverage.
One problem in doing this when using a fine-grained classification like WordNet is that although the OSPD heuristic works well for homonyms, it is less accurate for related senses (Krovetz 1998), and this distinction is not made in WordNet. We did, however, find that in SemCor, for the majority of polysemous lemma and file combinations, there was only one sense exhibited (see Table 5). We refrained from using the OSPD in situations in which there was conflicting evidence regarding the appropriate sense for a word type occurring more than once in an individual file. In our experiments the OSPD heuristic increased coverage by 7% and recall by 3%, at a cost of only a 1% decrease in precision.</Paragraph> </Section> </Section> </Paper>