<?xml version="1.0" standalone="yes"?> <Paper uid="P98-2228"> <Title>Word Sense Disambiguation using Optimised Combinations of Knowledge Sources</Title> <Section position="4" start_page="0" end_page="1398" type="metho"> <SectionTitle> 2 Knowledge Sources and Word Sense Disambiguation </SectionTitle> <Paragraph position="0"> One further issue must be mentioned, because it is unique to WSD as a task and is at the core of our approach. Unlike other well-known NLP modules, WSD seems to be implementable by a number of apparently different information sources. All of the following have, at various times, been implemented as the basis of experimental WSD: part-of-speech, semantic preferences, collocating items or classes, thesaural or subject areas, dictionary definitions and synonym lists, among others (such as bilingual equivalents in parallel texts). These phenomena seem different, so how can they all be, separately or in combination, informational clues to a single phenomenon, WSD? This is a situation quite unlike syntactic parsing or part-of-speech tagging: in the latter case, for example, one can write a Cherry-style rule tagger or an HMM learning model, but there is no reason to believe these represent different types of information, just different ways of conceptualising and coding it. That seems not to be the case, at first sight, with the many forms of information for WSD.</Paragraph> <Paragraph position="1"> It is odd that this has not been much discussed in the field.</Paragraph> <Paragraph position="2"> In this work, we shall adopt the methodology first explicitly noted in connection with WSD by (McRoy, 1992) and, more recently, (Ng and Lee, 1996), namely that of bringing together a number of partial sources of information about a phenomenon and combining them in a principled manner. This is in the AI tradition of combining "weak" methods for strong results (usually ascribed to Newell (Newell, 1973)) and was used in the CRL-NMSU lexical work of the Eighties (Wilks et al., 1990). We shall, in this paper, offer a system that combines the three types of information listed above (plus part-of-speech filtering) and, more importantly, applies a learning algorithm to determine the optimal combination of such modules for a given word distribution; it is obvious, for example, that thesaural methods work better for nouns than for verbs, and so on.</Paragraph> </Section> <Section position="5" start_page="1398" end_page="1399" type="metho"> <SectionTitle> 3 The Sense Tagger </SectionTitle> <Paragraph position="0"> We describe a system which is designed to assign sense tags from a lexicon to general text. We use the Longman Dictionary of Contemporary English (LDOCE) (Procter, 1978), which contains two levels of sense distinction: the broad homograph level and a more fine-grained level of individual senses.</Paragraph> <Paragraph position="1"> Our tagger makes use of several modules which perform disambiguation, and these are of two types: filters and partial taggers. A filter removes senses from consideration, thereby reducing the complexity of the disambiguation task. Each partial tagger makes use of a different knowledge source from the lexicon and uses it to suggest a set of possible senses for each ambiguous word in context. None of these modules performs the disambiguation alone; instead, their results are combined so that all of them are taken into account.</Paragraph>
<Section position="1" start_page="1398" end_page="1398" type="sub_section"> <SectionTitle> 3.1 Preprocessing </SectionTitle> <Paragraph position="0"> Before the filters or partial taggers are applied, the text is tokenised, lemmatised, split into sentences and part-of-speech tagged using the Brill part-of-speech tagger (Brill, 1992).</Paragraph> <Paragraph position="1"> Our system disambiguates only the content words in the text; the part-of-speech tags assigned by Brill's tagger are used to decide which words are content words. We define content words as nouns, verbs, adjectives and adverbs; prepositions are not included in this class.</Paragraph> </Section> <Section position="2" start_page="1398" end_page="1398" type="sub_section"> <SectionTitle> 3.2 Part-of-speech </SectionTitle> <Paragraph position="0"> Previous work (Wilks and Stevenson, 1998) showed that part-of-speech tags can play an important role in the disambiguation of word senses. A small experiment was carried out on a 1,700-word corpus taken from the Wall Street Journal and, using only part-of-speech tags, an attempt was made to find the correct LDOCE homograph for each of the content words in the corpus. The text was part-of-speech tagged using Brill's tagger, and homographs whose part-of-speech category did not agree with the tags assigned by Brill's system were removed from consideration.</Paragraph> <Paragraph position="1"> The most frequently occurring of the remaining homographs was chosen as the sense of each word. We found that 92% of content words were assigned the correct homograph when compared with manual disambiguation of the same texts.</Paragraph> <Paragraph position="2"> While this method will not help us disambiguate within a homograph, since all senses which combine to form an LDOCE homograph have the same part-of-speech, it will help us to identify the senses which are completely inappropriate for a given context (those whose homograph's part-of-speech disagrees with that assigned by the tagger).</Paragraph> <Paragraph position="3"> It could reasonably be argued that this is a dangerous strategy since, if the part-of-speech tagger made an error, the correct sense could be removed from consideration. As a precaution against this, we have designed our system so that if none of the dictionary senses for a given word agrees with the part-of-speech tag then they are all kept (none is removed from consideration).</Paragraph> <Paragraph position="4"> There is also good evidence from our earlier WSD system (Wilks and Stevenson, 1997) that this approach works well despite part-of-speech tagging errors: that system's results improved by 14% when this strategy was used, achieving 88% correct disambiguation to the LDOCE homograph compared with 74% without it.</Paragraph> </Section> <Section position="3" start_page="1398" end_page="1399" type="sub_section"> <SectionTitle> 3.3 Dictionary Definitions </SectionTitle> <Paragraph position="0"> (Cowie et al., 1992) used simulated annealing to optimise the choice of senses for a text, based upon their textual definitions in a dictionary. The optimisation was over a simple count of words shared between definitions; however, this meant that longer definitions were preferred over shorter ones, since they contain more words which can contribute to the overlap, while short definitions or definitions by synonym were correspondingly penalised.</Paragraph> <Paragraph position="1"> We attempted to solve this problem as follows: instead of each word contributing one to the overlap count, its contribution is normalised by the number of words in the definition it came from. The Cowie et al. implementation returned one sense for each ambiguous word in the sentence, without any indication of the system's confidence in its choice; we have adapted the system to return a set of suggested senses for each ambiguous word in the sentence. We found that the new evaluation function led to an improvement in the algorithm's effectiveness.</Paragraph> </Section>
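<Paragraph> As an illustration of the normalised overlap measure (our own sketch, not the original implementation), the function below scores one candidate assignment of senses to the ambiguous words of a sentence; each definition token contributes 1/|definition| rather than 1 when it also occurs in another chosen definition, and the simulated annealing search would then seek the assignment that maximises this score. </Paragraph> <Paragraph>
from typing import List

def overlap_score(chosen_definitions: List[List[str]]) -> float:
    """Score a candidate sense assignment for one sentence.

    chosen_definitions holds, for each ambiguous word, the dictionary
    definition (as a list of tokens) of the sense currently chosen for it.
    A token contributes 1/len(definition) instead of 1, so long definitions
    no longer dominate the overlap count.
    """
    score = 0.0
    for i, definition in enumerate(chosen_definitions):
        other_tokens = set()
        for j, other in enumerate(chosen_definitions):
            if j != i:
                other_tokens.update(other)
        for token in definition:
            if token in other_tokens:
                score += 1.0 / len(definition)
    return score
</Paragraph>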
<Section position="4" start_page="1399" end_page="1399" type="sub_section"> <SectionTitle> 3.4 Pragmatic Codes </SectionTitle> <Paragraph position="0"> Our next partial tagger makes use of the hierarchy of LDOCE pragmatic codes, which indicate the likely subject area for a sense. Disambiguation is carried out using a modified version of the simulated annealing algorithm which attempts to optimise the number of pragmatic codes of the same type in the sentence. Rather than processing single sentences, we optimise over entire paragraphs, and only for the senses of nouns. We chose this strategy because there is good evidence (Gale et al., 1992) that nouns are best disambiguated by broad contextual considerations, while other parts of speech are resolved by more local factors.</Paragraph> </Section> <Section position="5" start_page="1399" end_page="1399" type="sub_section"> <SectionTitle> 3.5 Selectional Restrictions </SectionTitle> <Paragraph position="0"> LDOCE senses contain simple selectional restrictions for each content word in the dictionary. A set of 35 semantic classes is used, such as H = Human, M = Human male, P = Plant, S = Solid and so on. Each noun sense is given one of these semantic types, senses for adjectives list the type which they expect for the noun they modify, senses for adverbs list the type they expect of their modifier, and verbs list between one and three types (depending on their transitivity) which are the expected semantic types of the verb's subject, direct object and indirect object. Grammatical links between verbs, adjectives and adverbs and the head nouns of their arguments are identified using a specially constructed shallow syntactic analyser (Stevenson, 1998).</Paragraph> <Paragraph position="1"> The semantic classes in LDOCE are not arranged in a hierarchy, but Bruce and Guthrie (Bruce and Guthrie, 1992) manually identified hierarchical relations between the semantic classes and constructed them into a hierarchy, which we use to resolve the restrictions. We resolve the restrictions by returning, for each word, the set of senses which do not break them (that is, those whose semantic category is at the same, or a lower, level in the hierarchy than the restriction).</Paragraph> </Section>
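<Paragraph> As a rough sketch of how the restrictions are resolved (the class labels and parent links below are invented for illustration and do not reproduce the actual 35 LDOCE codes or the Bruce and Guthrie hierarchy), a sense satisfies a restriction if its semantic class is the restricted class itself or a descendant of it in the hierarchy: </Paragraph> <Paragraph>
# child -> parent links in a toy semantic-class hierarchy (illustrative only)
PARENT = {
    "HumanMale": "Human",
    "Human": "Animate",
    "Plant": "Animate",
    "Animate": "Entity",
    "Solid": "Entity",
}

def satisfies(sense_class: str, restriction: str) -> bool:
    """True if sense_class is the restriction or one of its descendants."""
    current = sense_class
    while current is not None:
        if current == restriction:
            return True
        current = PARENT.get(current)
    return False

def allowed_senses(senses, restriction):
    """Keep the senses whose semantic class does not break the restriction.

    Each sense is assumed to be a dict with a "class" entry, e.g.
    {"class": "HumanMale"}; satisfies("HumanMale", "Human") is True.
    """
    return [s for s in senses if satisfies(s["class"], restriction)]
</Paragraph>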
</Section> <Section position="6" start_page="1399" end_page="1399" type="metho"> <SectionTitle> 4 Combining Knowledge Sources </SectionTitle> <Paragraph position="0"> Since each of our partial taggers suggests only possible senses for each word, it is necessary to have some method of combining their results. We trained decision lists (Clark and Niblett, 1989) using a supervised learning approach. Decision lists have already been successfully applied to lexical ambiguity resolution by (Yarowsky, 1995), where they performed well.</Paragraph> <Paragraph position="1"> We present the decision list system with a number of training words for which the correct sense is known. For each of these words we supply each of its possible senses (apart from those removed from consideration by the part-of-speech filter, Section 3.2) within a context consisting of the results from each of the partial taggers, frequency information and 10 simple collocations (the first noun/verb/preposition to the left/right and the first/second word to the left/right). Each sense is marked as either appropriate (if it is the correct sense given the context) or inappropriate. A learning algorithm infers a decision list which classifies senses as appropriate or inappropriate in context. The partial taggers and filters can then be run over new text and the decision list applied to the results, so as to identify the appropriate senses for words in novel contexts.</Paragraph> <Paragraph position="2"> Although the decision lists are trained on a fixed vocabulary of words, this does not limit the lists produced to those words, and our system can assign a sense to any word provided it has a definition in LDOCE. The decision list produced consists of rules such as "if the part-of-speech is noun and the pragmatic codes partial tagger returned a confident value for that word, then that sense is appropriate for the context".</Paragraph> </Section> <Section position="7" start_page="1399" end_page="1400" type="metho"> <SectionTitle> 5 Producing an Evaluation Corpus </SectionTitle> <Paragraph position="0"> Rather than expend a vast amount of effort on manual tagging, we decided to adapt two existing resources to our purposes. We took SEMCOR, a 200,000 word corpus whose content words were manually tagged as part of the WordNet project. The semantic tagging was carried out under disciplined conditions using trained lexicographers, with tagging inconsistencies between manual annotators controlled. SENSUS (Knight and Luk, 1994) is a large-scale ontology designed for machine translation and was produced by merging the ontological hierarchies of WordNet and LDOCE (Bruce and Guthrie, 1992). To facilitate this merging it was necessary to derive a mapping between the senses in the two lexical resources, and we used this mapping to translate the WordNet-tagged content words in SEMCOR to LDOCE tags.</Paragraph> <Paragraph position="1"> The mapping is not one-to-one: some WordNet senses are mapped onto two or three LDOCE senses when the WordNet sense does not distinguish between them, and the mapping also contains significant gaps (words and senses with no translation). SEMCOR contains 91,808 words tagged with WordNet synsets, 6,071 of which are proper names which we ignore, leaving 85,737 words which could potentially be translated. The translation contains only 36,869 words tagged with LDOCE senses, although this is a reasonable size for an evaluation corpus for this type of task (it is several orders of magnitude larger than the corpora used by (Cowie et al., 1992), (Harley and Glennon, 1997) and (Mahesh et al., 1997)).</Paragraph> <Paragraph position="2"> This corpus was also constructed without the excessive cost of additional hand-tagging and does not introduce the inconsistencies which may occur with a poorly controlled tagging strategy.</Paragraph> </Section>
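<Paragraph> A minimal sketch of the translation step (ours, for illustration only; the sense identifiers, data layout and function name are assumptions rather than the project's actual formats): each WordNet-tagged token is looked up in the WordNet-to-LDOCE mapping, may receive several LDOCE senses where the mapping is not one-to-one, and is dropped where the mapping has a gap. </Paragraph> <Paragraph>
from typing import Dict, List, Tuple

def translate_corpus(tagged_words: List[Tuple[str, str]],
                     mapping: Dict[str, List[str]]) -> List[Tuple[str, List[str]]]:
    """Turn (word, wordnet_sense) pairs into (word, ldoce_senses) pairs.

    A WordNet sense may map to two or three LDOCE senses; words whose
    sense is missing from the mapping are dropped, which is why the
    translated corpus (36,869 words) is smaller than the 85,737
    potentially translatable words.
    """
    translated = []
    for word, wn_sense in tagged_words:
        ldoce_senses = mapping.get(wn_sense)
        if ldoce_senses:            # skip gaps in the mapping
            translated.append((word, ldoce_senses))
    return translated
</Paragraph> </Paper>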