<?xml version="1.0" standalone="yes"?> <Paper uid="P97-1007"> <Title>Combining Unsupervised Lexical Knowledge Methods for Word Sense Disambiguation *</Title> <Section position="3" start_page="0" end_page="48" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> While in English the &quot;lexical bottleneck&quot; problem (Briscoe, 1991) seems to have eased (e.g. WordNet (Miller, 1990), Alvey Lexicon (Grover et al., 1993), COMLEX (Grishman et al., 1994), etc.), no wide-range lexicons for natural language processing (NLP) are available for other languages. Manual construction is the most reliable technique for obtaining structured lexicons, but it is costly and highly time-consuming. For this reason, many researchers have focused on the massive acquisition of lexical knowledge and semantic information from pre-existing structured lexical resources as automatically as possible.</Paragraph> <Paragraph position="1"> *This research has been partially funded by CICYT TIC96-1243-C03-02 (ITEM project) and the European Commission LE-4003 (EuroWordNet project).</Paragraph> <Paragraph position="2"> As dictionaries are special texts whose subject matter is a language (or a pair of languages in the case of bilingual dictionaries), they provide a wide range of information about words by giving definitions of word senses and, in doing so, supply knowledge not just about language but about the world itself.</Paragraph> <Paragraph position="3"> One of the most important relations to be extracted from machine-readable dictionaries (MRDs) is the hyponym/hypernym relation among dictionary senses (e.g. 
(Amsler, 1981), (Vossen and Serail, 1990)), not only because of its own importance as the backbone of taxonomies, but also because this relation supports the main inheritance mechanisms, thus helping the acquisition of other relations and semantic features (Cohen and Loiselle, 1988), providing formal structure and avoiding redundancy in the lexicon (Briscoe et al., 1990). For instance, following the natural chain of dictionary senses described in the Diccionario General Ilustrado de la Lengua Española (DGILE, 1987), we can discover that a bonsai is a cultivated plant or bush.</Paragraph> <Paragraph position="4"> bonsai_1_2 planta y arbusto así cultivado.</Paragraph> <Paragraph position="5"> (bonsai, plant and bush cultivated in that way) The hyponym/hypernym relation appears between the entry word (e.g. bonsai) and the genus term, or the core of the phrase (e.g. planta and arbusto). Thus, a dictionary definition is usually written with a genus term combined with differentia that distinguish the word being defined from other words with the same genus term 1.</Paragraph> <Paragraph position="6"> As lexical ambiguity pervades language in texts, the words used in dictionaries are themselves lexically ambiguous. Thus, when constructing complete disambiguated taxonomies, the correct dictionary sense of the genus term must be selected in each dictionary definition, performing what is usually called Word Sense Disambiguation (WSD) 2. In the previous example planta has thirteen senses and arbusto only one.</Paragraph> <Paragraph position="7"> 1 For other kinds of definition patterns not based on genus, a genus-like term was added after studying those patterns.</Paragraph> <Paragraph position="8"> Although a large set of dictionaries has been exploited as lexical resources, the most widely used monolingual MRD for NLP is LDOCE, which was designed for learners of English. It is clear that different dictionaries do not contain the same explicit information. 
The information placed in LDOCE has made it easy to extract other implicit information, e.g. taxonomies (Bruce et al., 1992). Does this mean that only highly structured dictionaries like LDOCE are suitable to be exploited to provide lexical resources for NLP systems? We explored this question by probing two disparate dictionaries: Diccionario General Ilustrado de la Lengua Española (DGILE, 1987) for Spanish, and Le Plus Petit Larousse (LPPL, 1980) for French.</Paragraph> <Paragraph position="9"> Both are substantially poorer in coded information than LDOCE (LDOCE, 1987) 3. These dictionaries are very different in number of headwords, polysemy degree, size and length of definitions (cf. Table 1). While DGILE is a good example of a large-sized dictionary, LPPL shows to what extent the smallest dictionary is useful.</Paragraph> <Paragraph position="10"> Even if most WSD techniques are presented as stand-alone, it is our belief, following the ideas of (McRoy, 1992), that full-fledged lexical ambiguity resolution should combine several information sources and techniques. This work does not address all the heuristics cited in her paper, but profits from techniques that were at hand, without any claim of them being complete. In fact we use unsupervised techniques, i.e. those that do not require hand-coding of any kind, that draw knowledge from a variety of sources - the source dictionaries, bilingual dictionaries and WordNet - in diverse ways. by frequency, 86% of dictionary senses have semantic codes and 44% of dictionary senses have pragmatic codes.</Paragraph> <Paragraph position="11"> This paper tries to prove that, by using an appropriate method to combine those heuristics, we can disambiguate the genus terms with reasonable precision, and thus construct complete taxonomies from any conventional dictionary in any language.</Paragraph> <Paragraph position="12"> This paper is organized as follows. 
After this short introduction, Section 2 shows the methods we have applied. Section 3 describes the test sets and shows the results. Section 4 explains the construction of the lexical knowledge resources used. Section 5 discusses previous work, and finally, Section 6 draws some conclusions and comments on future work.</Paragraph> </Section></Paper>