File Information
File: 05-lr/acl_arc_1_sum/cleansed_text/xml_by_section/intro/99/w99-0903_intro.xml
Size: 4,100 bytes
Last Modified: 2025-10-06 14:07:04
<?xml version="1.0" standalone="yes"?> <Paper uid="W99-0903"> <Title>Dual Distributional Verb Sense Disambiguation with Small Corpora and Machine Readable Dictionaries*</Title> <Section position="2" start_page="0" end_page="17" type="intro"> <SectionTitle> 1 Introduction </SectionTitle> <Paragraph position="0"> Much recent research in the field of natural language processing has focused on an empirical, corpus-based approach, and the high accuracy achieved by a corpus-based approach to part-of-speech tagging and parsing has inspired similar approaches to word sense disambiguation. For the most successful approaches to such problems, correctly annotated materials are crucial for training learning-based algorithms. Regardless of whether or not learning is involved, the prevailing evaluation methodology requires correct test sets in order to rigorously assess the quality of algorithms and compare their performance. This seems to require manual tagging of the training corpus with the appropriate sense for each occurrence of an ambiguous word.
However, in marked contrast to annotated training material for part-of-speech tagging, (a) there is no coarse-level set of sense distinctions widely agreed upon (whereas part-of-speech tag sets tend to differ in the details); (b) sense annotation has a comparatively high error rate (Miller, personal communication, reports an upper bound for human annotators of around 90% for ambiguous cases, using a non-blind evaluation method that may make even this estimate overly optimistic (Resnik, 1997)); (c) as a result, a sense-tagged corpus large enough to achieve broad-coverage, high-accuracy word sense disambiguation is not available at present. * This work was supported in part by KISTEP for Soft Science Research project.</Paragraph> <Paragraph position="1"> Table 1. Headword: open. Sense: open. Usage examples: Open the window a bit, please.</Paragraph> <Paragraph position="2"> He opened the door for me to come in.</Paragraph> <Paragraph position="3"> Open the box.</Paragraph> <Paragraph position="4"> Sense: start. Usage examples: Our chairman opened the conference by welcoming new delegates. Open a public meeting.</Paragraph> <Paragraph position="5"> This paper describes an unsupervised sense disambiguation system that uses a POS-tagged corpus and a machine-readable dictionary (MRD). The system we propose circumvents the need for a sense-tagged corpus by using the MRD's usage examples as sense-tagged examples. Because these usage examples illustrate natural uses of each sense of a headword, we can acquire useful disambiguation context from them. For example, the dictionary lists several senses of open, with usage examples for each sense, as shown in Table 1. The words window, door, box, conference, and meeting within the usage examples are useful context for disambiguating the senses of open.</Paragraph> <Paragraph position="6"> Another problem common to much corpus-based work is data sparseness, and the problem is especially severe for work in word sense disambiguation (WSD).
First, enormous amounts of text are required to ensure that all senses of a polysemous word are represented, given the vast disparity in frequency among senses. In addition, the many possible co-occurrences for a given polysemous word are unlikely to be found in even a very large corpus, or they occur too infrequently to be significant. In this paper, we propose two methods that attack the problem of data sparseness in WSD using a small corpus and a dictionary. First, extending word similarity measures from direct co-occurrences to co-occurrences of co-occurring words, we compute word similarities using not co-occurring words but co-occurring clusters. Second, we acquire IS-A relations of nouns from the MRD definitions. Dictionary definitions of nouns are normally written in such a way that one can identify, for each headword (the word being defined), a &quot;genus term&quot; (a word more general than the headword), and the two are related via an IS-A relation (Amsler, 1979). It is possible to cluster the nouns roughly by identifying these IS-A relationships.</Paragraph> </Section></Paper>
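The introduction's idea of treating dictionary usage examples as sense-tagged data can be illustrated with a minimal sketch. It is not the system the paper proposes: it is a simple word-overlap baseline, shown only to make concrete how the Table 1 examples for open yield context words such as window, door, and meeting. The stop list, the prefix test for inflected forms of the headword, and all function names are assumptions introduced here for illustration.

```python
import re

# Usage examples copied from the Table 1 entry for "open" quoted in the text.
USAGE_EXAMPLES = {
    "open": [
        "Open the window a bit, please.",
        "He opened the door for me to come in.",
        "Open the box.",
    ],
    "start": [
        "Our chairman opened the conference by welcoming new delegates.",
        "Open a public meeting.",
    ],
}

# Hypothetical, deliberately tiny stop list (an assumption, not from the paper).
STOP_WORDS = {"a", "the", "he", "for", "me", "to", "in", "our", "by", "please"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def context_words(sentence, headword):
    """Return content words, dropping stop words and forms of the headword.

    Inflected forms are detected with a crude prefix test (assumption).
    """
    return {
        tok
        for tok in tokenize(sentence)
        if tok not in STOP_WORDS and not tok.startswith(headword)
    }


def build_sense_contexts(examples, headword):
    """Map each sense to the set of context words seen in its usage examples."""
    return {
        sense: set().union(*(context_words(s, headword) for s in sentences))
        for sense, sentences in examples.items()
    }


def disambiguate(sentence, headword, sense_contexts):
    """Pick the sense whose example context overlaps most with the sentence.

    Returns None when no sense shares any context word with the sentence.
    """
    words = context_words(sentence, headword)
    scores = {sense: len(words & ctx) for sense, ctx in sense_contexts.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


if __name__ == "__main__":
    contexts = build_sense_contexts(USAGE_EXAMPLES, "open")
    print(disambiguate("She opened the meeting with a short speech.", "open", contexts))
    print(disambiguate("Please open the door.", "open", contexts))
```

A sentence that shares no word with any usage example gets no sense at all under exact matching. This is the data sparseness problem the introduction raises, and the reason the paper turns to cluster-based similarity and IS-A relations instead.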