<?xml version="1.0" standalone="yes"?> <Paper uid="H94-1046"> <Title>USING A SEMANTIC CONCORDANCE FOR SENSE IDENTIFICATION</Title> <Section position="3" start_page="0" end_page="0" type="intro"> <SectionTitle> 1. INTRODUCTION </SectionTitle> <Paragraph position="0"> It is generally recognized that systems for automatic sense identification should be evaluated against a null hypothesis. Gale, Church, and Yarowsky [1] suggest that the appropriate basis for comparison would be a system that assumes that each word is being used in its most frequently occurring sense. They review the literature on how well word-disambiguation programs perform; as a lower bound, they estimate that the most frequent sense of polysemous words would be correct 75% of the time, and they propose that any sense-identification system that does not give the correct sense of polysemous words more than 75% of the time would not be worth serious consideration.</Paragraph> <Paragraph position="1"> The value of setting such a lower bound is obvious. However, Gale, Church, and Yarowsky [1] do not make clear how they determined what the most frequently occurring senses are. In the absence of such information, a case can be made that the lower bound should be given by the proportion of monosemous words in the textual corpus.</Paragraph> <Paragraph position="2"> Although most words in a dictionary have only a single sense, it is the polysemous words that occur most frequently in speech and writing. This is true even when we ignore the small set of highly polysemous closed-class words (pronouns, prepositions, auxiliary verbs, etc.) that play such an important structural role. 
For example, 82.3% of the open-class words in WordNet [2] are monosemous, but only 27.2% of the open-class words in a sample of 103 passages from the Brown Corpus [3] were monosemous.</Paragraph> <Paragraph position="3"> * Hunter College and Graduate School of the City University of New York. That is to say, 27% of the time no decision would be needed, but for the remaining 73% of the open-class words, the response would have to be &quot;don't know.&quot; This is probably the lowest lower bound anyone would propose, although if the highly polysemous, very frequently used closed-class words were included, it would be even lower.</Paragraph> <Paragraph position="4"> A better performance figure would result, of course, if, instead of responding &quot;don't know,&quot; the system were to guess. What is the percentage correct that you could expect to obtain by guessing?</Paragraph> </Section> </Paper>