<?xml version="1.0" standalone="yes"?>
<Paper uid="E91-1040">
<Title>AN ASSESSMENT OF SEMANTIC INFORMATION AUTOMATICALLY EXTRACTED FROM MACHINE READABLE DICTIONARIES</Title>
<Section position="4" start_page="0" end_page="0" type="intro">
<SectionTitle> 1. INTRODUCTION </SectionTitle>
<Paragraph position="0"> In recent years, it has become increasingly clear that the limited size of existing computational lexicons and the poverty of the semantic information they contain represent one of the primary bottlenecks in the development of realistic natural language processing (NLP) systems. The need for extensive lexical and semantic databases is evident in the recent initiation of a number of projects to construct massive generic lexicons for NLP (for example, the GENELEX project in Europe and the EDR project in Japan).</Paragraph>
<Paragraph position="1"> The manual construction of large lexical-semantic databases demands enormous human resources, and there is a growing body of research into the possibility of automatically extracting at least a part of the required lexical and semantic information from everyday dictionaries. Everyday dictionaries are obviously not structured in a way that enables their immediate use in NLP systems, but several studies have shown that relatively simple procedures can be used to extract taxonomies and various other semantic relations (for example, Amsler, 1980; Calzolari, 1984; Chodorow, Byrd, and Heidorn, 1985; Markowitz, Ahlswede, and Evens, 1986; Byrd et al., 1987; Nakamura and Nagao, 1988; Véronis and Ide, 1990; Klavans, Chodorow, and Wacholder, 1990; Wilks et al., 1990).</Paragraph>
<Paragraph position="2"> However, it remains to be seen whether information automatically extracted from dictionaries is sufficiently complete and coherent to be actually usable in NLP systems. Although there is concern over the quality of automatically extracted lexical information, very few empirical studies have attempted to assess it systematically, and those that have done so have been restricted to the quality of grammatical information (e.g., Akkerman, Masereeuw, and Meijs, 1985). No evaluation of automatically extracted semantic information has been published.</Paragraph>
<Paragraph position="3"> The authors would like to thank Lisa Lassck and Anne Gilman for their contribution to this work.</Paragraph>
<Paragraph position="4"> In this paper, we report the results of a quantitative evaluation of automatically extracted semantic data. Our results show that for any one dictionary, 55-70% of the extracted information is garbled in some way. These results at first call into question the validity of automatic extraction from dictionaries. However, in section 4 we show that this error rate can be dramatically reduced, to about 6%, by several means, most significantly by combining the information extracted from five dictionaries. It therefore appears that even if individual dictionaries are an unreliable source of semantic information, multiple dictionaries can play an important role in building large lexical-semantic databases.</Paragraph>
</Section>
</Paper>
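
A rough sketch of the kind of "relatively simple procedure" cited above, and of the combination idea taken up in section 4: the Python fragment below guesses a genus (hypernym) term for a headword from each dictionary definition with a crude head-noun heuristic, then keeps only the candidates on which at least two dictionaries agree. The dictionary names, definitions, word lists, and the helpers genus_term and combined_hypernyms are all invented for the example; the systems cited above relied on parsers or dictionary-specific defining patterns, and the paper's combination over five dictionaries operates on much richer extracted relations.

    # Illustrative sketch only; not the procedure used in this paper.
    import re
    from collections import Counter

    # Invented definitions of the headword "frigate" in four hypothetical
    # dictionaries; the five dictionaries used in the paper are not listed
    # in this introduction.
    DEFINITIONS = {
        "dict_A": "a fast warship of medium size",
        "dict_B": "a small warship that escorts merchant vessels",
        "dict_C": "a sailing warship of the past",
        "dict_D": "a three-masted vessel carrying guns",
    }

    DETERMINERS = {"a", "an", "the", "any", "some"}
    EMPTY_HEADS = {"kind", "type", "sort", "member", "one", "form"}
    BOUNDARIES = {"of", "that", "which", "who", "with", "used", "having", "for"}

    def genus_term(definition):
        """Crude stand-in for finding the head noun of the defining noun
        phrase: the word just before the first boundary word (or the last
        word if there is none), skipping determiners and empty heads such
        as 'kind of' or 'type of'."""
        tokens = [t for t in re.findall(r"[a-z]+", definition.lower())
                  if t not in DETERMINERS]
        head = tokens[-1]
        for i, tok in enumerate(tokens):
            if tok in BOUNDARIES and i > 0:
                if tokens[i - 1] in EMPTY_HEADS:
                    continue  # e.g. "a kind OF ship": keep scanning rightwards
                head = tokens[i - 1]
                break
        return head

    def combined_hypernyms(definitions, min_votes=2):
        """Toy version of the combination idea in section 4: keep only the
        hypernym candidates proposed by at least min_votes dictionaries."""
        votes = Counter(genus_term(d) for d in definitions.values())
        return {term for term, n in votes.items() if n >= min_votes}

    if __name__ == "__main__":
        for name, definition in DEFINITIONS.items():
            print(name, "->", genus_term(definition))
        # dict_A..C yield "warship"; dict_D is mis-extracted as "guns",
        # but the vote across dictionaries keeps only "warship".
        print("agreed hypernyms:", combined_hypernyms(DEFINITIONS))

The mis-extraction for dict_D is deliberate: it mirrors the observation that a single dictionary yields a substantial proportion of garbled relations, while agreement across dictionaries filters much of the noise out.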